Article

Exploring the Impact of Different Assistance Approaches on Students’ Performance in Engineering Lab Courses

1 Institute of Education, University College London, London WC1E 6BT, UK
2 Dewey Institute, Bracebridge, ON P1L 1E2, Canada
3 College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
* Author to whom correspondence should be addressed.
Educ. Sci. 2025, 15(11), 1443; https://doi.org/10.3390/educsci15111443
Submission received: 22 August 2025 / Revised: 6 October 2025 / Accepted: 22 October 2025 / Published: 28 October 2025
(This article belongs to the Special Issue ChatGPT as Educative and Pedagogical Tool: Perspectives and Prospects)

Abstract

The rise of large language models (LLMs) offers new forms of academic support for STEM students engaged in self-directed study. This study evaluates the impacts of multiple assistance approaches on laboratory course performance, focusing on engineering students in electronics-related disciplines. A cohort of 218 students underwent a redesigned lab course, and their outcomes were compared to those of 177 students from earlier years who did not receive such support. Specifically, we implemented five types of support approaches for students completing laboratory coursework: (1) Teaching Assistant (TA) only, (2) Generic-LLM-only model, (3) Expert-tuned-LLM-only model, (4) TA + Generic LLM model, and (5) TA + Expert-tuned LLM model. Our key findings are as follows: I. Compared to the historical baseline with no support, students assisted by the generic-LLM-only model did not show a significant improvement in performance. II. Teaching assistant involvement was associated with marked improvements in student outcomes, and performance across all TA-involved approaches showed little variation. III. The expert-tuned LLM was more effective than the generic LLM in improving student outcomes. IV. The combined TA + LLM configurations enhanced learning efficiency overall, although they required greater time investment in the early stages of the course. These results highlight the promising role of LLM technologies in the future of engineering education, while also underscoring the continued importance of domain-specific expertise in delivering effective learning support.

1. Introduction

Artificial intelligence (AI) technologies, particularly large language models (LLMs) and generative AI, are rapidly transforming both professional workflows and educational practices. These technologies have not only revolutionized content creation and automation in fields such as programming (Copilot; Wermelinger, 2023), writing (ChatGPT; Imran & Almusharraf, 2023), and translation (DeepL; Polakova & Klimova, 2023), but also initiated a shift in engineering education. Recent studies indicate that LLMs with reasoning capabilities can effectively support self-paced learning and problem-solving (Englmeier & Contreras, 2024; S. Ma et al., 2025). As a result, enabling students to effectively integrate these tools into their learning and experimentation processes may play a critical role in the future of engineering education.
Despite the increasing competence of senior engineering students in using LLM tools, existing studies fall short in addressing how LLMs can be incorporated into course delivery. In particular, there is a lack of research on the collaborative roles of LLMs and traditional teaching assistants, as well as on comparisons of the educational impact of generic and expert-tuned LLM models. Understanding these relationships is critical to advancing LLM-driven educational practices and training future generations of engineers.
This work explores not only the utility of LLMs in electronics laboratory instruction, but also the collaborative potential between LLMs and human TAs. This integration is particularly crucial for students in electrical and electronics (EE) engineering programs. We conducted a course reform involving 218 students from three EE-related majors at a science and engineering university. Students were grouped into five categories, each receiving a different form of assistant support. Final outcomes were compared against a historical baseline of students from the same majors who completed the course without LLM or TA assistance. To assess the value of domain-specific expertise, we also developed an expert-tuned LLM and used it as a reference model. Our findings demonstrate the critical role LLMs can play in improving the quality of engineering laboratory education and provide empirical evidence for the future integration of LLM technologies into STEM curricula at scale.
The remainder of this paper is organized as follows. Section 2 reviews prior studies of LLM technologies in higher education and motivates our work. Section 3 introduces the proposed multiple assistance approaches for laboratory courses, along with the expert-tuned LLM developed for this study. Section 4 presents the experimental results and corresponding analysis. Finally, Section 5 concludes the paper by summarizing the key findings, discussing current limitations, and outlining directions for future research.

2. Literature Review and Motivations

2.1. Literature Review

Integrating AI technologies into higher education holds significant potential for advancement, particularly with the emergence of LLMs. These models can enhance undergraduate students’ abilities in self-directed learning (Askarbekuly & Aničić, 2024), self-assessment, and content transformation (Tan et al., 2024). However, integrating LLMs into higher education classrooms in a reliable and effective manner remains a significant challenge. On the technical side, issues such as hallucination (Ho et al., 2024) can lead to inaccurate outputs, particularly in response to domain-specific queries. Furthermore, effective use often requires carefully engineered prompts (Jacobsen & Weber, 2025), posing a barrier for students lacking experience in articulating precise domain-specific instructions. On the student side, the integration of LLMs may raise psychological concerns due to the lack of real human interaction and personalized emotional engagement (Morton, 2025; X. Zheng et al., 2025). In addition, the deployment of such models introduces a range of privacy and ethical challenges, as noted in recent studies (Mienye & Swart, 2025; Yan et al., 2024; Zhui et al., 2024). Furthermore, the convenience and accessibility of LLMs can lead to over-reliance, potentially undermining students’ critical thinking and problem-solving abilities (Lyu et al., 2024; Rossi et al., 2025). Prior research has shown that student engagement with LLM technologies varies significantly with economic background and institutional prestige. Students from wealthier families or top-tier institutions generally experience greater benefits from LLM integration, while those from under-resourced environments may face barriers to access and engagement (Li et al., 2025). On the other hand, prior research has shown that providing clear and specific instructions can lead LLMs to produce more relevant and less biased outputs (Razafinirina et al., 2024). Similarly, carefully crafted prompts have been found to mitigate model hallucinations and improve response fidelity (Ho et al., 2024; Taveekitworachai et al., 2024). Such accuracy is particularly crucial in educational contexts (Jishan et al., 2024; Kim et al., 2023; Y. Ma et al., 2024), where the correctness and relevance of generated content have a direct impact on student learning outcomes.
Integrating LLMs into higher education classrooms can take various forms. Among the most common approaches are the use of LLMs for automated scoring (Lee & Song, 2024; Pan & Nehm, 2025) and feedback systems (Jia et al., 2024). However, these applications primarily serve to offload certain instructional tasks from teachers, while offering limited direct support for enhancing students’ in-class learning experiences. Arora et al. (2025) integrated LLMs into an advanced programming course to evaluate the extent to which LLMs can assist students in completing assignments. Their findings showed that LLMs can significantly improve learning efficiency. However, they also identified challenges related to over-reliance and misuse. The study suggests that well-designed instructional strategies and proper guidance are essential for helping students understand and responsibly use AI tools, ultimately enhancing their readiness for future technology-driven environments. Similarly, the integration of customized role-based agent techniques (Song et al., 2024) has been shown to significantly improve the accuracy, guidance quality, and completeness of LLM responses in programming education settings. A previous study (O’Keefe, 2024) introduced LLMs into English public speaking courses to address the excessive time students spent drafting and revising speeches. The integration of LLM tools effectively improved classroom efficiency and enhanced students’ spoken English proficiency. Jill Watson (Maiti & Goel, 2024), a virtual teaching assistant, was deployed across three courses at Georgia Tech and Wiregrass Technical College. It effectively guided students from handling basic inquiries to engaging in higher-order question resolution as the course progressed. Similarly, researchers introduced an LLM-based virtual mentoring system into the engineering course System Architecture Dialogue Framework Design (Gürtl et al., 2024), and conducted a comparative study with traditional human instructors. The findings indicate that the LLM tutor effectively supported novice learners in understanding the fundamental principles and workflows of architecture design. However, it still fell short in providing the deep, adaptive feedback that experienced human instructors are capable of delivering.
Moreover, recent studies have explored the impact of LLMs in STEM higher education. Prior studies (Xu & Ouyang, 2022; Yang et al., 2025a) highlighted that LLMs not only enhance domain-specific learning but also promote a shift from instructor-led teaching to student-centered learning paradigms. However, concerns have also been raised. For instance, past research (Marquez-Carpintero et al., 2025; Q. Zheng et al., 2023) pointed out that the absence of real human interaction and emotionally personalized feedback may lead to psychological issues among students. Moreover, previous studies (Lyu et al., 2024; Rossi et al., 2025) indicated that heavy reliance on LLMs could potentially weaken students’ independent reasoning and critical thinking skills. Beyond the aforementioned applications in programming education, a growing body of research has investigated the integration of LLMs into broader STEM undergraduate curricula. Studies have shown that LLMs can enhance students’ conceptual understanding, support method explanation, and offer guidance on syntactic structures (Matzakos & Moundridou, 2025), thereby complementing traditional tools such as Computer Algebra Systems (CAS). Specifically, CAS primarily focuses on accurate computation and result validation, whereas LLMs support cognitive processes such as exploration, hypothesis generation, and metacognitive reflection, thereby enabling inquiry-driven learning. In classroom settings, LLMs have also been implemented as “teachable agents” (Rogers et al., 2025), where they act as novices, prompting students to engage in teaching activities. This pedagogical strategy encourages students to articulate their reasoning, reflect on their understanding, and identify conceptual gaps.
Beyond class/course support, recent studies have further explored the role of LLMs in enhancing STEM higher education through curriculum evaluation and instructional design. For example, LLMs could streamline syllabus analysis and facilitate interdisciplinary exploration (Hu et al., 2024; Meissner et al., 2024). Furthermore, LLMs have been adopted as benchmarking tools (Ali et al., 2024; Meissner et al., 2024), enabling instructors to assess model performance across varying cognitive levels and to reflect on course alignment and difficulty. Other work has leveraged role-reversal strategies, such as configuring LLMs as virtual students to simulate performance under course assessments (Jamieson et al., 2025), offering instructors additional insights into content effectiveness and pedagogical balance.
To better align with the evolving demands of STEM education, especially under the influence of LLM technologies, comprehensive empirical investigations are essential to guide pedagogical innovation and curriculum development.

2.2. Motivations

Previous studies on integrating LLMs into higher education generally fall into three major themes:
  • The development of automated grading and feedback systems aimed at reducing instructors’ workload and improving post-class efficiency.
  • The deployment of LLMs in classroom teaching as a replacement or supplement for traditional teaching assistants, often accompanied by comparative evaluations.
  • The construction of expert-tuned LLMs to improve output accuracy and contextual relevance in discipline-specific applications.
However, existing studies fall short in exploring the integration of LLMs into higher education settings that demand deeper disciplinary engagement—particularly STEM-oriented laboratory courses. Two key limitations can be identified as follows:
  • Domain-Specific Demands: Prior works have largely focused on tasks such as writing, translation, and programming, where abundant datasets and pretrained knowledge bases enable generic-LLMs to perform effectively. In contrast, STEM lab courses often require specialized domain knowledge and context-aware problem-solving. This necessitates both rigorous evaluation of generic-LLMs and the development of expert-tuned models tailored to specific STEM domains.
  • Binary Treatment Design: Most previous studies have adopted a binary experimental design—comparing classrooms with LLM assistance versus those without. However, in real-world STEM education, especially within lab-based courses, instructional support is rarely singular. It often involves a combination of teaching assistants, peer collaboration, and technological aids. Therefore, it is critical to explore a more comprehensive set of assistance configurations to understand the nuanced impact of LLM integration.

3. Methodology

This section provides a detailed overview of the lab course design presented in this work. First, we introduce the background of the course setup, including its relevance to prerequisite courses and the teaching approaches used in previous iterations. Next, we describe the five different assistant models incorporated into the lab course, including the expert model specifically developed for this study. Finally, we present the student participation details and explain the data collection methodology.

3.1. Background of Lab Course

The experimental implementation in this work is based on the FPGA-Based Embedded Design, a lab course offered in the third year of undergraduate study. The course is targeted at students majoring in Integrated Circuit Design and Integrated Systems, Microelectronics Science and Engineering, and Electronic Science and Technology. Prior to enrolling in this course, students are required to complete a set of foundational core courses, including Digital Logic Circuits, Digital Circuits: Lab/Experimentation, Embedded Systems, Microcomputer: Principles and Application, Sensor Principles and Technology, Programmable Device Application Technology, and Hardware Description Languages.
FPGA-Based Embedded Design is structured as a self-directed learning-oriented lab course. Figure 1 illustrates the previous process of this course before adopting the different assistance approaches proposed in this work. Over a six-month period, students independently apply their prior knowledge and conduct relevant literature reviews to design and implement a custom application prototype on an FPGA platform. The application domain is unrestricted. Upon completing their prototype, students submit an anonymous project video and supporting documentation. Final grading is determined via a single-blind review process conducted by three different teachers.

3.2. Method Overview

Based on the above background, this study conducts a quantitative field experiment (Slavin, 2002) within an FPGA-based embedded systems course. Specifically, we adopt a stratified randomized multi-arm experimental design (Angrist et al., 2009; Reichardt, 2002) conducted in a real classroom setting. To strengthen the evaluation, we complement this with a quasi-experimental comparison against historical cohorts (Torgerson, 2008). The effectiveness and robustness of different instructional configurations are assessed through anonymized single-blind grading and statistical analysis.
Following the overview design described above, this study is structured around three overarching research questions that align with the redesigned instructional process, as shown in Figure 2. RQ1 (performance across assistance modes): How do different assistance configurations—ranging from human-only to AI-only and hybrid modes—affect students’ academic performance and the quality of their FPGA design outcomes? RQ2 (value of domain specialization): Does an expert-tuned LLM, trained on HDL-specific corpora, outperform generic LLMs in both technical accuracy and educational impact? RQ3 (operational efficiency): How do different assistance configurations influence TA workload patterns and students’ autonomous engagement over time?
To address these questions, we redesigned the FPGA-Based Embedded Design course into a stratified randomized multi-arm field experiment complemented by a quasi-experimental comparison to historical cohorts (Section 3.3). The redesign introduced five assistance modes: (1) TA-only, (2) Generic-LLM-only, (3) Expert-tuned-LLM-only, (4) TA + Generic LLM, and (5) TA + Expert-tuned LLM. It also restricted cross-access to models, standardized deliverables, and applied single-blind multi-rater grading.
The expert-tuned LLM (Section 3.4) was developed through a multi-agent retrieval-augmented generation (RAG) pipeline that integrates HDL-specific datasets, discriminator and prompt-review agents, and tool-specific submodels (e.g., Quartus and Vivado). Approximately 300,000 verified HDL pairs were curated from GitHub, OpenCores, and HDLBits, forming a high-fidelity training corpus for domain specialization. Model performance was internally benchmarked using HDL pass@k metrics against public models (ChatGPT-4o, DeepSeek, Gemini 1.5 Pro, and Grok 1.5).
A total of 218 undergraduate students from three EE-related majors participated, with 177 students from prior semesters serving as a historical baseline. Students were stratified by prerequisite course performance and randomly assigned within grade bands to ensure balance across conditions (Section 3.5). Each mode followed an identical semester schedule and deliverable format; hybrid modes received structured open-lab guidance emphasizing prompt refinement rather than direct problem-solving. Data collection (Section 3.6), consistent with the schematic’s three-tier framework, encompassed three levels: model-level data, including HDL code generation pass rates and benchmarking results; student-level data, including final project grades and prerequisite course grades for the 218 participants and 177 baseline peers; and operational data, including TA assistance duration, open-lab participation, and temporal workload distributions.
Collectively, this structure establishes a three-dimensional analytic framework linking instructional performance (RQ1), model specialization (RQ2), and human–AI operational synergy (RQ3), enabling a holistic examination of how expert-tuned LLMs and teaching assistants jointly shape learning outcomes in advanced engineering education.

3.3. Framework of Redesign Course

The process of the redesigned FPGA-Based Embedded Design course is shown in Figure 3. A total of five different assistant modes are introduced to support students in completing their custom prototype designs, as listed in Table 1: (1) Teaching Assistant (TA) only, (2) Generic-LLM-only model, (3) Expert-tuned-LLM-only model, (4) TA + Generic LLM model, and (5) TA + Expert-tuned LLM model. The TAs consist of second-year or higher doctoral students and Ph.D. candidates from relevant disciplines. The generic LLMs include publicly available conversational AI systems such as ChatGPT, DeepSeek, Gemini, and Grok. While using a single generic LLM might offer stricter experimental control, we intentionally did not impose this restriction to align with real-world classroom conditions. In practice, it is difficult to prevent students from privately using generic LLMs outside of prescribed constraints. The expert-tuned LLM is a model specifically developed for this course through knowledge distillation based on the previous ChaTCL design. This model is provided only to students in Modes (3) and (5), and its design will be described in detail in Section 3.4.
In Mode (1), the TA is limited to addressing practical engineering issues and writing-related questions encountered during the design process. The TA does not respond to any inquiries regarding LLM usage, even if students in Mode (1) privately use LLMs. In Modes (2) and (3), no human TA is involved; students complete their designs solely with the assistance of LLMs, differing only in whether they use a generic model or an expert-tuned model. Under the current configuration, Mode (2) students cannot access the expert model, although it is not possible to fully prevent Mode (3) students from privately using generic LLMs. In Modes (4) and (5), TAs do not directly assist with practical engineering or writing issues. Their contribution is confined to helping students enhance or adjust prompts used in LLM interactions. The design of Modes (4) and (5) is grounded in prior studies (Jishan et al., 2024; Kim et al., 2023; Y. Ma et al., 2024) demonstrating that prompt quality substantially affects LLM performance. This effect is amplified in specialized fields, where well-structured prompts significantly improve LLM outputs. For STEM undergraduates, this underscores the challenge of precisely formulating domain-specific prompts.
Students are randomly assigned to one of these five modes, with the randomization constrained by the overall distribution of students’ prerequisite course grades to ensure comparability across modes. The detailed allocation method will be described in Section 3.5. Under different assistance modes, students conduct literature reviews, determine the project content, implement the design, and prepare both a project video and a written report. These deliverables are then anonymously submitted, and three instructors perform single-blind grading. As described in Section 3.1, the FPGA-Based Embedded Design course has historically been a student-driven experimental course. Therefore, redesigning the course to incorporate different assistant modes and assigning them randomly does not introduce unfairness.

3.4. Expert-Tuned LLM Design

FPGA-based development requires the use of Hardware Description Languages (HDL) or HLS (Chen et al., 2023), yet the limited availability of relevant data often causes generic LLMs to produce hallucinations, leading to substantial discrepancies from actual development needs (Yang et al., 2025b; Yao et al., 2025). Our work builds directly on ChaTCL (Rui et al., 2025), which demonstrated that multi-agent RAG architectures can substantially improve LLM accuracy in TCL script generation for EDA tools. Extending this paradigm, we employed knowledge distillation to construct an expert-tuned LLM specifically adapted for this course.
This study adopts a multi-agent architecture based on LLMs, aiming to accelerate the FPGA-based design process by interpreting students’ natural language inputs and generating the corresponding HDL code, as illustrated in Figure 4. A discriminator agent processes user prompts by querying the Non-Domain and Domain databases within an HDL-specific multi-agent retrieval-augmented generation (RAG) framework. The prompt-review agent then reconstructs and classifies the inputs, invoking either a non-domain agent or a tool-specific agent (e.g., Quartus and Vivado) to generate the corresponding HDL code.
To this end, we provide a fine-grained optimized FPGA development tool dataset, divided into two domains: Non-Domain and Domain. The Non-Domain subset focuses on common HDL designs, such as communication modules, buses, and systolic arrays. The Domain subset is dedicated to FPGA vendor-provided IP application source code. Each domain is further divided into two task types: command invocation and function/script generation. The data sources for training the model include GitHub repositories, engineering forums (OpenCores), and various open-source HDL problem-solving platforms (HDLBits). We leverage existing large language models to generate HDL code, with all descriptions manually reviewed and corrected to ensure accuracy. The verified descriptions and their corresponding HDL implementations are paired and stored in JSON format. The dataset contains three key fields: (1) a category field (Non-Domain or Domain); (2) an input field containing specific contextual information or task prompts; and (3) an output field containing the expected result. The data collection and validation process is supervised by ten experienced FPGA engineers, yielding approximately 300,000 high-quality training samples. In addition, data exceeding the maximum token length is pruned to avoid issues during model training. This dataset is designed to optimize LLM training performance across a range of applications and to enhance its ability to generate domain-specific HDL code.
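To make the data layout concrete, the snippet below sketches one training sample using the three fields named above. The field names follow the text, while the example values, the tokenizer object, and the MAX_TOKENS threshold are illustrative assumptions rather than details taken from the actual corpus.

```python
import json

MAX_TOKENS = 4096  # assumed pruning threshold; the paper does not state the actual limit

# Hypothetical training sample following the category/input/output layout described above.
sample = {
    "category": "Non-Domain",  # "Non-Domain" (common HDL designs) or "Domain" (vendor IP usage)
    "input": "Write a Verilog module for an 8-bit synchronous up counter with an active-high reset.",
    "output": (
        "module counter8(input clk, input rst, output reg [7:0] q);\n"
        "  always @(posedge clk) begin\n"
        "    if (rst) q <= 8'd0;\n"
        "    else     q <= q + 8'd1;\n"
        "  end\n"
        "endmodule"
    ),
}

def within_token_limit(sample, tokenizer, max_tokens=MAX_TOKENS):
    """Return True if the sample fits the assumed maximum token length (pruning step)."""
    n_tokens = len(tokenizer.encode(sample["input"] + sample["output"]))
    return n_tokens <= max_tokens

print(json.dumps(sample, indent=2))
```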
The core functionality of the proposed expert-tuned LLM is a multi-agent system that interprets user prompts and generates HDL code, implemented in two stages, as illustrated in Figure 4. Stage 1: The discriminator agent processes the natural language input and routes it to a specific RAG system for database retrieval. The prompt-review agent then reconstructs the prompt and assigns it to the appropriate agent. For Domain cases, the system determines the corresponding FPGA tool. Stage 2: Multiple specialized agents, each powered by a fine-tuned HDL-specific LLM, generate HDL code based on the reconstructed prompt. The task is routed either to a Non-Domain agent (for general HDL) or to a tool-specific agent (e.g., Quartus and Vivado). Relevant RAG database examples support script generation, and the final output is delivered to the student.
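The two-stage routing logic can be summarized in the following sketch. The agent and retrieval interfaces (classify, retrieve, rewrite, detect_tool, generate_hdl) are hypothetical placeholders standing in for the discriminator, prompt-review, and generation agents described above; only the control flow mirrors the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RoutedPrompt:
    category: str        # "Non-Domain" or "Domain"
    tool: Optional[str]  # e.g., "Quartus" or "Vivado" for Domain tasks
    prompt: str          # reconstructed prompt enriched with retrieved RAG examples

def stage1_route(user_prompt, discriminator, prompt_reviewer, rag_db):
    """Stage 1: classify the request, retrieve reference examples, and rebuild the prompt."""
    category = discriminator.classify(user_prompt)           # Non-Domain vs. Domain
    examples = rag_db[category].retrieve(user_prompt, k=3)   # RAG database lookup
    rebuilt = prompt_reviewer.rewrite(user_prompt, examples)
    tool = prompt_reviewer.detect_tool(user_prompt) if category == "Domain" else None
    return RoutedPrompt(category, tool, rebuilt)

def stage2_generate(routed, non_domain_agent, tool_agents):
    """Stage 2: dispatch the reconstructed prompt to the matching HDL-specialized agent."""
    if routed.category == "Domain":
        agent = tool_agents[routed.tool]   # e.g., {"Quartus": ..., "Vivado": ...}
    else:
        agent = non_domain_agent
    return agent.generate_hdl(routed.prompt)
```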
The evaluation methodology and data collection process for the proposed expert-tuned LLM will be described in Section 3.6, and the corresponding results and analysis will be presented in Section 4.1.

3.5. Participants

A total of 218 students participated in this study, all of whom were voluntarily enrolled in the FPGA-Based Embedded Design course, as detailed in Table 2. The students were from three academic majors: Integrated Circuit Design and Integrated Systems, Microelectronics Science and Engineering, and Electronic Science and Technology, and had completed a standard set of foundational core courses, including Digital Logic Circuits, Digital Circuits: Lab/Experimentation, Embedded Systems, Microcomputer: Principles and Application, Sensor Principles and Technology, Programmable Device Application Technology, and Hardware Description Languages. Students were made aware prior to enrollment that the course had traditionally followed a self-directed learning-oriented model without formal TA support, and that this year they would be randomly assigned to different TA-support modes.
The 218 students enrolled in this course had all completed a set of prerequisite courses. These courses used objective grading standards based on fixed answers, minimizing instructor bias. For later grouping, we mapped the scores to nine grades, as illustrated in Table 3. Following established practices (Hailikari et al., 2008; Salehi et al., 2019), we grouped the 218 students into five assistant modes using grade levels from five scalar-scored prerequisite courses, disregarding academic background. This approach maintains the original performance distribution, as illustrated in Figure 5.
Specifically, we adopted a stratified random assignment strategy to ensure that the distribution of prerequisite course grades within each assistant mode closely matches the overall grade distribution. First, within each grade level, the students were randomly shuffled. Then, following the principle of balancing group sizes as equally as possible, students in that grade were assigned to the five groups. Any remaining students due to uneven division were randomly assigned to one of the groups. Finally, the assignment results across all grade levels were merged to form the complete student allocation for each assistant mode.
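A minimal sketch of this allocation procedure is given below; the function name and data structures are illustrative, and the fixed random seed is an assumption added for reproducibility.

```python
import random
from collections import defaultdict

def stratified_assign(students, n_groups=5, seed=42):
    """Assign students to n_groups while preserving the overall grade distribution.

    `students` is a list of (student_id, grade_level) pairs, where grade_level is
    one of the nine mapped prerequisite grades (Table 3). Returns a dict mapping
    group index -> list of student ids.
    """
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for sid, grade in students:
        by_grade[grade].append(sid)

    groups = {g: [] for g in range(n_groups)}
    for grade, sids in by_grade.items():
        rng.shuffle(sids)                        # 1) shuffle within each grade level
        base, remainder = divmod(len(sids), n_groups)
        idx = 0
        for g in range(n_groups):                # 2) give every group an equal share
            groups[g].extend(sids[idx:idx + base])
            idx += base
        for sid in sids[idx:]:                   # 3) leftovers go to randomly chosen groups
            groups[rng.randrange(n_groups)].append(sid)
    return groups
```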
To verify that the performance distribution of students in each assistant mode is consistent with the overall distribution, we applied Total Variation Distance (TVD) and the Kolmogorov–Smirnov (KS) test. The results are summarized in Table 4. As shown, the TVD values for the five groups range from approximately 0.02 to 0.03, while the KS statistics fall within 0.07 to 0.09, with corresponding p-values close to 1.
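The two balance checks can be reproduced with a short script such as the one below; using scipy's two-sample KS test on the nine-level grade codes is our assumption about the exact procedure, and the helper names are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

GRADE_LEVELS = np.arange(1, 10)  # the nine mapped grade levels (Table 3), coded 1-9

def total_variation_distance(p_counts, q_counts):
    """TVD between two discrete distributions given as per-level counts."""
    p = np.asarray(p_counts, dtype=float) / np.sum(p_counts)
    q = np.asarray(q_counts, dtype=float) / np.sum(q_counts)
    return 0.5 * np.abs(p - q).sum()

def balance_check(overall_grades, group_grades):
    """Compare one assistant mode's grade distribution to the full cohort's.

    Both arguments are 1-D arrays of per-student grade codes (1-9).
    Returns (TVD, KS statistic, KS p-value).
    """
    overall = np.asarray(overall_grades)
    group = np.asarray(group_grades)
    p_counts = [np.sum(overall == g) for g in GRADE_LEVELS]
    q_counts = [np.sum(group == g) for g in GRADE_LEVELS]
    tvd = total_variation_distance(p_counts, q_counts)
    ks_stat, p_value = ks_2samp(overall, group)
    return tvd, ks_stat, p_value
```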
A total of 8 human teaching assistants participated in this course. Over the six-month duration of the course, they provided weekly in-person support in the open lab, scheduled as follows:
  • Every Tuesday, from 2:00–4:00 PM, for students in the TA-only assistant mode;
  • Every Wednesday, from 2:00–4:00 PM, for students in the TA + Generic-LLM assistant mode;
  • Every Thursday, from 2:00–4:00 PM, for students in the TA + Expert-tuned-LLM assistant mode.

3.6. Data Collection

The experimental data collection was conducted across three levels, as summarized in Table 5. First, data related to the evaluation of the expert-tuned LLM were collected. Second, student grades from the course were gathered. Lastly, relevant data from the TA perspective were recorded.
We evaluated the proposed expert-tuned LLM using a test set constructed from available sources, including GitHub repositories, CSDN forums, and user manuals. This test set is independent from the dataset used during fine-tuning. For comparison, several generic LLMs were evaluated using the same benchmark, such as ChatGPT-4o, DeepSeek, Gemini 1.5 Pro, and Grok 1.5. The evaluation metric is the pass rate, defined as the proportion of test cases where the generated HDL code, when combined with the corresponding testbench, produces correct simulation waveforms in an EDA environment. The correctness of the simulation outputs was manually assessed by 10 experienced FPGA engineers.
For student grade analysis, we collected the final course grades and the prerequisite course grades for all 218 students who participated in the redesigned FPGA-Based Embedded Course. As a baseline for comparison, we also gathered the same data from 177 students who took the course in the previous semester.
From the TA perspective, we asked each TA to record two key metrics throughout the course: (1) the number of students who participated in the weekly in-person support sessions in the open lab, and (2) the duration of assistance provided by each TA during the course.

4. Results and Data Analysis

Upon completion of the course, three teachers evaluated the students’ anonymously submitted final projects and recorded the corresponding scores. To be more specific, the evaluation adopts a single-blind review, similar to academic paper peer review. Three instructors independently review each project, having access only to a project video and written report, without knowing student identities. Figure 6 presents selected examples of these student designs. The analysis of collected results is structured into three parts: First, Section 4.1 examines the effectiveness of the expert-tuned LLM. Second, Section 4.2 compares student outcomes across different assistant models. Finally, Section 4.3 draws on TA-collected observations to offer a complementary instructional perspective.

4.1. Results for Proposed LLM

We compared the proposed Expert-tuned LLM against ChatGPT-4o, DeepSeek, Gemini 1.5 Pro, and Grok 1.5, using pass@k as the evaluation metric. The pass@k metric evaluates the performance of code generation models by generating n candidate solutions for each given problem, where n ≥ k, and measuring the probability that at least one of k randomly selected candidates passes a specified test. To assess correctness, we invited 10 experienced FPGA engineers to evaluate the functional behavior of the HDL code generated by each model on the test dataset, based on simulation results in EDA tools. The results are summarized in Table 6.
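For reference, the commonly used unbiased pass@k estimator can be computed as follows, given n generated samples per problem of which c pass the engineer-verified simulation check. This is a generic sketch rather than the authors' evaluation script, and the example counts are made up.

```python
import math
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: probability that at least one of k samples
    drawn from n generations (c of which pass) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over a benchmark; `results` is a list of (n, c) pairs per problem."""
    return float(np.mean([pass_at_k(n, c, k) for n, c in results]))

# Illustrative usage: three problems, 20 generations each, with 12, 0, and 5 passing samples.
print(mean_pass_at_k([(20, 12), (20, 0), (20, 5)], k=5))
```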
The proposed model achieves the highest performance across all evaluated settings, reaching 60.7% in pass@1, 78.5% in pass@5, and 88.4% in pass@10. ChatGPT-4o shows competitive performance, while Gemini 1.5 Pro exhibits comparable pass@1 accuracy (47.9%) but slightly higher pass@10 (79.3%). DeepSeek shows relatively low single-attempt accuracy (27.9%) and modest improvements with multiple attempts (55.6% for pass@10), indicating limited diversity or correctness in generated outputs. Grok 1.5 performs better than DeepSeek but remains significantly behind the proposed model, particularly in the multi-attempt settings. Overall, the results indicate that the proposed approach not only delivers superior one-shot accuracy but also maintains the highest success rates when leveraging multiple attempts, highlighting its robustness and reliability in HDL code generation tasks.

4.2. Results for Student Grades

To evaluate the impact of different assistant modes on students’ grade transitions, we collected grade data from 218 students in the current course offering and 177 students from a previous cohort, and converted their raw scores into discrete grade levels. As illustrated in Figure 7, a heatmap visualizes the distribution of grade changes under each assistant mode. For the previous class (independent design), grade transitions are heavily concentrated along the main diagonal, indicating minimal change. Only 21% of students improved by at least one grade level, while 33% performed worse compared to their prior prerequisite course performance. A similar trend persists in the Generic-LLM-Only mode, where 32% of students showed grade improvement. However, statistical analysis indicates that the overall distribution does not significantly differ from the baseline (p > 0.1). In contrast, all modes involving TAs yielded statistically significant and consistent improvements in grade transitions. In the TA-only mode, nearly 59% of students improved by at least one grade level, with less than 8% experiencing a decline. Comparable results were observed in TA + Generic-LLM mode (52% improved, 4% declined) and TA + Expert-tuned-LLM mode (41% improved, 9% declined), demonstrating the strong and stabilizing influence of human-in-the-loop guidance in LLM-assisted educational settings. Based on our design and logs, we see several plausible mechanisms: in the hybrid modes, TAs were explicitly restricted to prompt coaching and did not provide direct help on engineering or writing issues, whereas TA-only offered such domain help. This design choice can naturally favor TA-only in complex domain-specific tasks. Furthermore, when comparing the expert-tuned-LLM mode to the generic-LLM mode, we observe a 23% increase in the proportion of students with grade improvements and a 32% reduction in those experiencing declines. This indicates that even without direct human intervention, model specialization significantly enhances the educational effectiveness of LLM-based assistance.
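As an illustration of how such comparisons can be run, the sketch below summarizes per-mode grade transitions and tests a mode's grade-change distribution against the historical baseline. The paper does not name the statistical test behind the reported p-values, so the Mann–Whitney U test here is an assumption, and the helper names are illustrative.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def transition_summary(prereq_grades, final_grades):
    """Share of students in one assistant mode who improved, stayed level, or declined.

    Both inputs are integer grade codes (1 = lowest ... 9 = highest), one per student.
    """
    delta = np.asarray(final_grades) - np.asarray(prereq_grades)
    return {
        "improved": float(np.mean(delta >= 1)),
        "unchanged": float(np.mean(delta == 0)),
        "declined": float(np.mean(delta <= -1)),
    }

def compare_to_baseline(mode_deltas, baseline_deltas):
    """Two-sided Mann-Whitney U test on grade-change distributions
    (one plausible choice; the paper does not specify its test)."""
    stat, p_value = mannwhitneyu(mode_deltas, baseline_deltas, alternative="two-sided")
    return stat, p_value
```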
In particular, the TA-assisted modes exhibited more frequent upward shifts in final course grades for students with mid-to-low prior performance. However, a clearer understanding of the magnitude and consistency of these changes across the cohort requires quantitative analysis beyond categorical grade transitions. Table 7 summarizes the statistical characteristics of score changes across different assistant modes. Historical data shows a positive correlation between prerequisite course grades and performance in this course. When supported only by generic-LLMs, student performance improvements were limited, as follows: 86.96% of students had score changes within [−5, 0) and [0, 5), similar to the 82.48% observed in previous self-directed course iterations. This suggests generic LLMs alone provide limited academic benefit. In contrast, assistance involving human TAs led to higher average improvements (3.74–4.19 points) and greater upward mobility. For instance, 28.26% of students in the TA + Generic-LLM mode improved by 5–10 points, with 10.87% gaining more than 10 points. Similarly, in the TA + Expert-tuned-LLM mode, 27.73% improved within [0, 5), and over 29% exceeded 5-point improvements. Notably, no students in the TA-involved groups experienced a drop greater than 5 points, indicating that these modes not only boost performance but also minimize the risk of significant regression—highlighting their stability and instructional value.
While the standalone use of generic-LLMs offers limited benefits, the distinction between generic-LLM and expert-tuned-LLM is significant. The expert-tuned-LLM yields a wider range of positive score changes, with 11.63% of students improving by more than 10 points, compared to just 2.17% under the generic-LLM. Additionally, it reduces the concentration of score declines within the [−10, −5] range from 18.60% to 6.52%, indicating stronger adaptability to varying student needs. This finding is consistent with the superior accuracy of the expert-tuned LLM demonstrated in Section 4.1.

4.3. TA-Collected Observations

The above results highlight the significance of human TA involvement in improving students’ performance in the course. To further compare the differences across various TA-supported modes, teaching assistants recorded the duration of each monthly support session in the open lab, along with the student attendance rate.
Figure 8 illustrates the attendance trends in open lab sessions over the course duration. In the TA-only mode, participation remained high during the initial phase (approximately 98% in the 1st month), gradually declined to 54% by the 5th month, and then rebounded to 85% as the project deadline approached. In contrast, both LLM-supported modes exhibited a more significant decline in mid-course participation, reaching levels below 15% in the 5th month. Rather than indicating a loss of motivation, this steep decline in attendance during the middle phase of LLM-supported modes may reflect increased learner independence. As students became more proficient in prompt formulation and problem-solving, they increasingly relied on LLMs to address routine issues, thereby reducing their dependence on in-person TA support.
Figure 9 presents the weekly TA support time recorded across three instructional modes throughout the 6-month course period. In the TA-only mode, teaching assistants consistently provided 2 h of in-person support each week, regardless of fluctuations in student attendance. This steady level of engagement suggests that student questions requiring human guidance continued to arise throughout the course, and that lower attendance did not translate to reduced TA workload. In contrast, both TA + Generic-LLM and TA + Expert-tuned LLM modes show a distinct trend: high TA involvement in the early stages, which gradually declined to 0.5 to 1.0 h per week after the 2nd month, followed by an uptick in support time around the 5th month, coinciding with increased pressure approaching final project deadlines. The elevated early-stage TA workload in hybrid modes highlights the time invested in coaching students on how to formulate effective prompts. This front-loaded support enabled the LLMs to provide more targeted and autonomous assistance in subsequent weeks, thereby reducing the sustained burden on TAs during the middle phase of the course.

5. Conclusions, Limitations, and Future Works

5.1. Conclusions

This study evaluates the impact of different instructional support modes on student performance in an experimental course for electrical and computer engineering majors. A previously self-directed learning-oriented course was restructured into five distinct assistant models: (1) TA only, (2) Generic-LLM-only, (3) Expert-tuned-LLM-only, (4) TA + Generic LLM, and (5) TA + Expert-tuned LLM. To support this design, we also developed a dedicated Expert-tuned LLM specifically tailored for the course content and tasks. Course outcomes demonstrate that
  • The Generic-LLM-only mode yielded limited benefits: Compared to the baseline of prior self-directed learning cohorts, this mode did not lead to statistically significant improvements in student performance. This suggests that in specialized scientific and engineering domains, the unguided use of general-purpose language models may not produce meaningful learning gains.
  • Human TA involvement yielded significant benefits: All TA participation modes led to notable improvements in student performance, with a consistent upward trend in grades and a marked reduction in the risk of score decline. These results strongly affirm the enduring value of human expertise in lab-based instructional settings, particularly in guiding complex, hands-on learning tasks.
  • The expert-tuned LLM outperforms generic LLMs in domain-specific contexts: The expert-tuned LLM demonstrates superior performance in both HDL code generation accuracy and student learning outcomes, even in the absence of direct TA involvement. This highlights the critical importance of domain-specific optimization for educational AI tools, particularly in technical fields like hardware design.
  • Synergistic Effects in Hybrid Modes: The TA + LLM modes achieve the best balance between performance improvement and learner autonomy. Early-stage guidance from human TAs appeared to foster more effective and independent use of LLMs in later phases of the course. This highlights the pedagogical advantage of structured hybrid learning environments where human guidance and AI support are strategically integrated.

5.2. Limitations

Several limitations should be acknowledged:
  • The study focused exclusively on an FPGA-based embedded design course within electronics-related majors, which may limit generalizability to other STEM domains or non-laboratory courses.
  • Although the differences in tested performance among generic LLMs under the course task requirements were minor (as shown in Table 6), it would be preferable to control for all variables by restricting students to a single generic LLM.
  • While attendance logs and TA working hours provide objective measures of engagement and workload, the study does not capture qualitative aspects of student-TA or student-LLM interactions (such as the complexity of questions or the presence of hallucinations in model responses).
  • Cross-condition contamination may have occurred. Students in the LLM-only modes may have informally consulted human TAs, while those assigned to the expert-tuned LLM condition may have accessed generic-LLMs. Although such behaviors were discouraged, they cannot be entirely ruled out.
  • Although this study included 218 students, the data was collected within a single academic term. Longer-term experimentation and multi-semester data collection would help reduce statistical noise and provide more stable observations. Such longitudinal studies could also uncover finer-grained distinctions among the three TA-support models.

5.3. Future Works

Future research could extend this work in several directions. First, the multi-modal instructional framework should be applied to other STEM lab-based settings (such as robotics or chemical process control) to evaluate its generalizability across disciplines. Second, automated analysis of prompt–response interactions (e.g., prompt clarity, response hallucination rate, or task complexity) would offer deeper insights into how prompt quality affects learning efficacy. Finally, integrating these lines of evidence will be instrumental in shaping data-informed policies for hybrid instruction and guiding the next generation of STEM education strategies.

Author Contributions

Conceptualization, Z.L. and S.H.; methodology, S.H.; software, S.H.; validation, S.H.; formal analysis, S.H.; investigation, Z.L. and B.Y.; resources, S.H.; writing—original draft preparation, Z.L. and B.Y.; writing—review and editing, Z.L. and S.H.; visualization, Z.L. and B.Y.; supervision, S.H.; project administration, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The analyses used de-identified data collected prior to this study.

Informed Consent Statement

Prior to participation, all individuals were informed about the study’s aims and the anonymous and voluntary nature of the course through an initial statement in the enrollment notification. This statement also made clear that by participating, they consent to the anonymized use of their data solely for statistical analysis. We have ensured that it is technically impossible to identify any participant from the data collected, maintaining the strictest levels of confidentiality and data protection.

Data Availability Statement

The data presented in this article are available within the text. Additional data can be requested from the corresponding author.

Acknowledgments

The authors would like to sincerely thank the VeriMake Innovation Lab for providing the GPU resources required for model training, and Yanxiang Zhu from VeriMake for his technical support in the design of the expert-tuned LLM. The authors also express deep appreciation to Renping Wang and Wei Lin from Fuzhou University for their participation and support in the course design. Special thanks are also extended to Weibin Jiang, Yajing Liu, Jing Yang, Xinhai Wang, Weifeng Lin, Huangbin Su, Zihang Wu, and Xinlin Xiao from Fuzhou University for their assistance throughout the course.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI	Artificial Intelligence
EE	Electrical and Electronics
EDA	Electronic Design Automation
FPGA	Field-Programmable Gate Array
GPU	Graphics Processing Unit
HDL	Hardware Description Languages
IP	Intellectual Property
JSON	JavaScript Object Notation
KS	Kolmogorov–Smirnov
LLM	Large Language Model
RAG	Retrieval-Augmented Generation
STEM	Science, Technology, Engineering and Mathematics
TA	Teaching Assistant
TVD	Total Variation Distance

References

  1. Ali, M., Rao, P., Mai, Y., & Xie, B. (2024, August 13–15). Using benchmarking infrastructure to evaluate LLM performance on CS concept inventories: Challenges, opportunities, and critiques. 2024 ACM Conference on International Computing Education Research (Volume 1, pp. 452–468), Melbourne, Australia. [Google Scholar] [CrossRef]
  2. Angrist, J., Lang, D., & Oreopoulos, P. (2009). Incentives and services for college achievement: Evidence from a randomized trial. American Economic Journal: Applied Economics, 1(1), 136–163. [Google Scholar] [CrossRef]
  3. Arora, U., Garg, A., Gupta, A., Jain, S., Mehta, R., Oberoi, R., Prachi, Raina, A., Saini, M., Sharma, S., Singh, J., Tyagi, S., & Kumar, D. (2025, February 12–13). Analyzing LLM usage in an advanced computing class in India. 27th Australasian Computing Education Conference (pp. 154–163), Brisbane, Australia. [Google Scholar] [CrossRef]
  4. Askarbekuly, N., & Aničić, N. (2024). LLM examiner: Automating assessment in informal self-directed e-learning using ChatGPT. Knowledge and Information Systems, 66(10), 6133–6150. [Google Scholar] [CrossRef]
  5. Chen, R., Zhang, H., Li, S., Tang, E., Yu, J., & Wang, K. (2023, September 4–8). Graph-opu: A highly integrated fpga-based overlay processor for graph neural networks. 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL) (pp. 228–234), Gothenburg, Sweden. [Google Scholar] [CrossRef]
  6. Englmeier, K., & Contreras, P. (2024, August 29–31). How AI can help learners to develop conceptual knowledge in digital learning environments. 2024 IEEE 12th International Conference on Intelligent Systems (IS) (pp. 1–6), Varna, Bulgaria. [Google Scholar] [CrossRef]
  7. Gürtl, S., Scharf, D., Thrainer, C., Gütl, C., & Steinmaurer, A. (2024). Design and evaluation of an LLM-based mentor for software architecture in higher education project management classes. In International conference on interactive collaborative learning (pp. 375–386). Springer. [Google Scholar] [CrossRef]
  8. Hailikari, T., Katajavuori, N., & Lindblom-Ylanne, S. (2008). The relevance of prior knowledge in learning and instructional design. American Journal of Pharmaceutical Education, 72(5), 113. [Google Scholar] [CrossRef] [PubMed]
  9. Ho, H.-T., Ly, D.-T., & Nguyen, L. V. (2024, November 3–6). Mitigating hallucinations in large language models for educational application. 2024 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia) (pp. 1–4), Sheraton Grand Danang, Vietnam. [Google Scholar] [CrossRef]
  10. Hu, B., Zheng, L., Zhu, J., Ding, L., Wang, Y., & Gu, X. (2024). Teaching plan generation and evaluation with GPT-4: Unleashing the potential of LLM in instructional design. IEEE Transactions on Learning Technologies, 17, 1445–1459. [Google Scholar] [CrossRef]
  11. Imran, M., & Almusharraf, N. (2023). Analyzing the role of ChatGPT as a writing assistant at higher education level: A systematic review of the literature. Contemporary Educational Technology, 15(4), ep464. [Google Scholar] [CrossRef] [PubMed]
  12. Jacobsen, L. J., & Weber, K. E. (2025). The promises and pitfalls of large language models as feedback providers: A study of prompt engineering and the quality of AI-driven feedback. AI, 6(2), 35. [Google Scholar] [CrossRef]
  13. Jamieson, P., Bhunia, S., Ricco, G. D., Swanson, B. A., & Van Scoy, B. (2025, June 22–25). LLM prompting methodology and taxonomy to benchmark our engineering curriculums. 2025 ASEE Annual Conference & Exposition, Montreal, QC, Canada. [Google Scholar] [CrossRef]
  14. Jia, Q., Cui, J., Du, H., Rashid, P., Xi, R., Li, R., & Gehringer, E. (2024, July 14–17). LLM-generated feedback in real classes and beyond: Perspectives from students and instructors. 17th International Conference on Educational Data Mining (pp. 862–867), Atlanta, GA, USA. [Google Scholar] [CrossRef]
  15. Jishan, M. A., Allvi, M. W., & Rifat, M. A. K. (2024, December 11–12). Analyzing user prompt quality: Insights from data. 2024 International Conference on Decision aid Sciences and Applications (DASA) (pp. 1–5), Manama, Bahrain. [Google Scholar] [CrossRef]
  16. Kim, J., Lee, S., Hun Han, S., Park, S., Lee, J., Jeong, K., & Kang, P. (2023, November 1). Which is better? Exploring prompting strategy for LLM-based metrics. 4th Workshop on Evaluation and Comparison of NLP Systems (pp. 164–183), Bali, Indonesia. [Google Scholar] [CrossRef]
  17. Lee, J. X., & Song, Y.-T. (2024, July 5–7). College exam grader using LLM AI models. 2024 IEEE/ACIS 27th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD) (pp. 282–289), Beijing, China. [Google Scholar] [CrossRef]
  18. Li, R., Li, M., & Qiao, W. (2025). Engineering students’ use of large language model tools: An empirical study based on a survey of students from 12 universities. Education Sciences, 15(3), 280. [Google Scholar] [CrossRef]
  19. Lyu, W., Wang, Y., Chung, T. R., Sun, Y., & Zhang, Y. (2024, July 18–20). Evaluating the effectiveness of LLMs in introductory computer science education: A semester-long field study. Eleventh ACM Conference on Learning @ Scale (pp. 63–74), Atlanta, GA, USA. [Google Scholar] [CrossRef]
  20. Ma, S., Wang, J., Zhang, Y., Ma, X., & Wang, A. Y. (2025, April 26–May 1). DBox: Scaffolding algorithmic programming learning through learner-LLM co-decomposition. 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan. [Google Scholar] [CrossRef]
  21. Ma, Y., Shen, X., Wu, Y., Zhang, B., Backes, M., & Zhang, Y. (2024, November 12–16). The death and life of great prompts: Analyzing the evolution of LLM prompts from the structural perspective. 2024 Conference on Empirical Methods in Natural Language Processing (pp. 21990–22001), Miami, FL, USA. [Google Scholar] [CrossRef]
  22. Maiti, P., & Goel, A. K. (2024, July 14–17). How do students interact with an LLM-powered virtual teaching assistant in different educational settings? 17th International Conference on Educational Data Mining Workshops (pp. 1–9), Atlanta, GA, USA. [Google Scholar] [CrossRef]
  23. Marquez-Carpintero, L., Viejo, D., & Cazorla, M. (2025). Enhancing engineering and STEM education with vision and multimodal large language models to predict student attention. IEEE Access, 13, 114681–114695. [Google Scholar] [CrossRef]
  24. Matzakos, N., & Moundridou, M. (2025). Exploring large language models integration in higher education: A case study in a mathematics laboratory for civil engineering students. Computer Applications in Engineering Education, 33(3), e70049. [Google Scholar] [CrossRef]
  25. Meissner, R., Pögelt, A., Ihsberner, K., Grüttmüller, M., Tornack, S., Thor, A., Pengel, N., Wollersheim, H.-W., & Hardt, W. (2024). LLM-generated competence-based e-assessment items for higher education mathematics: Methodology and evaluation. Frontiers in Education, 9, 1427502. [Google Scholar] [CrossRef]
  26. Mienye, I. D., & Swart, T. G. (2025). ChatGPT in education: A review of ethical challenges and approaches to enhancing transparency and privacy. Procedia Computer Science, 254, 181–190. [Google Scholar] [CrossRef]
  27. Morton, J. L. (2025). From meaning to emotions: LLMs as artificial communication partners. AI & SOCIETY, 1–14. [Google Scholar] [CrossRef]
  28. O’Keefe, J. G. (2024). Increasing practice time for university presentation class students by implementing large language models (LLM). Fukuoka Jo Gakuin University Professional Teacher Education Support Center Educational Practice Research, 8, 33–52. [Google Scholar] [CrossRef]
  29. Pan, Y., & Nehm, R. H. (2025). Large language model and traditional machine learning scoring of evolutionary explanations: Benefits and drawbacks. Education Sciences, 15(6), 676. [Google Scholar] [CrossRef]
  30. Polakova, P., & Klimova, B. (2023). Using DeepL translator in learning English as an applied foreign language—An empirical pilot study. Heliyon, 9(8), e18595. [Google Scholar] [CrossRef] [PubMed]
  31. Razafinirina, M. A., Dimbisoa, W. G., & Mahatody, T. (2024). Pedagogical alignment of large language models (llm) for personalized learning: A survey, trends and challenges. Journal of Intelligent Learning Systems and Applications, 16(4), 448–480. [Google Scholar] [CrossRef]
  32. Reichardt, C. S. (2002). Experimental and quasi-experimental designs for generalized causal inference. JSTOR. [Google Scholar]
  33. Rogers, K., Davis, M., Maharana, M., Etheredge, P., & Chernova, S. (2025, April 26–May 1). Playing dumb to get smart: Creating and evaluating an LLM-based teachable agent within university computer science classes. 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan. [Google Scholar] [CrossRef]
  34. Rossi, V., Scantamburlo, T., & Melonio, A. (2025, July 22–26). Generative AI for non-techies: Empirical insights into LLMs in programming education for novice non-STEM learners. International Conference on Artificial Intelligence in Education (pp. 162–175), Palermo, Italy. [Google Scholar] [CrossRef]
  35. Rui, Y., Li, Y., Wang, R., Chen, R., Zhu, Y., Di, Z., Wang, X., & Ling, M. (2025, May 9–12). ChaTCL: LLM-based multi-agent RAG framework for TCL script generation. 2025 International Symposium of Electronics Design Automation (ISEDA) (pp. 736–742), Hong Kong, China. [Google Scholar] [CrossRef]
  36. Salehi, S., Burkholder, E., Lepage, G. P., Pollock, S., & Wieman, C. (2019). Demographic gaps or preparation gaps?: The large impact of incoming preparation on performance of students in introductory physics. Physical Review Physics Education Research, 15(2), 020114. [Google Scholar] [CrossRef]
  37. Slavin, R. E. (2002). Evidence-based education policies: Transforming educational practice and research. Educational Researcher, 31(7), 15–21. [Google Scholar] [CrossRef]
  38. Song, T., Zhang, H., & Xiao, Y. (2024). A high-quality generation approach for educational programming projects using LLM. IEEE Transactions on Learning Technologies, 17, 2242–2255. [Google Scholar] [CrossRef]
  39. Tan, K., Yao, J., Pang, T., Fan, C., & Song, Y. (2024). ELF: Educational LLM framework of improving and evaluating AI generated content for classroom teaching. ACM Journal of Data and Information Quality, 17, 1–23. [Google Scholar] [CrossRef]
  40. Taveekitworachai, P., Abdullah, F., & Thawonmas, R. (2024, November 12–16). Null-shot prompting: Rethinking prompting large language models with hallucination. 2024 Conference on Empirical Methods in Natural Language Processing (pp. 13321–13361), Miami, FL, USA. [Google Scholar] [CrossRef]
  41. Torgerson, D. (2008). Designing randomised trials in health, education and the social sciences: An introduction. Springer. [Google Scholar]
  42. Wermelinger, M. (2023, March 15–18). Using GitHub copilot to solve simple programming problems. 54th ACM Technical Symposium on Computer Science Education (pp. 172–178), Toronto, ON, Canada. [Google Scholar] [CrossRef]
  43. Xu, W., & Ouyang, F. (2022). The application of AI technologies in STEM education: A systematic review from 2011 to 2021. International Journal of STEM Education, 9(1), 59. [Google Scholar] [CrossRef]
  44. Yan, L., Sha, L., Zhao, L., Li, Y., Martinez-Maldonado, R., Chen, G., Li, X., Jin, Y., & Gašević, D. (2024). Practical and ethical challenges of large language models in education: A systematic scoping review. British Journal of Educational Technology, 55(1), 90–112. [Google Scholar] [CrossRef]
  45. Yang, Y., Sun, W., Sun, D., & Salas-Pilco, S. Z. (2025a). Navigating the AI-enhanced STEM education landscape: A decade of insights, trends, and opportunities. Research in Science & Technological Education, 43(3), 693–717. [Google Scholar] [CrossRef]
  46. Yang, Y., Teng, F., Liu, P., Qi, M., Lv, C., Li, J., Zhang, X., & He, Z. (2025b, March 31–April 2). Haven: Hallucination-mitigated LLM for Verilog code generation aligned with HDL engineers. 2025 Design, Automation & Test in Europe Conference (DATE) (pp. 1–7), Lyon, France. [Google Scholar] [CrossRef]
  47. Yao, X., Li, H., Chan, T. H., Xiao, W., Yuan, M., Huang, Y., Chen, L., & Yu, B. (2025). HDLdebugger: Streamlining HDL debugging with Large Language Models. ACM Transactions on Design Automation of Electronic Systems, 30, 1–26. [Google Scholar] [CrossRef]
  48. Zheng, Q., Mo, T., & Wang, X. (2023). Personalized feedback generation using LLMs: Enhancing student learning in STEM education. Journal of Advanced Computing Systems, 3(10), 8–22. [Google Scholar] [CrossRef]
  49. Zheng, X., Li, Z., Gui, X., & Luo, Y. (2025, April 26–May 1). Customizing emotional support: How do individuals construct and interact with LLM-powered chatbots. 2025 CHI Conference on Human Factors in Computing Systems, Yokohama, Japan. [Google Scholar] [CrossRef]
  50. Zhui, L., Fenghe, L., Xuehu, W., Qining, F., & Wei, R. (2024). Ethical considerations and fundamental principles of large language models in medical education. Journal of Medical Internet Research, 26, e60083. [Google Scholar] [CrossRef]
Figure 1. The process of the FPGA-Based Embedded Design course before the assistance approaches proposed in this work were adopted. The course was oriented toward self-directed learning and lacked formal instructional assistance.
Figure 2. Research method overview.
Figure 3. The proposed process of the FPGA-Based Embedded Design course with the different assistance approaches.
Figure 4. Overview of the proposed expert-tuned LLM design.
Figure 5. Distribution of prerequisite course grades for all 218 students and across assistant modes after grouping.
Figure 6. Selected examples of designs by enrolled students.
Figure 7. Heatmap comparison of the relationship between prerequisite course grades and FPGA-Based Embedded Design grades across the different TA modes and the previous class's students.
Figure 8. Open-lab participation rate by month as the course progressed.
Figure 9. TA assistance duration in open labs as the course progressed.
Table 1. Overview of assistance modes and features.
Assistance Mode | Features
Mode (1): Teaching Assistant (TA) only | The TA addresses practical engineering issues and writing-related questions but does not respond to any inquiries about LLM usage, even if students privately use an LLM.
Mode (2): Generic LLM only | No human TA is involved. Students complete their designs solely with the assistance of publicly available generic LLMs (such as ChatGPT, DeepSeek, Gemini, and Grok) and cannot access the expert-tuned model.
Mode (3): Expert-tuned LLM only | No human TA is involved. Students complete their designs solely with the assistance of the expert-tuned LLM, although private use of generic LLMs cannot be fully prevented.
Mode (4): TA + Generic LLM | TAs do not directly assist with practical engineering or writing issues; their contribution is confined to helping students refine or adjust prompts for the generic LLM.
Mode (5): TA + Expert-tuned LLM | TAs do not directly assist with practical engineering or writing issues; their contribution is confined to helping students refine or adjust prompts for the expert-tuned LLM.
Table 2. Major-wise statistics of student enrollment.
Major | Number of Students
Integrated Circuit Design and Integrated Systems | 83
Microelectronics Science and Engineering | 69
Electronic Science and Technology | 66
Total | 218
Table 3. Score-to-grade conversion.
Grade | Score Range
A+ | >95
A | 90–95
A− | 85–89
B+ | 80–84
B | 75–79
B− | 70–74
C+ | 65–69
C | 60–64
C− | <60
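For completeness, the conversion in Table 3 is a simple threshold mapping. The short Python sketch below reproduces it; the table does not say how fractional scores that fall between two listed bands (for example, 89.5) are handled, so the lower-bound comparisons here are an assumption.

```python
# Threshold mapping taken from Table 3; treatment of fractional scores between
# listed bands (e.g., 89.5) is an assumption, not stated in the table.
def score_to_grade(score: float) -> str:
    if score > 95:  return "A+"
    if score >= 90: return "A"
    if score >= 85: return "A-"
    if score >= 80: return "B+"
    if score >= 75: return "B"
    if score >= 70: return "B-"
    if score >= 65: return "C+"
    if score >= 60: return "C"
    return "C-"

# e.g., score_to_grade(92) -> "A", score_to_grade(59) -> "C-"
```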
Table 4. Comparison of student grade distributions across assistant modes and the overall distribution.
Assistant Mode | Number of Students | TVD | KS Statistic | KS p-Value
TA-Only | 39 | 0.0331 | 0.0879 | 0.9371
Generic-LLM-Only | 46 | 0.0231 | 0.0718 | 0.9800
Expert-tuned-LLM-Only | 43 | 0.0289 | 0.0669 | 0.9929
TA + Generic-LLM | 46 | 0.0231 | 0.0734 | 0.9755
TA + Expert-tuned-LLM | 44 | 0.0221 | 0.0730 | 0.9801
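The balance check summarized in Table 4 (total variation distance and a two-sample Kolmogorov–Smirnov test between each group and the pooled cohort) can be reproduced with standard tools. The following Python sketch is illustrative rather than the authors' actual script: it assumes prerequisite grades are available as letter grades per group and maps them to ordinal ranks for the KS test; the variable names (`groups`, `overall`) are hypothetical.

```python
# Illustrative sketch (not the authors' script): compare each group's
# prerequisite-grade distribution with the pooled cohort using total
# variation distance (TVD) and a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

GRADES = ["C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]  # ordinal scale from Table 3
RANK = {g: i for i, g in enumerate(GRADES)}

def tvd(sample, reference):
    """0.5 * sum of absolute differences between empirical grade frequencies."""
    p = np.array([np.mean([s == g for s in sample]) for g in GRADES])
    q = np.array([np.mean([r == g for r in reference]) for g in GRADES])
    return 0.5 * np.abs(p - q).sum()

def ks_on_grades(sample, reference):
    """Two-sample KS test after mapping letter grades to ordinal ranks."""
    return ks_2samp([RANK[g] for g in sample], [RANK[g] for g in reference])

# Hypothetical usage: `groups` maps each assistance mode to its students'
# prerequisite grades; `overall` is the pooled list for all 218 students.
# for mode, sample in groups.items():
#     res = ks_on_grades(sample, overall)
#     print(mode, f"TVD={tvd(sample, overall):.4f}",
#           f"KS={res.statistic:.4f}", f"p={res.pvalue:.4f}")
```

Small TVD values and high p-values, as reported in Table 4, are consistent with the grouped cohorts sharing similar prerequisite-grade distributions.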
Table 5. Collected data types.
Source | Types of Data
Expert-tuned LLM | HDL code generation pass rate
Student grades | Previous class students' course grades and prerequisite course grades; course grades and prerequisite course grades of the 218 enrolled students
TA perspective | Relationship between student participation and course progress
Table 6. Pass rate of HDL code generation across different LLMs.
Model | Pass@1 | Pass@5 | Pass@10
Ours | 60.7% | 78.5% | 88.4%
ChatGPT-4o | 47.9% | 75.2% | 78.7%
DeepSeek | 27.9% | 32.2% | 55.6%
Gemini 1.5 Pro | 47.9% | 65.9% | 79.3%
Grok 1.5 | 42.9% | 56.6% | 58.8%
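Pass@k figures such as those in Table 6 are commonly computed with the unbiased estimator used for code-generation benchmarks: with n generations per task of which c pass, pass@k = 1 − C(n−c, k)/C(n, k), averaged over tasks. The sketch below implements that generic estimator; the paper does not specify its exact sampling protocol, so the number of generations per task and the data layout are assumptions.

```python
# Generic sketch of the unbiased pass@k estimator commonly used for
# code-generation benchmarks; n per task and the data layout are assumed.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k sampled generations passes,
    given that c out of n generations passed for this task."""
    if n - c < k:          # too few failing samples to draw k failures
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over tasks; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical usage with 10 generations per HDL task:
# results = [(10, 4), (10, 0), (10, 7)]
# print(benchmark_pass_at_k(results, k=1), benchmark_pass_at_k(results, k=5))
```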
Table 7. Statistical characteristics of score changes across assistant modes compared to the previous class's independent design. The six rightmost columns give the proportion of students (%) in each score-change range.
Assistant Mode | Max | Min | Mean | Std. | <−10 | [−10, −5) | [−5, 0) | [0, 5) | [5, 10) | ≥10
Previous class (independent) | 25.70 | −10.00 | −0.21 | 4.78 | 0.00 | 10.73 | 44.63 | 37.85 | 4.52 | 2.26
TA-only | 45.40 | −9.60 | 4.11 | 8.58 | 0.00 | 2.56 | 17.95 | 48.72 | 20.51 | 10.26
Generic-LLM-only | 25.50 | −6.80 | 0.44 | 5.18 | 0.00 | 6.52 | 43.48 | 43.48 | 4.35 | 2.17
Expert-tuned-LLM-only | 13.00 | −9.80 | 0.07 | 5.40 | 0.00 | 18.60 | 37.21 | 27.91 | 11.63 | 4.65
TA + Generic-LLM | 16.40 | −6.90 | 4.19 | 4.73 | 0.00 | 2.17 | 8.70 | 50.00 | 28.26 | 10.87
TA + Expert-tuned-LLM | 22.10 | −3.80 | 3.74 | 5.92 | 0.00 | 0.00 | 22.73 | 47.73 | 18.18 | 11.36
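Table 7 amounts to computing descriptive statistics of each student's score change and binning the changes into fixed ranges. A minimal sketch of that summary follows, with hypothetical data and under the assumption that the sample standard deviation is the one reported.

```python
# Minimal sketch (hypothetical data): per-mode summary of score changes with
# max/min/mean/std and proportions over the fixed ranges used in Table 7.
import numpy as np

EDGES = [-10, -5, 0, 5, 10]
LABELS = ["<-10", "[-10,-5)", "[-5,0)", "[0,5)", "[5,10)", ">=10"]

def summarize(score_changes):
    x = np.asarray(score_changes, dtype=float)
    bins = np.digitize(x, EDGES)                       # indices 0..5, right-open ranges
    counts = np.bincount(bins, minlength=len(LABELS))
    props = 100.0 * counts / len(x)
    summary = {"Max": x.max(), "Min": x.min(),
               "Mean": round(x.mean(), 2), "Std.": round(x.std(ddof=1), 2)}
    summary.update(dict(zip(LABELS, np.round(props, 2))))
    return summary

# Hypothetical usage with made-up score changes for one assistance mode:
# print(summarize([4.1, -2.3, 12.0, 0.5, 7.7, -11.2]))
```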