1. Introduction
Understanding how different types of instructional content impact student learning and arousal is critical in this era of advanced artificial intelligence (AI). In recent years, Large Language Models (LLMs) such as ChatGPT have prompted critical debate on the way educational content is generated and consumed [
1]. These models can produce extensive, coherent, and contextually relevant text, raising questions about their effectiveness compared to traditional, human-written educational materials.
Prior studies have highlighted the potential of LLMs to provide tailored learning experiences for individual students, allowing students to participate in a curriculum adaptive to their respective paces of learning [
2]. Other studies have highlighted the potential of LLMs to aid students' problem solving, critical thinking, and motivation, as demonstrated by Vazquez-Cano et al. (2021) and Kim and Lee (2023) [
3,
4]. However, the discourse on the effectiveness of Generative AI has also produced opposing views. A 2025 study by Gerlich proposed that as people rely on AI technology more heavily and for extended periods, they become less likely to engage in critical thinking, since they can offload cognitively demanding tasks to AI [
5]. This behavior—referred to as cognitive offloading—is detrimental and ineffective in boosting learners’ efficacy. Another study by Abbas et al. (2024) [
6] also corroborated this effect of Generative AI, reporting that students facing higher pressure (time, academic workload) were more likely to rely on AI [
6]. Prolonged and repeated reliance on AI was, in turn, associated with memory impairment and delayed work. This two-sided argument is aptly summarized in a meta-analysis by Huang et al., which found that the effects of LLMs on students can be positive, although uneven and demonstrated only at small scale. Furthermore, that study also highlights that the nature of these effects depends heavily on how student agency and participation are scaffolded [
7]. The ongoing discourse, and the breadth of factors involved, further highlight the need for empirical evidence on the effectiveness of AI and LLMs in education.
Most studies on this topic examine the effect of LLMs on learning over long periods (e.g., months at a time) through post hoc means. As seen in previous works, such longitudinal studies focus primarily on the cumulative effects of studying with AI, for example, cognitive offloading. In contrast, there is a relative dearth of research on in situ measures of the effect of LLMs on learning performance. Studying the effect of LLMs in the short term opens new avenues for research and explanation, as more data can be collected from each participant. Most notably, short-term studies allow research on more granular interactions between education and AI with suitable control of external factors, which in turn complements what existing studies suggest about long-term LLM use in learning new knowledge. In summary, while longitudinal studies confirm broader trends and the effects of reliance on AI, short-term studies provide granular, context-specific insights through case studies that maintain verisimilitude with real-life scenarios that longitudinal studies fail to capture (e.g., short-term or one-off querying about a topic or new information). Within these parameters, this study also adopts a distinct approach by incorporating biometric data.
Another aspect of interest is the link between emotions and learning. It is widely known that affective states have a significant influence on efficiency, productivity and behaviors. Learning is no different, as emotions are intertwined with every aspect of the learning process [
8]. To elaborate, positive emotions, such as joy, facilitate the learning process, while negative emotions, such as fear, inhibit it [
9]. As such, investigating the emotions that occur while learning from LLMs could deepen our understanding of the learning process when LLMs are used. In this regard, Russell's model gives a general estimate of affective state on a two-dimensional scale of emotional arousal and valence [
10]. In our experiment, we treat affective state as a marker of the effectiveness of acquiring knowledge via an LLM versus curated text.
In this study, we investigate the near-term effects of LLMs on learning performance through the analysis of Electrodermal Activity (EDA). EDA, which measures the electrical conductance of the skin in response to sweat secretion, is intricately linked to the sympathetic nervous system and offers valuable data on mental and physiological states [
11]. Hence, EDA is an important tool for assessing physiological responses, as it provides non-invasive, near real-time insight into human stress and emotional arousal, and thus into the cognitive and emotional stresses of short-term learning. Because students tend to react most strongly when encountering new information for the first time, rapid fluctuations in EDA peaks and troughs are more likely to be captured. By capitalizing on this, short-term studies can provide meaningful empirical evidence that longitudinal studies fail to capture, since students gradually become acclimatized to the LLM and the subjects at hand.
In summary, the rapid adoption of AI-generated content in education necessitates a deeper understanding of its impact on students’ cognitive and emotional states. By using EDA as a non-invasive method to monitor students’ stress and arousal during learning activities, this research aims to provide empirical data on how AI-generated texts compare to human-written texts.
This research involves the use of low-cost, wearable devices, such as a biometric wristband, to measure EDA alongside other physiological indicators in students. These devices have been validated against research-grade equipment to ensure accuracy and reliability [
12]. The research involves a controlled experiment where students engage with educational content produced both by LLM and human authors, allowing for a direct comparison between learning from AI-generated texts and human-written texts. This approach allows for a nuanced analysis of the interplay between educational content and student arousal.
Our research seeks to demonstrate the potential of EDA data in enhancing our understanding of educational practices. By comparing the physiological impacts of AI-generated and human-written texts, we aim to contribute valuable insights that can inform the design of future educational tools and strategies. The findings could have significant implications for the integration of AI in educational settings, ultimately leading to improved learning experiences for students.
2. Background
Critchley and Nagai (2012) [
13] define electrodermal activity (EDA) as changes in skin resistance or electrical potential due to neural effects on sweat glands. Comprising tonic (skin conductance level, SCL) and phasic (skin conductance response, SCR) elements, EDA is closely linked to human stress and emotions, and can be measured non-intrusively [
11]. This makes it a practical tool for assessing stress and emotions outside of laboratory settings.
Separately, a 2022 study by Leblanc and Posner found that learning involves both cognitive and emotional factors [
14]. Important factors in learning, such as attention [
15] and memory [
16], have been shown to be tied to emotional variation. Given that EDA and learning are both correlated with the same physiological factors, a relationship between the two is likely.
Analysis of an EDA signal requires the calculation of certain key features. In this study, we use a wide range of features in order to study potential differences in EDA across various learning methods.
This study used the number of non-specific skin conductance responses (NSSCR) per minute, as previous research links NSSCR increases to emotional arousal [
17] and, when combined with negative valence, to emotional stress [
18].
The study also included Mean Skin Conductance Level (SCL) and Skin Conductance Response (SCR), which reflect tonic and phasic changes in skin conductance, respectively. Research has found that SCL is an indicator of cognitive load [
19], while the amplitude of SCR tends to rise with increased emotional valence when individuals view visual stimuli [
20].
TVSymp is a sensitive metric of sympathetic activity derived from time-frequency spectral analysis of EDA data [
21], and has demonstrated high sensitivity to orthostatic, cognitive, and physical stress [
22]. EDASymp, based on power spectral analysis [
23], also effectively detects stress and performs comparably to traditional time-domain measures such as SCL and NSSCR.
Based on the preceding arguments, we hypothesize that participants in the LLM condition will show higher values for each of the five features (NSSCR, Mean SCL, Mean SCR, TVSymp, EDASymp).
3. Methodology
3.1. Materials Used
A total of 44 participants were recruited through convenience sampling from the peer network of the authors. The participants were aged 13 to 16 with an equal mix of genders.
Sustainability education was chosen as the topic to be learned via the two learning methods because most participants were unfamiliar with it, reducing the likelihood of bias arising from prior knowledge of the topic.
A passage of prose about the topic was prepared without the assistance of Generative AI prior to the experiment. It was prepared in the tone of an academic article. The handout is included as
Appendix A.
For AI-generated text, we used the ChatGPT 4o API as the LLM, together with our own custom Web App for easier experimentation. The system prompt given to ChatGPT is included as
Appendix B. We used the CO-STAR framework when designing the prompt so that the responses given by the LLM would be helpful and relevant to the topic [
24]. The human-written text was included in the prompt as a guide.
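For illustration only, a minimal sketch of how the Web App's backend call to the model might look is given below, assuming the OpenAI Python SDK; the function name and the CO_STAR_SYSTEM_PROMPT placeholder (standing in for the actual prompt in Appendix B) are ours, not the study's exact implementation.

```python
# Hypothetical sketch of the Web App's call to the model (not the exact implementation).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CO_STAR_SYSTEM_PROMPT = "..."  # placeholder; the actual CO-STAR prompt is in Appendix B


def ask_llm(question: str, history: list[dict]) -> str:
    """Send a participant's question to the model, keeping prior turns for context."""
    messages = [{"role": "system", "content": CO_STAR_SYSTEM_PROMPT}]
    messages.extend(history)  # earlier user/assistant turns, if any
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```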
3.2. Data Collection
EDA data were collected from participants using DIY sensors designed and assembled based on [
25]. The EDA circuit follows DC-EXM (exosomatic methodology), where an external current of 3.3 V is applied for measurement of skin resistance, with a sampling rate of 51.2 Hz. Further information and validation of the sensors against research-grade equipment can be found in [
12]. The design of this DIY EDA device follows the design principles of the wearables from the company Shimmer, which have been used in numerous experiments with positive results.
The sensor was attached to participants’ non-dominant hand, as shown in
Figure 1, to allow unrestricted and comfortable movement of the dominant hand during the experimental activities. Participants were additionally instructed to minimize movement of the non-dominant hand to reduce motion-related artifacts. This placement was chosen primarily for convenience, as skin conductance is theoretically comparable between the two hands. Dry Ag/AgCl electrodes 0.80 cm in diameter were used for EDA measurement. For photoplethysmography (PPG), a pre-assembled sensor module (XD-58C) with a sampling rate of 50 Hz was used.
Data collected during the experiment were split based on the timings outlined in the procedure and analyzed following the processes described in the Data Analysis section below. All data were anonymized, and no personally identifiable information was collected.
In addition to EDA data, we also wished to investigate the efficacy of each learning method, as well as the mental states of the participants after each method. A quiz was used for the former, while Self-Assessment Manikin (SAM) evaluations were used for the latter. SAM is a "non-verbal pictorial assessment technique that directly measures the pleasure, arousal, and dominance associated with a person's affective reaction to a wide variety of stimuli" [
26]. According to Bynion and Feldner, SAM is effective at directly assessing the pleasure, arousal, and dominance associated with a response to an event, which reduces the resources otherwise needed to measure additional variables at higher resolution [
27]. Furthermore, SAM can be used universally as language barriers can be overcome when emotions are represented by pictures [
28].
To assess participants’ understanding of the topic after learning, a quiz was prepared prior to the experiment. The quiz consisted of six MCQs (Multiple Choice Questions) and a SAM evaluation. The quiz was administered to participants via a physical hardcopy. To guide participants towards getting the correct answers for the MCQs, the scope of the questions was included in both the human-written text handout (
Appendix A) and the LLM prompt (
Appendix B).
A separate hardcopy containing only the SAM evaluation was also prepared.
3.3. Procedure
The flow of the experiment follows the schedule in
Figure 2 below.
Sensors were attached to the fingers of the participant’s non-dominant hand before the experiment and removed only after it ended.
The experiment measured baseline EDA at the start, after the quiz, and at the end, alongside EDA during learning. This enabled normalization between participants. Baseline periods ranged from 120 to 180 s (2–3 min).
The experiment included two 8-min learning windows in which participants learned about the topic via one of the two learning methods (AI-generated text or human-written text). For human-written text, the hardcopy handout (
Appendix A) was given out; no annotation was allowed. For AI-generated text, participants each used a laptop to interact with the LLM (see
Appendix B), asking additional questions as needed. EDA was monitored throughout so that changes during learning could be analyzed. Any shift in sentiment between learning methods would then be captured through the EDA data and the self-reported questionnaire data.
All participants had to learn from both methods, and the order of the two learning methods was randomized across participants. While the asymmetry between the LLM-based learning condition and the curated-text condition is acknowledged, this design choice was intentional and aligned with the primary objective of the study, which is to compare traditional self-directed text-based learning with learning mediated by LLMs. Although classroom learning typically allows opportunities for clarification through teacher interaction, replicating such facilitation would introduce additional uncontrolled variables, including variability in question type, timing, and relevance across participants, and potential disruptions to reading flow and experimental duration. In the LLM condition, participants were not required to engage in deep or exploratory questioning; instead, they retained the option to request information verbatim, closely approximating textbook reading. Any further engagement with the LLM was therefore contingent on individual participant attitudes, behaviors, and emotional states at the time of learning. This design also reflects a realistic self-learning scenario in which students study independently without direct instructional support. Within this context, the study seeks to examine whether access to an LLM confers learning or affective advantages over static text, and to what extent such differences can be quantified and explained using features derived from electrodermal activity (EDA) data.
After the first learning window, the quiz was administered to participants, regardless of the learning method they used. The resultant score of the quiz serves as a measure of participants’ understanding of the topic. This was analyzed to quantify the efficacy of the learning method. The SAM evaluation in the quiz also allowed us to record the participants’ mental and emotional status.
After the second learning window, a second SAM evaluation was performed. This allowed for the analysis of mental and emotional status after each of the learning methods.
3.4. Data Analysis
The EDA data was processed in Python 3.12, resampled to 25 Hz, and outliers (absolute z-score > 3) were removed. Data normalization and a low-pass Butterworth filter (1.5 Hz, 8th order) reduced artifacts, or unwanted electronic noise [
29]. Further artifact detection and removal followed the LSTM-CNN approach by [
30].
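As a rough illustration, the preprocessing described above could be implemented along the following lines, assuming the raw signal is a NumPy array sampled at 51.2 Hz; the LSTM-CNN artifact-removal stage of [30] is omitted, and function and variable names are ours, not the study's actual code.

```python
# Illustrative preprocessing sketch (resample, outlier removal, normalization, filtering);
# the LSTM-CNN artifact-removal stage described in the text is not shown here.
import numpy as np
from scipy.signal import resample, butter, sosfiltfilt

RAW_FS = 51.2   # DIY sensor sampling rate (Hz)
TARGET_FS = 25  # resampling target (Hz)


def preprocess_eda(raw: np.ndarray) -> np.ndarray:
    # Resample from 51.2 Hz to 25 Hz
    eda = resample(raw, int(len(raw) * TARGET_FS / RAW_FS))

    # Remove outliers with absolute z-score > 3 and interpolate over the gaps
    z = (eda - eda.mean()) / eda.std()
    keep = np.abs(z) <= 3
    idx = np.arange(len(eda))
    eda = np.interp(idx, idx[keep], eda[keep])

    # Normalize, then apply an 8th-order low-pass Butterworth filter at 1.5 Hz
    eda = (eda - eda.mean()) / eda.std()
    sos = butter(8, 1.5, btype="low", fs=TARGET_FS, output="sos")
    return sosfiltfilt(sos, eda)
```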
The data was segmented into 2–3 min windows matching the baseline length to confirm that EDA feature changes persisted during the learning phase.
Five EDA features were extracted per window: NSSCR/min, SCR, SCL, TVSymp, and EDASympn. NSSCR/min was calculated using a peak detection algorithm from [
31]. SCR and SCL were derived via cvxEDA by Greco et al. (2016), separating tonic and phasic components [
32]. TVSymp was computed as the mean spectral amplitude in the 0.08–0.24 Hz band using methods from [
21]. EDASympn was obtained by downsampling to 2 Hz and applying the normalized power spectral density approach of Posada-Quintero et al. (2016b) [
23].
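Two of these features can be sketched in simplified form as follows, assuming a preprocessed 25 Hz signal. The peak-prominence threshold, the window length, and the EDASympn frequency band are assumptions on our part (the band follows the EDASymp literature rather than this text), and the cvxEDA decomposition and TVSymp computation are omitted because they follow the cited implementations.

```python
# Simplified, illustrative feature extraction; thresholds, window length, and the
# EDASympn frequency band are assumptions, not the study's exact parameters.
import numpy as np
from scipy.signal import find_peaks, welch, resample

FS = 25  # Hz, sampling rate after preprocessing


def segment(eda: np.ndarray, window_s: float = 150.0) -> list[np.ndarray]:
    """Split the learning-phase signal into windows matching the baseline length (2-3 min)."""
    n = int(window_s * FS)
    return [eda[i:i + n] for i in range(0, len(eda) - n + 1, n)]


def nsscr_per_min(eda: np.ndarray, min_prominence: float = 0.01) -> float:
    """Count non-specific skin conductance responses per minute via peak detection."""
    peaks, _ = find_peaks(eda, prominence=min_prominence)
    return len(peaks) / (len(eda) / FS / 60.0)


def edasympn(eda: np.ndarray, band: tuple = (0.045, 0.25)) -> float:
    """Normalized sympathetic power: downsample to 2 Hz, then take the share of
    spectral power in a low-frequency band (band limits assumed here)."""
    eda2 = resample(eda, int(len(eda) * 2 / FS))
    f, pxx = welch(eda2, fs=2.0, nperseg=min(128, len(eda2)))
    in_band = (f >= band[0]) & (f <= band[1])
    return float(pxx[in_band].sum() / pxx.sum())
```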
Data from the LLM and Text learning phases were normalized by subtracting baseline values. Each feature was tested with a clustered Wilcoxon signed-rank test using the Rosner–Glynn–Lee (RGL) method via the ‘clusrank’ package in R [
33], accounting for clustering caused by segmentation.
Data from the SAM were compared row-by-row, while data from the quiz were compared directly between conditions. A Wilcoxon signed-rank test was used for both to account for the non-normality of the data. The clustered Wilcoxon test was implemented in R using the clusrank package, specifically employing the Rosner–Glynn–Lee (RGL) method for clustered paired data. Participants were treated as independent clusters, with multiple observations nested within each participant. The RGL method adjusts the variance of the Wilcoxon test statistic to account for within-cluster dependence, thereby providing valid inference under correlated observations. For comparison, standard paired Wilcoxon signed-rank tests were additionally conducted as a robustness check; their results can be found in
Appendix C.
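The clustered RGL test itself was run in R with the clusrank package; as an illustrative sketch only, the per-participant baseline subtraction and the standard paired Wilcoxon robustness check could be reproduced in Python as follows (variable and function names are hypothetical).

```python
# Sketch of baseline normalization and the simple paired Wilcoxon robustness check;
# the clustered RGL test reported in the paper was run in R via 'clusrank'.
import numpy as np
from scipy.stats import wilcoxon


def baseline_normalize(feature_vals, baseline_vals) -> np.ndarray:
    """Subtract each participant's baseline feature value from the learning-phase value."""
    return np.asarray(feature_vals, float) - np.asarray(baseline_vals, float)


def paired_wilcoxon(llm_vals, text_vals, alternative="greater"):
    """One-sided paired Wilcoxon signed-rank test across participants.

    'alternative' matches the one-sided hypotheses in the Results:
    "greater" for SCR/NSSCR/TVSymp/EDASymp, "less" for SCL.
    """
    return wilcoxon(np.asarray(llm_vals, float), np.asarray(text_vals, float),
                    alternative=alternative)


# Hypothetical usage (not study data):
# stat, p = paired_wilcoxon(llm_scr, text_scr, alternative="greater")
# stat, p = paired_wilcoxon(llm_scl, text_scl, alternative="less")
```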
4. Results
From a total of 44 participants, the final dataset comprised 23 samples of EDA data (with 21 samples discarded due to hardware faults), 42 samples of quiz results, and 39 samples of SAM evaluations (two samples and five samples discarded, respectively, due to participant incompletion of tasks). Data loss resulted from logistical constraints rather than participant-related factors. Consequently, the missing data are assumed to be missing at random (MAR). Under this assumption, additional sensitivity analyses were not conducted.
Figure 3,
Figure 4,
Figure 5,
Figure 6 and
Figure 7 below present the respective Kernel Density Estimation plots for each of the five features under investigation, depicting learning from curated text versus learning from Large Language Models.
Since every feature except SCL extracted from the EDA data collected while participants learned with the LLM was more right-skewed, we tested the alternative hypothesis that the normalized EDA values for LLM are of greater magnitude than those for Text.
Table 1 depicts how a clustered Wilcoxon signed-rank test showed that individuals using LLM did elicit a statistically significant (
p < 0.1) increase in normalized SCR in comparison to individuals using Text (Z = 1.3163,
p-value = 0.09404). Conducting the same test on the remaining features (NSSCR, TVSymp and EDASymp) showed no statistically significant difference between LLM and curated text at the significance level of
p < 0.1.
For SCL, we tested for the alternative hypothesis that normalized EDA data for LLM is of a lesser magnitude than that of Text.
A clustered Wilcoxon signed-rank test showed that individuals using LLM did elicit a statistically significant decrease in normalized SCL in comparison to individuals using Text (Z = −1.3122, p-value = 0.09473).
For the quiz, individuals on average performed better (Median = 4.0) using the LLM than using Text (Median = 3.0). A Wilcoxon signed-rank test showed that this improvement is statistically significant (W = 30.0, Z = −2.04289, p-value = 0.02053).
Due to the exploratory nature of this study, this more lenient
p threshold was adopted.
Table 2 depicts how effect sizes indicated small associations for SCR (
r = 0.25) and SCL (
r = −0.25), and a moderate association for quiz performance (
r = −0.49). Confidence intervals were wide, further reflecting limited power, and results should be interpreted as preliminary.
Table 3 shows that, for the SAM, individuals on average gave the same responses under both conditions.
Table 4 presents the results of running a Wilcoxon signed-rank test on these values.
As p-values from the Wilcoxon signed-rank test are all higher than the level of significance, there is insufficient evidence to conclude that there is any difference between the SAM results.
Interestingly, quiz performance after learning via the LLM was, on average, better than after learning via curated text, with the LLM condition having a higher median and minimum score, as seen in
Figure 8 below.
5. Discussion
Through this study, we have analyzed the differences in EDA between LLM and Text in learning, aiming to find differences in learners’ cognitive and emotional stresses.
Given that NSSCR is linked to emotional arousal but did not differ significantly between conditions, the stronger basis for inferring higher emotional arousal during LLM learning is the significant increase in SCR. Even so, when compared against the self-reported emotional arousal from the SAM, the results are too inconsistent to support a concrete conclusion. This may be due to the discrete values of the SAM scale, externalities during the experiments, or variations in preconceptions, learning styles, and disciplines (e.g., how students felt towards the topic of sustainability). Nonetheless, the inferences drawn from SCR and SCL (more positive emotional valence and lower cognitive load) align with the study by Sun and Zhou (2024), which suggested that Gen-AI could be highly beneficial for students' learning [
34]. Our results also provide empirical evidence consistent with another study, by Wu and Yu (2024), which suggested that AI can enhance learning outcomes in the short term by boosting students' confidence and motivation [
35].
In contrast, other studies state otherwise. A 2025 study by Gerlich on AI reliance and critical thinking proposed that as people rely more on AI technology, they become less likely to engage in critical thinking, as they can offload cognitively demanding tasks to AI [
5]. This behavior, referred to as cognitive offloading, is detrimental and ineffective in boosting learners' efficacy. In our SAM results, there was virtually no difference along any dimension of the sentiment scale, which provides no evidence that Generative AI brings perceived improvements in learning new knowledge.
Furthermore, our results suggest that while there are differences in students' engagement and stress between educational content delivered by LLMs and curated text, the only statistically significant differences were in emotional arousal (p = 0.094, via analysis of SCR) and cognitive load (p = 0.095, via analysis of SCL). Hence, one notable finding is that LLMs elicited more positive emotions, while Text imposed a higher cognitive load on learners. To strengthen this effect, learners should aim for a higher degree of agency and active involvement with AI by using prompts that elicit active engagement (and thus, emotional response). For example, rather than asking for direct answers, learners can ask generative AI for thought-provoking ideas that challenge their understanding of the topic. With a lower cognitive load, learners are less stressed and can therefore absorb information better. This corroborates the importance of using AI correctly, with careful scaffolding to support learner agency and participation. While these are revealing insights into observable differences between Generative AI and curated text, the overall landscape of our results leaves it inconclusive whether Generative AI is a superior alternative to curated text for learning. Thus, we recommend that Generative AI still be used in moderation and not be the sole method that learners rely on.
To elaborate, our research supports the recommendations laid out in the UNESCO AI Competency Framework for Teachers. As Generative AI in our research was shown to be only marginally better, with little to no additional perceived effectiveness compared to traditional means of learning, teachers' role in education cannot be replaced: they remain an accurate source of information, and Generative AI remains a tool that complements existing methods of studying. There are also several other implications for the use of AI in teaching and learning. Firstly, the use of AI must be sustainable, and stakeholders must think critically before incorporating AI, as over-reliance on it can have detrimental consequences for learners. Furthermore, AI should be approached as a means to empower the roles of teachers and students in education. Teachers are still positioned as the main facilitators who guide students in discerning the knowledge given by AI outputs versus traditional texts or sources of information. This aligns with keeping human agency and critical thinking skills central to learning. Finally, a balance should be struck between the roles of humans and AI in cultivating and teaching knowledge, because while AI can enhance accessibility and explain content in a personally tailored manner, the initial source of information should be curated carefully and accurately by a human. Overall, moving forward, the use of AI in education must ensure greater safety, accountability, and inclusivity. Teachers must also keep up with the emerging capabilities of AI to harness its power for education effectively.
We readily acknowledge the following limitations of our approach. Firstly, the level of significance of 0.1 chosen for the clustered and standard Wilcoxon signed-rank tests may be too lenient to establish statistical differences for each feature. Secondly, one of the inherent weaknesses of EDA data is its low resolution, despite the ease of collection. Moreover, while features extracted from EDA data can be reliable indicators of emotional arousal, they are less dependable for analyzing emotional valence. To reduce measurement artifacts, the sensors were placed on the fingers of the non-dominant hand of each participant, in accordance with the protocols of previous studies [
25,
36]. We also assumed participants learned at a uniform pace; the fixed time windows in the procedure were a consequence of this assumption, and we acknowledge that this may have masked individual differences in engagement, effort, and emotional responses. In future work, alternative sources of data, such as Electroencephalography (EEG) or Facial Action Units (FAU), can be explored to complement the existing ones. By using multiple data sources and cross-checking results on affective state and mental load, more conclusive results can be obtained on the effectiveness of Generative AI relative to traditional learning means such as curated text. We also acknowledge the shortcomings of our sample in terms of size, age, gender, and nationality. Due to resource constraints, participants in the text-learning condition were not given access to a human learning partner; we acknowledge this asymmetry between conditions, but note that it parallels self-study settings in which no human pedagogue is present. Given the limited number of samples, the narrow age range (13–16), and the significant differences in cognitive, emotional, and behavioral patterns across age groups, the results can only be considered exemplary of this group of adolescent learners. In future work, these external factors will be explored for greater scope and robustness.
In the future, as more advanced LLMs are trained and compositional learning is better demonstrated, we can expect greater changes in the effectiveness of Generative AI in facilitating students' learning of new knowledge. Future Generative AI models will be able to comprehend human language and images better, further refining their responses to users' prompts [
37].
6. Conclusions
By way of tentative conclusion, with the rise in popularity of Generative AI as a tool to assist teaching and learning, our study contributes valuable empirical evidence to elucidate the degree of effectiveness of Generative AI compared to traditional methods of learning, such as curated text. This is especially crucial because ample, data-driven evidence is needed for teachers, policymakers, and other relevant stakeholders to make informed and grounded decisions regarding the planning, adoption, and implementation of Generative AI in education. Our findings are particularly significant in the current context, where there is growing interest in integrating AI tools into educational practices, often without sufficient critical evaluation of their actual benefits and limitations. As such, we aim to complement existing studies on Generative AI and education that employ observational and qualitative methods.
In our study, we employed EDA features and self-reported data as measures of learning effectiveness and conducted statistical analyses to draw meaningful observations. To summarize, we found that learning with a large language model (LLM) resulted in greater Skin Conductance Response (p = 0.09404), indicative of greater emotional arousal, and lower Skin Conductance Level (p = 0.09473), suggesting reduced cognitive load, compared to curated texts. Moreover, our data revealed that learning with an LLM correlates with higher quiz performance (p = 0.02053), highlighting a potential benefit of Generative AI in enhancing test-based outcomes. However, as the rest of the features (NSSCR, TVSymp, and EDASymp) showed no statistically significant differences between learning via LLM and curated texts, our initially stated hypothesis was rejected.
Furthermore, concerning the results yielded by the SAM (self-reported data), participants generally did not perceive that LLMs significantly aided their learning compared to curated texts when acquiring new knowledge. Taken together with the rejected hypothesis, we conclude that while Generative AI may show slight advantages in specific areas, its overall benefit for learning new knowledge is marginal compared to traditional study methods such as curated texts. Likewise, while the findings suggest that AI usage may improve well-being, the methodology's limitations, such as the non-representative sample and the specific context of the study, mean that caution should be exercised when applying these conclusions to broader populations or different settings. This suggests that Generative AI is not a definitive substitute for curated educational materials but rather a complementary tool that needs careful contextualization in learning environments.
Based on these caveats and results, we emphasize the importance of continuing research in this area to build a more comprehensive understanding of the effectiveness (or otherwise) of Generative AI in teaching and learning. Given that our work is an exploratory and preliminary study, future studies should aim to refine methodologies, incorporate diverse measures of learning outcomes, and explore how specific contexts, subjects, or demographics may influence the utility of Generative AI. With a strong foundation of empirical evidence, education stakeholders will be better equipped to integrate AI tools responsibly and effectively into teaching and learning practices.