Boundary Conditions for AU-Based Detection of Understanding: A Literary Analysis Study

Lazic, Milan; Woodruff, Earl

doi:10.3390/electronics15102059


by Milan Lazic * and Earl Woodruff

Ontario Institute for Studies in Education, University of Toronto, Toronto, ON M5S 1A1, Canada

* Author to whom correspondence should be addressed.

Electronics 2026, 15(10), 2059; https://doi.org/10.3390/electronics15102059

Submission received: 21 April 2026 / Revised: 5 May 2026 / Accepted: 9 May 2026 / Published: 12 May 2026

(This article belongs to the Special Issue AI-Driven Advanced Signal Processing: Theory, Methods and Applications)


Abstract

Large Language Models are increasingly being used by students in academic contexts, but they can only evaluate and engage with what students express in language. The feeling of understanding is inaccessible to them directly. This matters because the feeling of understanding shapes how students judge their understanding and guides their learning. Feelings have a physiological basis and can therefore be measured through facial action units. This study explored whether action unit patterns are associated with nascent understanding, misunderstanding, confusion, emergent understanding, deep understanding, and underconfidence as 198 participants completed a literary analysis task while their facial expressions were recorded over Zoom. CatBoost and logistic regression models demonstrated limited ability to discriminate phases at the population level, and within-person differences between phases were modest and inconsistent across participants. The findings highlight the difficulty of measuring the feeling of understanding in naturalistic academic contexts and may suggest that the feasibility of AU-based phase detection depends in part on the extent to which phases can be specified with temporal and conceptual precision.

Keywords: understanding; facial action units; machine learning; literary analysis; large language models

1. Introduction

To what extent do Large Language Models (LLMs) help students develop understanding? This is an important question given the increasing use of LLMs in academic contexts [1,2] and the central role of understanding in learning [3]. The answer depends partly on what understanding means and what LLMs are capable of. When these are considered together, a fundamental limitation becomes apparent.

Understanding can be described in two ways. The first treats understanding as propositional, that is, as something that is expressed and recognized through language. David Olson [3,4] offers a clear account of understanding in these terms. Drawing on philosophy, psychology, and linguistics, he argues that understanding is best characterized as a linguistic practice governed by shared criteria for correct use. To understand something, on this view, is to know when it is appropriate to say one understands, using publicly available standards such as truth, coherence, and relevance. It depends on possessing the concepts and grammar that allow one to ascribe understanding to oneself or to others. One does not first understand and then describe that understanding in words. Rather, understanding consists of knowing how to use the concept “understand” correctly within a shared linguistic community. Olson emphasizes that this ability is not acquired through introspection. People come to realize what it means to understand by learning the conditions under which claims of understanding are accepted, questioned, or withdrawn. In this sense, understanding is something one can claim, justify, and revise. It is inherently public and accountable.

Olson [3] also describes the feeling of understanding as the lived, embodied sense that something makes sense. This feeling plays a central role in how individuals judge their understanding and guide their learning. When someone feels they understand, they are more likely to persist, apply what they know, and act with confidence. When that feeling is absent, they are more likely to hesitate, seek clarification, or revise their thinking. Importantly, the feeling of understanding can come apart from propositional understanding. A person may hold a position with strong conviction while relying on concepts that are false or inconsistent, or may explain an idea accurately yet lack confidence or feel unable to move forward. The feeling of understanding therefore reflects how knowledge is being integrated and used in ways not captured by what someone says alone. From a humanistic perspective on learning, this matters considerably. Dewey argued that learning reorganizes experience, changing how individuals relate to the world [5]. Bruner similarly emphasized that learning involves seeing relationships among ideas in ways that support judgment and action [6]. On these accounts, understanding is not complete when a correct answer can be given, but when the learner experiences a shift in how something makes sense and can be used to orient thinking and behaviour.

The feeling of understanding is not uniform but likely varies across distinguishable phases through which understanding develops [7,8]. Six such phases can be identified: nascent understanding, misunderstanding, confusion, emergent understanding, deep understanding, and underconfidence. These phases do not form a fixed or linear sequence; learners may move between them, revisit earlier ones, or bypass others as their understanding develops. Consider, for example, a learner encountering a new concept for the first time. In nascent understanding, they lack a coherent framework for interpreting it. Information may be available to them, but it is not yet organized in a way that supports determining what is relevant or how ideas are related. As engagement continues, misunderstanding may emerge, in which the learner constructs an interpretation that feels right but is conceptually incorrect or incomplete. Confusion may arise when this interpretation begins to break down. New information, counterexamples, or failed applications expose inconsistencies in the learner’s understanding, producing cognitive tension or disequilibrium. If the learner persists, emergent understanding may follow. In this phase, the learner revises their interpretation by identifying relationships among ideas and integrating them into a more coherent account. Understanding improves, but it remains tentative. Deep understanding reflects a more stable and integrated grasp of the concept. At this stage, the learner can apply ideas flexibly, explain relationships between them, and recognize when and why they are appropriate across contexts. The phase of underconfidence completes the picture: the learner may arrive at a correct interpretation yet feel uncertain about it, such that their understanding is more developed than it feels. Each of these phases is cognitively distinct and likely accompanied by a characteristic feeling that shapes how the learner thinks, behaves, and proceeds in their learning.

The distinction between propositional and felt understanding reveals what LLMs can and cannot do [9,10]. LLMs are well-suited to engaging with propositional understanding. Because they are trained on vast amounts of text, LLMs implicitly learn how understanding is expressed and recognized in language within specific domains and tasks [11]. They learn what counts as an acceptable answer and explanation, and how coherence and relevance are signalled. As a result, systems such as ChatGPT can effectively engage with and assess this form of understanding. However, since these systems only operate through linguistic input and output, they cannot access how understanding is felt and experienced directly [3]. In other words, these systems can only access the feeling of understanding through a medium that both alters and obscures it. As a result, LLMs may judge responses as correct or support productive learning behaviours at face value, while remaining insensitive to whether understanding is tentative, misplaced, or absent. This can allow students to appear competent while bypassing the integration that gives learning meaning and durability [12], or lead LLMs to move students forward when consolidation or reflection is needed. This limits the degree to which LLMs can support understanding.

Addressing this limitation requires a measure of understanding beyond self-report, which depends on how well someone can access and put their experiences into words, and interrupts the process it seeks to measure [13,14]. One approach draws on the somatic marker hypothesis, which is most closely associated with the work of Antonio Damasio. Damasio [15] argues that feelings arise when physiological changes are represented in the nervous system and enter conscious experience. This view belongs to a broader tradition, from William James’s early claim that feelings arise from the perception of bodily changes [16] to more recent work in affective neuroscience identifying how information about the body’s physiological state is integrated into conscious experience [17,18]. Together, this work indicates that feelings have a physiological basis and can therefore be measured through physiological activity. Facial expressions are one way such activity manifests and may be decomposed into action units (AUs) corresponding to specific muscle movements [19]. Decades of research show that AUs track changes in arousal, effort, and psychological states [20,21,22] and can be measured unobtrusively in real-time using tools such as OpenFace [23].

In learning contexts, most AU-based work relevant to understanding has focused on confusion rather than understanding itself. Confusion has been examined as a cognitive-affective state that can arise during complex learning, often in relation to disequilibrium, impasse, and uncertainty [21]. Facial expression studies have linked confusion and closely related learning states to AUs such as brow lowering and lid tightening, particularly AU4 and AU7 [24,25,26]. This literature provides an important foundation for the present study because confusion is one phase in the development of understanding. However, it does not establish whether facial activity can distinguish broader phases of understanding, such as nascent understanding, misunderstanding, emergent understanding, deep understanding, or underconfidence.

Although this line of reasoning supports the possibility that phases of understanding may have a measurable physiological basis, constructionist theories caution against assuming that any psychological state has a single facial or physiological signature. From this perspective, categories such as confusion or understanding are not expressed through fixed bodily patterns. Rather, they are constructed from variable bodily, contextual, and conceptual processes, such that the same psychological category may be realized differently across individuals and situations [27,28]. This view does not imply that facial activity is irrelevant to the feeling of understanding. Instead, it suggests that AU patterns may be probabilistic, context-dependent, and person-specific rather than invariant. The extent to which phases of understanding can be detected through facial activity may therefore depend on the context in which they are studied. In well-defined tasks, phases may unfold in ways that are more temporally distinct and conceptually separable, making them easier to operationalize and observe. In more open-ended tasks, however, learners may move more gradually among uncertainty, partial insight, reinterpretation, and confidence, such that phases overlap rather than appear as clearly bounded states. If so, the precision with which phases can be identified may influence the extent to which corresponding physiological distinctions can be detected.

A recent study by Lazic et al. [7] provided initial evidence that AU patterns can distinguish phases of understanding. In that study, participants worked through riddles while their facial expressions were recorded on video. Phases were operationalized through a combination of answer correctness, certainty ratings, response latency, and observational coding, and AU patterns associated with each phase were identified using supervised machine learning. Class separation was high across phases, suggesting that the feeling of understanding may have a measurable physiological basis that varies across phases of understanding. At the same time, the task used in that study was well-defined, which may have allowed phases to be identified with greater temporal and conceptual clarity than would be possible in more open-ended academic contexts.

Moreover, the study has three important limitations. First, the data were cleaned to minimize noise, leaving it unclear how well the observed class separation would hold in more naturalistic settings. Second, the phase of misunderstanding was not analyzed. Finally, the analyses were based on population-level patterns. Given ongoing debates about whether psychological states have physiological “fingerprints” or vary across individuals and contexts [27], a within-person analysis is needed to determine whether AU patterns remain distinct across phases within individuals or show meaningful overlap.

The present study aims to address these limitations and extend the exploratory work of Lazic et al. [7] by examining whether AU patterns distinguish phases of understanding as participants work through a literary analysis task. In doing so, the study tests not only whether phase-related AU patterns generalize to a more naturalistic academic context, but also whether such detection remains feasible when phases are less sharply bounded than in well-defined problem-solving tasks. This is important because the precision with which phases can be identified may affect the extent to which physiological distinctions can be observed. Accordingly, the present study provides an opportunity to examine both the promise and the limits of using facial activity to measure the feeling of understanding.

2. Materials and Methods

2.1. Participants

Participants were recruited through University of Toronto classrooms, student-affiliated websites (e.g., the University of Toronto Psychology Student’s Association), online platforms (e.g., Facebook), and word-of-mouth referrals. Most participants (74%) were undergraduate students from the University of Toronto tri-campus. All recruitment procedures were approved by the University of Toronto Research Ethics Board.

2.2. Procedure

The study was conducted remotely via Zoom. Before beginning, participants were given instructions to ensure adequate video quality. They were asked to position themselves directly in front of their webcam with sufficient lighting, avoid facial obstructions, use a stable surface for their computer, refrain from using phones or tablets, and maintain a neutral background without movement or visual filters, among other instructions.

Participants then accessed a Qualtrics link to complete informed consent and a brief demographic questionnaire. After this, they read The Road Not Taken by Robert Frost and worked through eight questions. Participants were instructed to work through the questions silently to avoid displaying facial expressions unrelated to the task. Participants shared their screens throughout the task so that their progress could be monitored. All sessions were conducted by the first author using a standardized script and neutral feedback to ensure consistency across participants.

Participants had unlimited time to work on each question. Correct answers were provided aloud only after all eight questions had been completed, and only if participants wished to know them, because providing the correct answer after each question might have influenced responses to subsequent questions. Before moving on to the next question, participants were asked whether they already knew the answer to the current question so that these trials could be excluded from analysis.

Video recording began when participants started reading the poem and ended after they completed the last question. To minimize potential influence on participants’ behaviour, the first author’s video and audio were disabled during recording. After completing the task, participants were debriefed and entered a draw to win one of two CAD 500 gift cards.

2.3. Measures

2.3.1. Phases of Understanding

Six phases of understanding were measured using eight questions about Robert Frost’s poem The Road Not Taken (Appendix A). A literary analysis task was selected because it reflects a common learning situation and supports multiple interpretations varying in depth, enabling the elicitation of all phases of understanding from a single text. In contrast, a closed-ended task such as solving math problems would require several distinct problems to vary in difficulty and elicit all the phases. Controlling for prior knowledge in such tasks would also be more challenging because performance would depend heavily on participants’ baseline knowledge, increasing the risk that some would not be sufficiently challenged while others would be overwhelmed. In comparison, literary analysis depends less on specialized procedural knowledge, making differences in baseline proficiency less pronounced. The Road Not Taken was specifically selected because it is widely known and frequently misinterpreted, making it well-suited for eliciting misunderstanding. In addition, the poem’s short length minimized participant fatigue.

For each question, participants provided a written answer and justification. Participants also provided a rating indicating how certain they were in their answer. Certainty was rated on a three-point scale (1 = uncertain, 2 = neither, 3 = certain). A three-point scale was used to capture meaningful differences in certainty while reducing response burden and limiting overinterpretation of fine-grained scale distinctions [29]. The questions were designed to facilitate an even distribution of phase observations. Three questions were designed to elicit either emergent or deep understanding, depending on the depth of participants’ answers. Three questions were designed to be more challenging and thus elicit nascent understanding and confusion. Two questions were designed to elicit misunderstanding. Underconfidence was allowed to emerge naturally.

Because supervised machine learning was used to analyze the data, each observation included in the dataset had to be assigned a phase label prior to analysis [30]. Phase labels were determined using two sources: (1) written responses and certainty ratings, and (2) automated facial feature flagging with contextual verification for confusion, which represents a partial exception given prior empirical links between specific AUs and confusion. The response-based approach was adopted given the limited work linking AUs to phases of understanding beyond confusion, the low agreement among subjective raters in identifying phases (including preliminary analyses in which independent raters attempted to assign phase labels directly from AU patterns and showed agreement below chance), and its demonstrated success in Lazic et al. [7]. To determine correctness, responses were evaluated using three criteria. First, textual evidence: interpretations had to be supported by specific elements of the poem, such as word choice, structure, or thematic development. Second, literary consensus: interpretations were evaluated in relation to established scholarly readings to ensure they aligned with, or extended, widely recognized analyses [31,32,33]. Third, universality: interpretations were expected to resonate with shared human experiences or themes rather than reflect idiosyncratic or unsupported claims. Appendix B presents the codebook derived from these criteria, which was used to classify responses as incorrect, correct, or correct with depth and, alongside certainty ratings, to operationalize the phases below, consistent with Olson’s account of understanding as a publicly evaluable linguistic practice.

All phase labels were assigned by the first author. To assess the consistency with which the coding framework was applied, inter-rater reliability was evaluated on a subset of responses using an independent coder trained on the same codebook, yielding perfect agreement (Gwet’s AC1 = 1.00), indicating high reliability in the application of the coding criteria.
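
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of Gwet's AC1 for two raters and nominal categories. It is not the authors' code: the function name `gwet_ac1` and the example ratings are hypothetical, and the category list simply reuses the three codebook classes named in this section.

```python
from collections import Counter

# Codebook classes named in the text; passing them explicitly keeps the
# number of categories fixed even if a rater never uses one of them.
CODEBOOK = ["incorrect", "correct", "correct with depth"]


def gwet_ac1(rater_a, rater_b, categories=CODEBOOK):
    """Gwet's AC1 for two raters who each label the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")

    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: based on the average share of labels per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    shares = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]
    p_e = sum(pi * (1 - pi) for pi in shares) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Hypothetical ratings, for illustration only.
coder_1 = ["incorrect", "correct", "correct with depth", "correct"]
coder_2 = ["incorrect", "correct", "correct with depth", "correct"]
print(gwet_ac1(coder_1, coder_2))  # identical labels give an AC1 of 1.0
```

Identical labels from both coders give AC1 = 1.00 regardless of how the categories are distributed, which matches the value reported for the coded subset.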

Nascent understanding was operationalized as an incorrect answer or justification combined with uncertainty. Misunderstanding was operationalized as an incorrect answer or justification combined with certainty. Confusion was identified using automated facial feature flagging with contextual verification. After AUs were extracted from the Zoom recordings, candidate confusion intervals were flagged based on sustained elevations in brow lowering and lid tightening (AU4 and AU7). These AUs were selected to identify candidate intervals because they have been consistently linked to confusion and cognitive disequilibrium in previous studies, e.g., [21]. These intervals were then reviewed in context to confirm that they coincided with behavioural indicators of uncertainty or impasse—for example, participants explicitly stating that they were confused or rereading a contradictory section of the poem. Only intervals meeting these criteria were labeled as confusion. Emergent and deep understanding were both operationalized as a correct answer combined with certainty. They were distinguished based on the depth of the response. Deep understanding responses demonstrated integrated and well-developed interpretations supported by multiple elements of the text. Emergent understanding responses were less fully developed, typically reflecting a more surface-level or partially articulated interpretation. Lastly, underconfidence was operationalized as a correct answer combined with uncertainty. If participants provided a correct or incorrect response but indicated they were neither certain nor uncertain in their response, a phase label was assigned based on correctness to avoid loss of data.
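
The response-based rules above can be sketched as a small lookup function. This is an illustrative sketch rather than the authors' implementation: the name `assign_phase` is hypothetical, a "correct with depth" response rated uncertain is assumed to count as underconfidence, and because the text does not state which label a "neither" rating received, the function leaves that case unresolved instead of guessing. Confusion is not covered because it was labeled from facial flagging rather than from responses.

```python
CORRECTNESS = {"incorrect", "correct", "correct with depth"}  # codebook classes
CERTAINTY = {1: "uncertain", 2: "neither", 3: "certain"}      # three-point scale


def assign_phase(correctness, certainty):
    """Map a coded response and its certainty rating to a phase label."""
    if correctness not in CORRECTNESS:
        raise ValueError(f"unknown correctness code: {correctness!r}")
    if certainty not in CERTAINTY:
        raise ValueError(f"unknown certainty rating: {certainty!r}")

    if certainty == 2:
        # The text says a label was assigned from correctness alone, but it
        # does not give the mapping, so this sketch returns None.
        return None

    is_certain = certainty == 3
    if correctness == "incorrect":
        return "misunderstanding" if is_certain else "nascent understanding"
    if not is_certain:
        # Assumption: any correct response (with or without depth) plus
        # uncertainty is treated as underconfidence.
        return "underconfidence"
    if correctness == "correct with depth":
        return "deep understanding"
    return "emergent understanding"


print(assign_phase("incorrect", 3))           # misunderstanding
print(assign_phase("correct with depth", 3))  # deep understanding
```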

This study adopted a naturalistic approach to identifying phases. Phases were identified within the broader flow of responses, and instances were not excluded if noise was present. Although this approach introduced greater variability, it provided a test of whether AU patterns associated with each phase could be detected in a less controlled setting.

2.3.2. Action Units

AUs were extracted from the Zoom video recordings using OpenFace (version 2.2.0) [23], an open-source computer vision toolkit that automatically codes facial muscle activity based on FACS. OpenFace detects 17 AUs and provides frame-level estimates of both AU intensity (0–5) and presence (0 = absent, 1 = present). These frame-level values were aggregated into nonoverlapping 2 s windows to capture sustained facial activity. Within each window, AU intensity was summarized using the mean and standard deviation, and AU presence was summarized as the proportion of frames in which the AU was active.
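
The windowing step can be illustrated with a short sketch for a single AU. The function name `summarize_windows`, the input layout (timestamp, intensity, presence), and the use of the population standard deviation are assumptions made for illustration; the text does not specify which standard deviation estimator was used.

```python
from statistics import fmean, pstdev

WINDOW_S = 2.0  # nonoverlapping window length in seconds, as described above


def summarize_windows(frames, window_s=WINDOW_S):
    """Aggregate frame-level values for one AU into fixed-length windows.

    frames: iterable of (timestamp_s, intensity, presence) tuples, where
    intensity is on the 0-5 scale and presence is 0 or 1.
    """
    buckets = {}
    for timestamp, intensity, presence in frames:
        # Frames with timestamps in [k * window_s, (k + 1) * window_s) share window k.
        buckets.setdefault(int(timestamp // window_s), []).append((intensity, presence))

    rows = []
    for index in sorted(buckets):
        intensities = [value[0] for value in buckets[index]]
        presences = [value[1] for value in buckets[index]]
        rows.append({
            "window_start_s": index * window_s,
            "intensity_mean": fmean(intensities),
            "intensity_sd": pstdev(intensities),  # assumption: population SD
            "presence_prop": sum(presences) / len(presences),
        })
    return rows


# Hypothetical frames, for illustration only.
example = [(0.0, 1.0, 1), (1.0, 3.0, 0), (2.0, 2.0, 1), (3.5, 2.0, 1)]
for row in summarize_windows(example):
    print(row)
```

Each AU then contributes three features per window (intensity mean, intensity standard deviation, and presence proportion).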

For written responses, AUs were measured during the interval in which participants provided their final answer and justification. The length of this interval varied substantially across participants, as responses differed in the amount of thought, elaboration, and reflection involved. In cases where behavioural cues indicated that the phase began before or continued beyond the writing period, the measurement window was extended accordingly. For confusion, AUs were measured for the duration of each confusion interval identified through the automated flagging and contextual verification procedure described above.
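
As a rough illustration of the automated flagging step, the sketch below marks runs of consecutive windows in which both AU4 and AU7 are elevated. The threshold, the minimum run length, the requirement that both AUs be elevated at once, and the function name are all assumptions; the text reports only that sustained elevations in AU4 and AU7 were flagged and then verified in context.

```python
def flag_candidate_intervals(au4, au7, threshold=1.0, min_windows=2):
    """Return (start, end) window indices, inclusive, of sustained elevation.

    au4, au7: per-window mean intensities for brow lowering and lid tightening.
    threshold and min_windows are hypothetical values chosen for illustration.
    """
    n = min(len(au4), len(au7))
    intervals = []
    start = None
    for i in range(n):
        elevated = au4[i] >= threshold and au7[i] >= threshold
        if elevated and start is None:
            start = i                      # a new run begins
        elif not elevated and start is not None:
            if i - start >= min_windows:   # keep only sufficiently long runs
                intervals.append((start, i - 1))
            start = None
    if start is not None and n - start >= min_windows:
        intervals.append((start, n - 1))   # run extends to the final window
    return intervals


# Hypothetical window-level intensities, for illustration only.
au4 = [0.2, 1.5, 1.6, 0.1, 1.2, 1.3, 1.4]
au7 = [0.1, 1.1, 1.2, 1.3, 1.0, 0.2, 1.5]
print(flag_candidate_intervals(au4, au7))  # [(1, 2)]
```

Intervals returned by such a routine would only be candidates; in the study they were labeled as confusion only after contextual review.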

2.4. Data Analysis

2.4.1. Machine Learning

Supervised machine learning was used to identify AU patterns associated with phases of understanding. AU intensity and presence were both included as model features. Two classification algorithms were evaluated: CatBoost and logistic regression. These algorithms were selected based on the findings of Lazic et al. [7], in which both linear and nonlinear models demonstrated strong performance and appeared well-suited to the complexity of the data. CatBoost, a gradient-boosted tree algorithm, was included as the nonlinear model because it can capture complex relationships while remaining robust to multicollinearity and requiring minimal preprocessing [34]. Logistic regression was retained as the baseline model to assess whether phases could be distinguished using linear decision boundaries [35].

All analyses were conducted in Python (version 3.12.3). The gradient-boosted model was trained using “CatBoost” (v1.2.8). The logistic regression model was trained using the “scikit-learn” package (v1.4.2), which was also used to compute evaluation metrics. Data processing was conducted using “pandas” (v2.2.2) and “NumPy” (v1.26.4). The analysis pipeline consisted of the following steps:

Feature scaling: Feature scaling was applied for the logistic regression model, as this algorithm is sensitive to the scale of input features [36]. AU values were standardized using parameters estimated within each training fold, and the same transformation was applied to the corresponding validation fold to prevent data leakage [37].
Selecting evaluation metrics: Models were evaluated using macro-averaged F1 score as the primary performance metric, along with macro-averaged precision and recall. Macro-averaging was used to prevent performance estimates from being dominated by the most frequent phases [38]. Balanced accuracy and overall accuracy were also calculated. Per-phase precision, recall, and F1 scores were computed to assess class-specific performance. All metrics were calculated separately within each cross-validation fold using held-out participants and then averaged across folds [39]. Together, these metrics provide complementary information regarding overall discrimination and performance on minority phases.
Cross-validation: Five-fold grouped cross-validation was used to evaluate model performance, balancing robust validation with maintaining adequate class representation in both training and validation splits [39].
Training logistic regression model: A logistic regression model with L2 regularization was trained as a baseline model. The regularization strength (C) was selected within each outer cross-validation fold using a participant-disjoint internal validation split [40]. Candidate values were C ∈ {0.01, 0.1, 1, 10}, and the best-performing value was retained within each fold.
Training CatBoost model: A CatBoost classifier was trained with hyperparameters selected within each outer cross-validation fold using a participant-disjoint internal validation split [41] and early stopping. The tuning grid included depth ∈ {4, 6, 8}, learning_rate ∈ {0.03, 0.1}, and l2_leaf_reg ∈ {1, 3, 10}. Early stopping determined the optimal number of boosting iterations within each fold [42]. Automatic class balancing was used to account for phase imbalance.
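
Two ideas in the steps above, participant-disjoint folds and scaling fitted on the training fold only, can be sketched without any modeling library. This is a simplified illustration, not the authors' pipeline: the function names are hypothetical, participants are assigned to folds round-robin (a library splitter such as scikit-learn's `GroupKFold` may balance folds differently), and model fitting is omitted.

```python
from statistics import fmean, pstdev


def grouped_folds(participant_ids, n_folds=5):
    """Yield (train_idx, valid_idx) so that no participant spans both sets."""
    unique_ids = sorted(set(participant_ids))
    # Assumption: simple round-robin assignment of participants to folds.
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_ids)}
    for k in range(n_folds):
        train = [i for i, pid in enumerate(participant_ids) if fold_of[pid] != k]
        valid = [i for i, pid in enumerate(participant_ids) if fold_of[pid] == k]
        yield train, valid


def standardize_fold(X, train_idx, valid_idx):
    """Fit mean/SD on training rows only, then apply to both splits."""
    n_features = len(X[0])
    means, sds = [], []
    for j in range(n_features):
        column = [X[i][j] for i in train_idx]
        means.append(fmean(column))
        sds.append(pstdev(column) or 1.0)  # guard against constant features

    def scale(row):
        return [(v - m) / s for v, m, s in zip(row, means, sds)]

    return [scale(X[i]) for i in train_idx], [scale(X[i]) for i in valid_idx]


# Hypothetical data: one row per 2 s window, tagged with its participant.
ids = ["p1", "p1", "p2", "p3", "p3", "p4", "p5", "p6"]
X = [[0.1], [0.3], [0.2], [0.9], [0.7], [0.4], [0.5], [0.6]]
for train_idx, valid_idx in grouped_folds(ids, n_folds=3):
    X_train, X_valid = standardize_fold(X, train_idx, valid_idx)
    print(len(X_train), len(X_valid))
```

Because the validation rows never contribute to the means or standard deviations, the held-out participants cannot leak information into the scaling step.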

2.4.2. Within-Person Analysis

In addition to the population-level machine learning models, a within-person analysis was conducted. Whereas the machine learning models tested whether phase-related AU patterns generalize across individuals, the within-person analysis tested whether participants showed differences in AU activity across phases. Differences in AU activity between phases were evaluated using pairwise comparisons. This analysis only included AU intensity summary values calculated within each 2 s window to reduce the risk of Type I error. The analysis pipeline consisted of the following steps:

Feature selection: To reduce the risk of Type I error and maintain interpretability, analyses were restricted to a subset of AUs that capture key facial areas: AU4 (brow lowering), AU7 (lid tightening), AU12 (lip corner puller), and AU15 (lip corner depressor).
Phase eligibility: Phases were compared only if they occurred in most participants, allowing stable estimation of participant baselines and reliable comparisons.
Computing participant-specific baselines: Each participant’s typical level and variability of AU activity were estimated across the phases being compared. To do this, all 2 s windows from those phases were pooled to calculate a baseline mean and standard deviation for each AU within participants. These baseline values were then used to compute within-person deviation scores.
Computing within-person deviation scores: For each AU and 2 s window, the intensity summary values were converted to within-person z-scores. This transformation expressed each window’s summary value relative to the participant’s baseline for that AU. If the baseline standard deviation for a given AU summary value was zero, deviation scores were set to missing.
Aggregating deviations by phase: For each participant and AU summary value, deviation scores from all 2 s windows within a given phase were averaged. This produced one participant-level mean deviation score per AU summary value for each phase.
Statistical comparison of phases: For each AU summary value, paired-sample t-tests were conducted across participants to compare phases. Cohen’s d_z was calculated to estimate the magnitude of within-person differences. p-values were adjusted using the Benjamini–Hochberg false discovery rate procedure [43].
Nonparametric robustness check: To assess whether results depended on the assumption of normally distributed difference scores, Wilcoxon signed-rank tests were conducted for each AU summary value as a nonparametric alternative to the paired-sample t-tests.
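
The core computations in the steps above can be sketched as follows. The function names and example numbers are hypothetical, the baseline uses the population standard deviation as an assumption, and the t-tests and Wilcoxon tests themselves are omitted because they require a statistics library; Cohen's d_z is computed in its usual form as the mean of the paired differences divided by their standard deviation.

```python
from statistics import fmean, pstdev, stdev


def within_person_z(values):
    """z-score one participant's window values against their own baseline."""
    baseline_mean, baseline_sd = fmean(values), pstdev(values)
    if baseline_sd == 0:
        return [None] * len(values)  # set to missing, as described above
    return [(v - baseline_mean) / baseline_sd for v in values]


def cohens_dz(phase_a_means, phase_b_means):
    """Mean paired difference divided by the SD of the paired differences."""
    diffs = [a - b for a, b in zip(phase_a_means, phase_b_means)]
    return fmean(diffs) / stdev(diffs)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical participant-level mean deviation scores for two phases.
phase_a = [0.5, 0.2, 0.4]
phase_b = [0.1, 0.0, 0.3]
print(round(cohens_dz(phase_a, phase_b), 3))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

In a full pipeline, the z-scores from `within_person_z` would first be averaged by phase within each participant, and those participant-level means would then be passed to the paired comparison.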

3. Results

3.1. Demographics

The sample consisted of 198 participants (M_age = 23.2 years, SD = 7.3), the majority of whom were female (67.2%). Participants were enrolled in or had completed a range of academic programs (Table 1).

3.2. Descriptive Statistics

A total of 1619 observations were identified across 56,557 2 s windows (Table 2). Nascent understanding was the most frequent phase, followed by emergent understanding and misunderstanding. Confusion occurred less frequently. Deep understanding and underconfidence were relatively rare; consequently, windows were not processed for these phases.

Table 3 provides representative examples of participant responses to the questions for the poem across phases of understanding, showing how answers, justifications, brainstorming, and certainty levels differ by phase.

3.3. Machine Learning

Due to their limited number of observations, deep understanding and underconfidence were excluded from the machine learning analyses to reduce instability during model training [44].

For the logistic regression model, the regularization strength varied across folds (C = 1 in two folds, C = 0.01 in two folds, and C = 0.10 in one fold). For CatBoost, hyperparameter values also varied across folds, although depth = 8, l2_leaf_reg = 3 and learning_rate = 0.10 were selected most frequently. The number of boosting iterations determined by early stopping within each fold ranged from 13 to 164.

Table 4 and Table 5 present the performance of the CatBoost and logistic regression models. Table 4 summarizes overall model performance across metrics. The logistic regression model showed poor recovery of minority phases and was strongly influenced by class imbalance, with low macro-averaged F1 and balanced accuracy. In contrast, the CatBoost model showed improved macro-averaged F1 and balanced accuracy, with balanced accuracy exceeding chance for a four-class problem (0.25), although overall performance remained modest.

Table 5 presents phase-specific precision, recall, and F1 scores, showing that performance varied across phases and differed by classifier. Nascent understanding was the most consistently identified phase in both models. Logistic regression showed very high recall (0.94) but only moderate precision (0.51), indicating a tendency to over-predict this phase, whereas CatBoost showed more balanced but lower recall (0.33). Misunderstanding was identified to some extent by CatBoost, with moderate precision and recall, but was rarely detected by logistic regression, which showed near-zero recall and F1. For confusion, CatBoost showed high recall (0.53) but very low precision (0.04), suggesting frequent but often incorrect identification, while logistic regression identified fewer instances but with somewhat higher precision. Emergent understanding was poorly identified by both models, particularly by logistic regression, which showed no recall for this phase. Overall, the results indicate uneven and limited discrimination between phases.

To further examine model performance, a confusion matrix for the CatBoost classifier is presented in Figure 1. The matrix shows that predictions were widely distributed across phases rather than concentrated along the diagonal, indicating limited separability between classes. Confusion was occasionally identified correctly but exhibited very low precision, reflecting frequent false positive predictions for this class. In contrast, nascent understanding, misunderstanding, and emergent understanding were commonly misclassified with one another, with no phase showing consistently high classification accuracy. This dispersion of predictions suggests that the model did not identify distinct, phase-specific patterns in the data, but instead relied on features that were shared across multiple phases.
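The relationship between a confusion matrix and per-class precision and recall can be sketched as follows. The matrix below is entirely hypothetical (it is not the matrix in Figure 1) and assumes rows are true phases and columns are predicted phases; it is constructed only to show how a rare class can have reasonable recall yet very low precision when many observations from other classes are assigned to it.

```python
def recall_and_precision(matrix, k):
    """Recall and precision for class k.

    Assumes matrix[i][j] counts observations with true class i
    that were predicted as class j.
    """
    row_total = sum(matrix[k])                  # all true instances of class k
    col_total = sum(row[k] for row in matrix)   # all predictions of class k
    recall = matrix[k][k] / row_total if row_total else 0.0
    precision = matrix[k][k] / col_total if col_total else 0.0
    return recall, precision


# Hypothetical counts; order: nascent, misunderstanding, confusion, emergent.
toy_matrix = [
    [20, 10, 15, 5],
    [8, 12, 7, 3],
    [1, 1, 3, 0],
    [4, 3, 6, 2],
]

confusion_recall, confusion_precision = recall_and_precision(toy_matrix, 2)
print(confusion_recall, confusion_precision)
```

Here the rare class is recovered in 3 of 5 true cases, but only 3 of the 31 predictions of that class are correct, mirroring the high-recall, low-precision pattern described for confusion.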

Shapley Additive Explanations (SHAP) analyses were conducted to examine how the CatBoost model used AU features to make its predictions [45]. Mean absolute SHAP values were calculated to summarize the magnitude of each feature’s contribution to the model, with global values aggregated across all observations and class-specific values aggregated within each phase of understanding (Table 6). The results show that no single AU feature dominated the model’s predictions; instead, influence was distributed across several features. AU4 (brow lowering) showed the highest global importance and was especially influential in predictions of confusion, as reflected by larger mean absolute SHAP values for that phase relative to other features. Other AU features also contributed to the model’s predictions, differing mainly in strength across phases rather than in being unique to a particular phase. Importantly, the relative similarity of feature importance across phases indicates that the same set of AUs contributed to multiple phase predictions, rather than providing phase-specific signals. This pattern suggests that the model relied on shared, non-distinctive features across phases, limiting its ability to differentiate between them and contributing to the overall modest classification performance.
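The aggregation of SHAP values into mean absolute importances can be sketched as below. This is a minimal illustration under stated assumptions: the nested-list layout (observation, class, feature), the toy values, and the choice to obtain the global value by averaging class-specific values across classes are all assumptions made for the example, since the exact aggregation used to produce Table 6 is not specified here.

```python
def mean_abs_shap(shap_values):
    """Aggregate SHAP values into mean absolute importances.

    shap_values[i][c][f] is assumed to hold the contribution of feature f
    to the class-c output for observation i (hypothetical layout).
    Returns (class_specific, global_importance), where
    class_specific[c][f] is the mean |SHAP| over observations and
    global_importance[f] averages class_specific over classes.
    """
    n_obs = len(shap_values)
    n_classes = len(shap_values[0])
    n_features = len(shap_values[0][0])
    class_specific = [
        [sum(abs(shap_values[i][c][f]) for i in range(n_obs)) / n_obs
         for f in range(n_features)]
        for c in range(n_classes)
    ]
    global_importance = [
        sum(class_specific[c][f] for c in range(n_classes)) / n_classes
        for f in range(n_features)
    ]
    return class_specific, global_importance


# Toy values: 2 observations, 2 classes, 2 features (not the study's data).
toy_shap = [
    [[0.2, -0.1], [-0.2, 0.1]],
    [[-0.4, 0.1], [0.4, -0.1]],
]
class_specific, global_importance = mean_abs_shap(toy_shap)
print(class_specific, global_importance)
```

Taking absolute values before averaging is what makes the summary insensitive to direction: in the toy data the first feature pushes predictions in opposite directions across observations, yet still receives the larger importance.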

Beeswarm plots were also generated to examine the distribution and direction of feature contributions across observations for each phase of understanding (Figure 2). To enhance interpretability, only the five AU features with the highest global mean absolute SHAP values are shown. The plots suggest that the CatBoost model did not rely primarily on stable main effects of individual AU features. Instead, several features showed observation-dependent contributions, with the same feature contributing in different directions across cases within the same phase. This pattern was especially evident outside confusion, where most features clustered near zero and showed substantial overlap in both magnitude and direction. By contrast, AU4_mean showed a broader, more consistently directional contribution to confusion, with higher values generally associated with positive SHAP values and lower values with negative SHAP values. This interpretation is consistent with prior work identifying AU4 as a prominent marker of confusion or closely related states. For example, prior studies have treated AU4 as the main facial indicator of experimentally induced confusion [24] and have linked AU4 to confusion and frustration-related expressions during learning [25,26]. Taken together, these patterns suggest that the model relied on a stronger local decision rule for confusion, while the remaining phases were characterized more by weak, context-dependent combinations of cues than by robust facial signatures.

3.4. Within-Person Analysis

The within-person analysis focused on nascent and emergent understanding because these phases were experienced by a large proportion of participants (see Table 2). Within-person comparisons require that phases be observed within the same individuals with sufficient frequency to support stable estimation of participant-specific baselines. Although misunderstanding occurred frequently overall, it was experienced by a smaller proportion of participants than emergent understanding. Restricting the analysis to nascent and emergent understanding therefore allowed for a larger and more consistent sample of participants, supporting more reliable within-person comparisons. A total of 156 participants were included in the analysis.

Paired comparisons of participant-level deviation scores tested whether, on average, participants showed systematic differences between nascent and emergent understanding (Table 7). Across AU features, most within-person differences were small and not statistically significant, indicating that participants did not reliably increase or decrease their facial activity during nascent understanding relative to emergent understanding. However, both the mean and variability of AU7 showed statistically significant differences after correcting for multiple comparisons. Although statistically significant, the effect sizes for AU7 were modest in magnitude (d_z = −0.324 and −0.275), corresponding to small within-person effects. This indicates that, even where differences were detected, the magnitude of change in facial activity between phases was limited. Wilcoxon signed-rank tests showed the same pattern of results, suggesting that these findings were not dependent on distributional assumptions.
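For readers unfamiliar with the paired effect size reported above, the following sketch shows how Cohen's d_z relates to the paired t statistic. The toy differences and their sign convention are hypothetical, and the implied t value assumes that all 156 participants contributed to the AU7 comparison, which is an assumption rather than something stated in the text.

```python
import math
import statistics


def cohens_dz(differences):
    """Cohen's d_z: mean of the paired differences divided by their sample SD."""
    return statistics.mean(differences) / statistics.stdev(differences)


def paired_t(differences):
    """Paired t statistic, which equals d_z multiplied by sqrt(n)."""
    return cohens_dz(differences) * math.sqrt(len(differences))


# Hypothetical per-participant differences in deviation scores
# (one value per participant; NOT the study's data).
diffs = [-0.3, 0.1, -0.5, 0.2, -0.4, -0.1, 0.0, -0.2]
dz = cohens_dz(diffs)
t_stat = paired_t(diffs)

# t implied by the reported d_z, ASSUMING n = 156 for this comparison.
t_reported = -0.324 * math.sqrt(156)
print(round(dz, 3), round(t_stat, 3), round(t_reported, 2))
```

The identity t = d_z * sqrt(n) explains how an effect of modest size can still reach statistical significance in a sample of this size.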

Distributional summaries of participant-level difference scores revealed substantial variability across individuals (Table 8). For most features, mean differences were close to zero, indicating little overall average shift between phases. At the same time, the standard deviations were large relative to these mean values, showing that individual changes were often much larger than the average effect. In many cases, the proportion of participants showing positive versus negative differences was nearly balanced, indicating no dominant direction of change. Even for AU7, which showed statistically significant average differences in the paired t-tests, changes were not uniform across individuals. Although a majority showed decreases from nascent to emergent understanding, a substantial minority showed increases.
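The kind of distributional summary described above can be sketched as follows. The difference scores are hypothetical and were chosen so that the mean is near zero while individual changes are large; the function and key names are illustrative only.

```python
import statistics


def direction_summary(differences):
    """Mean, SD, and share of positive and negative paired differences."""
    n = len(differences)
    return {
        "mean": statistics.mean(differences),
        "sd": statistics.stdev(differences),
        "prop_positive": sum(d > 0 for d in differences) / n,
        "prop_negative": sum(d < 0 for d in differences) / n,
    }


# Hypothetical per-participant difference scores (NOT the study's data).
toy_diffs = [-0.8, 0.9, -0.6, 0.7, -0.5, 0.4, -0.3, 0.2]
summary = direction_summary(toy_diffs)
print(summary)
```

A near-zero mean combined with a large standard deviation and an even split of signs indicates that individuals change substantially but in opposing directions, so the changes cancel at the group level.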

Overall, the within-person results suggest that changes between nascent and emergent understanding are small and not consistent across individuals. Although some average differences were detected, participants varied widely in how their facial activity changed.

4. Discussion

This study examined whether AU patterns are associated with phases of understanding during a literary analysis task. Machine learning models showed limited ability to discriminate phases at the population level, and within-person comparisons revealed only modest and inconsistent differences between nascent and emergent understanding. In both analyses, therefore, AUs provided only a weak signal for distinguishing phases of understanding.

These results may reflect that phase-related facial activity is person-specific. Constructionist theories argue that a psychological state can be realized through different configurations of bodily activity, and that categories such as “confusion” are abstractions derived from instances of that category that vary by individual and context [27,28]. Accordingly, there may not be a single AU pattern that reliably corresponds to a given phase of understanding across participants. This helps explain why population-level models showed limited discrimination. At the same time, the findings of Lazic et al. [7] do not necessarily contradict this view; their machine learning analyses may have captured statistical regularities at the population level without reflecting invariant patterns within individuals. This interpretation also helps explain the within-person results: although individuals may show some differentiation between phases, the specific configurations through which these phases are expressed vary across individuals. Therefore, phase-related structure may exist, but in an idiosyncratic form that is not well captured by models assuming stable patterns across individuals.

The nature of the task used in this study may also help explain the results. In well-defined tasks such as solving riddles [7], understanding is anchored to convergent solutions and identifiable transitions from not knowing to impasse to solution [46]. As a result, both performance and the subjective feeling of understanding are easier to define, and participants can more readily recognize where they are in the problem-solving process, creating sharper experiential boundaries between phases. By contrast, literary interpretation has no single correct answer or fixed path to a solution. Instead, understanding is evaluated against interpretive standards such as textual support, coherence, and depth rather than a single endpoint [3], making objective assessment more graded and inferential. Under these conditions, the subjective feeling of understanding may also be less clearly organized, with uncertainty, partial insight, reinterpretation, and confidence overlapping rather than forming sharply bounded states. As a result, individuals may have a less precise sense of phase boundaries, and the relationship between objective and subjective understanding may be less tightly aligned. This interpretation is consistent with work suggesting that affective-cognitive states during complex learning are dynamic and transitional rather than fixed [21], as well as broader arguments that people do not always have direct or fully reliable access to the processes underlying their judgments and feelings of knowing [14]. Consequently, AU patterns associated with these phases may be less likely to appear as clear contrasts across and within individuals. Rather than indicating that AUs are unrelated to phases of understanding, the results may instead reflect that AU-based phase detection depends partly on task structure and on how clearly phases can be identified, both temporally and conceptually.

Similarly, the task may have introduced additional psychological states that made phases of understanding more difficult to identify. Literary analysis is extended and open-ended, requiring sustained attention and interpretation. As a result, participants may have also experienced boredom, interest, or frustration among other states that may have overlapped with or obscured patterns related to phases of understanding [47].

This explanation applies in particular to the results observed for confusion, which may be affectively heterogeneous. In learning contexts, confusion is not always experienced negatively. It can be beneficial when it signals cognitive disequilibrium that prompts reflection, reconceptualization, and eventual resolution, but it can also lead to frustration, boredom, or disengagement if left unresolved [48,49]. Therefore, two learners classified as confused may still experience that phase differently: one as a constructive challenge and another as an aversive impasse. This variability may help explain the inconsistency in AU patterns observed for confusion. More specifically, facial expressions during confusion may depend not only on task context, but also on individuals’ dispositions toward confusion and how they experience and regulate it.

Lastly, the way phases were operationalized also likely influenced the results. Phases were defined using correctness, certainty ratings, and behavioural signals. While these criteria were theoretically grounded and applied consistently, they capture only an approximation of what is a complex and dynamic process. Moreover, assigning phase labels based on certainty ratings assumes that these ratings reliably reflect participants’ internal states, which may not always hold [14]. This issue was compounded by assigning labels based on correctness when participants reported being neither certain nor uncertain in their answers. Although this prevented loss of data, it introduced noise by forcing ambiguous or intermediate states into discrete categories.

Overall, the results highlight the challenge of measuring the feeling of understanding using AUs in a naturalistic academic context. This challenge aligns with findings from affective computing research, where model performance typically declines when moving from controlled settings to in-the-wild conditions. For example, in a large-scale cross-corpus study using in-the-wild facial expressions to detect emotions, a robust deep learning model achieved moderate accuracy (~66% on a benchmark dataset), underscoring the limits of using facial activity to measure affect in real-world conditions [50]. Importantly, however, these studies typically focus on basic emotional categories, which are more visually distinct than the cognitively grounded and conceptually adjacent phases examined in the present study. As a result, the classification problem addressed here is likely more challenging, as it involves detecting subtle, context-dependent states that may not have stable or distinctive facial signatures. More broadly, the study illustrates a common trade-off between prioritizing ecological validity and introducing noise that obscures psychological phenomena. Accordingly, the results may reflect the need for further research and methodological refinement, rather than indicating that the feeling of understanding does not exist or differ across phases.

One possible methodological implication that follows from this is how phase labels are represented in future work. If phase boundaries are overlapping or only partially distinguishable in naturalistic tasks, approaches that assign a single hard label to each observation may underrepresent the ambiguity of the underlying phenomenon. Research on soft-label and distributional approaches has shown that, in contexts involving ambiguous or overlapping categories, maintaining uncertainty in the target labels can offer a more accurate representation of the data than collapsing judgments into a single category [51,52,53]. Although the current study did not involve probabilistic labels, this line of research points to a promising direction for future investigations into phases of understanding. Studies of more open-ended academic tasks might benefit from methods that explicitly represent intermediate or uncertain cases, rather than forcing them into fully discrete categories. Although the present study cannot determine whether such methods would improve performance, they may provide a useful way for future research to examine whether the weaker results observed here reflect limitations in AU-based detection itself, ambiguity in the labelling process, or both.
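The difference between hard and soft targets can be sketched with a cross-entropy comparison. Everything in this example is hypothetical: the soft label that splits an ambiguous observation evenly between two phases, the two sets of predicted probabilities, and the phase ordering are illustrative assumptions and do not correspond to any labelling scheme used in the present study.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Order: nascent, misunderstanding, confusion, emergent (hypothetical case).
hard_label = [1.0, 0.0, 0.0, 0.0]   # ambiguous case forced into one phase
soft_label = [0.5, 0.0, 0.0, 0.5]   # same case with ambiguity preserved

confident_pred = [0.90, 0.03, 0.03, 0.04]   # commits to a single phase
split_pred = [0.48, 0.02, 0.02, 0.48]       # spreads mass over two phases

loss_hard_confident = cross_entropy(hard_label, confident_pred)
loss_hard_split = cross_entropy(hard_label, split_pred)
loss_soft_confident = cross_entropy(soft_label, confident_pred)
loss_soft_split = cross_entropy(soft_label, split_pred)
print(loss_hard_confident, loss_hard_split, loss_soft_confident, loss_soft_split)
```

Under the hard label the confident prediction is rewarded, whereas under the soft label the prediction that reflects the ambiguity receives the lower loss, which is the sense in which soft targets avoid penalizing a model for representing genuinely overlapping phases.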

This study has two main limitations. First, some phases could not be included in the machine learning and within-person analyses due to an insufficient number of observations. Although the questions were designed to elicit a balanced range of phase observations, the occurrence of phases of understanding cannot be fully controlled in an open-ended literary task. Participants differ in prior knowledge, interpretive ability, and engagement, which influences how often different phases arise and makes their distribution difficult to anticipate. In addition, some phases are inherently less likely to occur frequently given their definitions. For example, deep understanding requires not only correctness but also a high level of integration and coherence across the text, making such responses relatively uncommon within a single-session task. As a result, the findings do not reflect the full six-phase framework. Second, most participants were female undergraduate students from the same university, limiting the generalizability of the findings and potentially affecting AU extraction accuracy. OpenFace is a widely used and validated tool for extracting facial action units, but the existing literature does not systematically examine whether its performance varies across demographic groups such as gender, race, or age. Given broader findings in computer vision demonstrating that model performance can differ across demographic groups [54], this may have introduced additional variability into the AU measurements.

Several directions for future work follow from this study. First, future studies should continue to explore whether AU patterns are associated with phases of understanding across various academic tasks and populations using naturalistic designs. Second, future studies should explore whether phases of understanding can be identified using other physiological channels, such as speech, heart rate variability, and skin conductance. Each of these channels may capture aspects of the feeling of understanding that AUs do not [55]. Combining these channels within a single model would likely further improve classification accuracy [56]. Third, future studies should collect more data, both for phases that were excluded from this study’s analyses and for those that occurred infrequently, to improve model stability and enable analysis of all six phases of understanding. Finally, future studies should explore deep learning approaches, as end-to-end models would remove the need for manual feature extraction, may detect patterns that engineered features miss, and will be necessary for developing AI systems that identify phases of understanding in real time [57].

5. Conclusions

Whether LLMs can help students develop understanding depends on how we define it. If we reduce understanding to what students express in language, then current LLMs may already do what is needed. However, if we also define understanding as a feeling, then fully supporting it requires finding a way to measure this dimension. The present study took up this challenge in a naturalistic academic context. The findings suggest that doing so is difficult with facial action units alone and point to an important boundary condition for future research. Clarifying whether the limits observed here arise primarily from the physiological channel analyzed, the ambiguity of phase labels, or both remains an important next step.

Author Contributions

Conceptualization, M.L. and E.W.; methodology, M.L. and E.W.; formal analysis, M.L.; data curation, M.L.; writing—original draft preparation, M.L. and E.W.; writing—review and editing, M.L. and E.W.; visualization, M.L.; supervision, E.W.; project administration, E.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was approved by the University of Toronto Research Ethics Board (protocol number: 47895) on 21 February 2025.

Data Availability Statement

The dataset used in this study is not readily available. Requests to access the dataset should be directed to: steven.lazic@mail.utoronto.ca.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.

Appendix B

Why does the speaker feel the road he took made all the difference?
- Incorrect: Common but flawed “less traveled road” reading
  ○ Example: “He chose the road because it was less traveled, and taking a more difficult, unique path led to a bigger impact on his life.”
- Correct: Noticing the irony but stopping there
  ○ Example: “The speaker says the road ‘made all the difference,’ but in reality both roads were the same—Frost is being ironic.”
- Correct with Depth: Adding insights about human nature or broader thematic resonance
  ○ Example: “Though the poem ends by saying the road ‘made all the difference,’ the speaker admits earlier that both roads looked equally worn. Frost uses this tension to suggest how we rewrite our memories to highlight the uniqueness of our choices. It’s less about which road was truly different and more about how, in hindsight, people tell stories that justify or romanticize their past decisions.”
How does the morning setting symbolize the speaker’s clarity in deciding to take the road less traveled?
- Incorrect: Answers echo the question’s misleading assumption (morning = actual clarity, one road is truly “less traveled”).
  ○ Example: “Because it was morning, the speaker could see one road was definitely less traveled and knew with absolute certainty that this was the path to take, which is why it made all the difference.”
- Correct: Recognize the poem’s contradiction of that assumption but stop short of exploring deeper implications.
  ○ Example: “Although the question suggests the speaker had clarity, the poem emphasizes that both roads looked equally untraveled that morning. So, the ‘morning setting’ may symbolize a fresh start, but it doesn’t guarantee real clarity.”
- Correct with Depth: Correct the misreading and link the poem’s subtlety to broader themes of human psychology, choice, or narrative-building.
  ○ Example: “The question presupposes the speaker knew which road was less traveled. However, the poem tells us both roads were equally untraveled that morning. Frost uses the morning setting to suggest a feeling of newness or potential clarity—but in reality, the speaker couldn’t see where either road led. Later on, he frames his choice as more significant than it might have been, revealing how we create stories of clarity and uniqueness about our past decisions.”
How does the rhythm and meter of “though as for that the passing there” differ from the rest of the poem?
- Incorrect: Claims no difference in meter or otherwise disregards the poem’s overall prosodic structure.
  ○ Example: “It’s exactly the same as all the other lines—Frost didn’t change anything about the rhythm.”
- Correct: Points out there is a metrical shift in that line without exploring why it might matter.
  ○ Example: “Most of the poem is in a regular iambic tetrameter, but in ‘though as for that the passing there,’ there’s a slight shift in stress or an extra syllable that makes it scan differently from the rest.”
- Correct with Depth: Describes the nature of the shift and connects it to the poem’s broader themes or emotional effect.
  ○ Example: “Frost relies on an iambic tetrameter throughout most of the poem, but in ‘though as for that the passing there,’ he disrupts the expected rhythm. The stresses shift—there may be a trochaic substitution or an extra unaccented syllable—so the line feels a bit off-balance. This break in the steady meter mirrors the speaker’s moment of hesitation or uncertainty, subtly underlining how no choice here is truly ‘less traveled.’”
Why does the speaker doubt he will return to take the other road?
- Incorrect: Attributes the speaker’s doubt to something absent or contradictory in the text.
  ○ Example: “He dislikes that other road, so he won’t go back. It’s closed, and he doesn’t want to pass through it.”
- Correct: Accurately notes that subsequent life choices prevent returning to the same fork.
  ○ Example: “He knows that once he starts down one road, it will lead him on to other choices, and he probably won’t ever be able to come back to this exact spot.”
- Correct with Depth: Integrates a deeper understanding, linking the speaker’s practical doubt to the universal, irreversible nature of life decisions.
  ○ Example: “Although he says ‘Oh, I kept the first for another day!’ he also recognizes that each decision leads to new opportunities and obligations, so it’s practically impossible to go back to the exact crossroads. This highlights Frost’s broader idea that life’s choices aren’t just physically, but also psychologically, unrepeatable—we can’t recreate the same moment to pick the other option later.”
What does the undergrowth represent?
- Incorrect: Treats the undergrowth as merely literal bushes or an irrelevant obstacle.
  ○ Example: “The undergrowth is just thick bushes and has no deeper meaning. The speaker doesn’t want to walk through it because it might have thorny plants.”
- Correct: Identifies the undergrowth as a symbol of the road’s (or the future’s) uncertainty.
  ○ Example: “The undergrowth represents what the speaker cannot see about his future. It blocks his view of how the road will turn out.”
- Correct with Depth: Builds on the basic symbolic meaning and ties it to Frost’s themes of choice, the unseen consequences of decisions, and our universal human struggle with the unknown.
  ○ Example: “Frost describes one road as bending into the undergrowth, which literally prevents the speaker from seeing where it leads. On a deeper level, it symbolizes the unpredictable future. No matter which path we choose in life, part of it is obscured by uncertainty—we can’t fully know the consequences before we go. This undergrowth thus encapsulates the tension between our desire for clarity and the reality that we must choose amid the unknown.”
How does the repetition of ‘I’ in the final stanza relate to self-deception?
- Incorrect: Sees no connection between “I” repetition and any deeper meaning or mischaracterizes it completely.
  ○ Example: “There is no self-deception. Frost just liked using ‘I’ multiple times to sound poetic. It doesn’t mean anything.”
- Correct: Notes that the speaker’s repetition of “I” underscores personal emphasis and possible exaggeration.
  ○ Example: “By repeating ‘I,’ the speaker draws attention to himself, hinting that he wants to emphasize his personal role in taking a supposedly unique road. It suggests he might be inflating the significance of his choice.”
- Correct with Depth: Links this repetition to a deeper pattern of self-deception or myth-making, tying it to the poem’s irony and universal human tendencies to reshape our own narratives.
  ○ Example: “In repeating ‘I,’ the speaker spotlights himself as the active hero of his own story, suggesting a need to appear decisive and exceptional. This repetition, however, reveals a subtle self-deception: earlier in the poem, he admits the roads were ‘really about the same,’ so his claim of a ‘less traveled’ road is more of a retrospective myth. By stressing ‘I,’ he’s reinforcing a narrative where he made a bold, individualistic choice—even though the text hints that might not be strictly true.”
Why is the poem set in the woods?
- Incorrect: Provides irrelevant or purely literal explanations (e.g., “a random picnic spot”).
  ○ Example: “It’s set in the woods so the traveler could look at animals and have a picnic. Frost just picked it randomly.”
- Correct: Recognizes the woods as a place of branching paths and solitary decision-making.
  ○ Example: “The woods offer a secluded spot where two roads branch off, forcing the speaker to make a choice without outside influence.”
- Correct with Depth: Connects the woods’ mystery and isolation to the universal experience of choosing among unknown futures, underscoring the poem’s existential or psychological themes.
  ○ Example: “Frost situates the fork in the woods to evoke a place removed from everyday distractions, emphasizing the solitude and uncertainty of life choices. The dense undergrowth symbolizes the unknown outcomes, and the silent, natural setting mirrors how personal decisions often occur when we’re most isolated—reflecting our universal experience of forging a path without clear foresight.”
How does the speaker’s sigh reflect his rationalization of his decision?
- Incorrect: Ignores the poem’s context, treats the sigh as purely literal or otherwise unconnected to rationalization.
  ○ Example: “He sighs because he’s physically exhausted after a long walk in the woods. It has nothing to do with rationalizing his choice.”
- Correct: Recognizes the sigh as part of the speaker’s reflective or explanatory tone but does not fully integrate the poem’s ironic subtext.
  ○ Example: “He sighs as he looks back on his decision, suggesting he’s giving weight to the idea that choosing this road was important. It’s part of how he explains his choice to himself or others.”
- Correct with Depth: Explores how the sigh functions as part of the speaker’s story-making or self-deception, tying in the irony that both roads were the same and showing how this final sigh is a key to the speaker’s retrospective mythologizing.
  ○ Example: “When the speaker says, ‘I shall be telling this with a sigh,’ he’s envisioning a future in which he frames his decision as pivotal. This sigh could be regret, but more likely it’s a self-conscious flourish—he’s dramatizing his choice. Given the poem’s earlier hint that the roads were identical, the sigh becomes a tool of self-deception, helping him rationalize that he took a road ‘less traveled’ and thus made a daring, life-changing move.”

References

Paustian, T.; Slinger, B. Students are using large language models and AI detectors can often detect their use. Front. Educ. 2024, 9, 1374889.
Singer-Freeman, K.; Verbeke, K.; Barre, B. Generative AI usage among university students depends on academic level and task. High. Learn. Res. Commun. 2025, 15, 1–25.
Olson, D.R. Making Sense: What It Means to Understand; Cambridge University Press: Cambridge, UK, 2022.
Olson, D.R. Ascribing understanding to ourselves and others. Am. Psychol. 2024, 79, 920–927.
Dewey, J. Experience and education. Educ. Forum 1986, 50, 241–252.
Bruner, J.S. The Process of Education; Harvard University Press: Cambridge, MA, USA, 2009.
Lazic, M.; Woodruff, E.; Jun, J. Decoding subjective understanding: Using biometric signals to classify phases of understanding. AI 2025, 6, 18.
Woodruff, E. AI detection of human understanding in a Gen-AI tutor. AI 2024, 5, 898–921.
Dreyfus, H.L. What Computers Still Can’t Do: A Critique of Artificial Reason; MIT Press: Cambridge, MA, USA, 1992.
Searle, J.R. Minds, brains, and programs. Behav. Brain Sci. 1980, 3, 417–424.
Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 610–623.
Chi, M.T.H. Active-constructive-interactive: A conceptual framework for differentiating learning activities. Top. Cogn. Sci. 2009, 1, 73–105.
Barrett, L.F.; Mesquita, B.; Ochsner, K.N.; Gross, J.J. The experience of emotion. Annu. Rev. Psychol. 2007, 58, 373–403.
Nisbett, R.E.; Wilson, T.D. Telling more than we can know: Verbal reports on mental processes. Psychol. Rev. 1977, 84, 231–259.
Damasio, A. Feeling & Knowing: Making Minds Conscious; Pantheon: New York, NY, USA, 2021.
James, W. On some omissions of introspective psychology. Mind 1884, 9, 1–26.
Craig, A.D. How do you feel? Interoception: The sense of the physiological condition of the body. Nat. Rev. Neurosci. 2002, 3, 655–666.
Craig, A.D. How do you feel—Now? The anterior insula and human awareness. Nat. Rev. Neurosci. 2009, 10, 59–70.
Ekman, P.; Friesen, W.V. Facial Action Coding System (FACS); Consulting Psychologists Press: Palo Alto, CA, USA, 1978.
Cacioppo, J.T.; Petty, R.E.; Losch, M.E.; Kim, H.S. Electromyographic activity over facial muscle regions can differentiate the valence and intensity of affective reactions. J. Pers. Soc. Psychol. 1986, 50, 260–268.
D’Mello, S.K.; Graesser, A. Dynamics of affective states during complex learning. Learn. Instr. 2012, 22, 145–157.
Larsen, J.T.; Norris, C.J.; Cacioppo, J.T. Effects of positive and negative affect on electromyographic activity over zygomaticus major and corrugator supercilii. Psychophysiology 2003, 40, 776–785.
Baltrušaitis, T.; Robinson, P.; Morency, L.-P. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10.
Blum, L.; Dieckmann, A.; Unfried, M. Confusion can improve cognitive performance: An experimental study using automatic facial expression analysis. Proc. Annu. Meet. Cogn. Sci. Soc. 2020, 42, 1–7.
Craig, S.D.; D’Mello, S.; Witherspoon, A.; Graesser, A. Emote aloud during learning with AutoTutor: Applying the Facial Action Coding System to cognitive–affective states during learning. Cogn. Emot. 2008, 22, 777–788.
Grafsgaard, J.F.; Wiggins, J.B.; Boyer, K.E.; Wiebe, E.N.; Lester, J.C. Automatically recognizing facial indicators of frustration: A learning-centric analysis. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII), Geneva, Switzerland, 2–5 September 2013; pp. 159–165.
Siegel, E.H.; Sands, M.K.; Van den Noortgate, W.; Condon, P.; Chang, Y.; Dy, J.; Quigley, K.S.; Barrett, L.F. Emotion fingerprints or emotion populations? A meta-analytic investigation of autonomic features of emotion categories. Psychol. Bull. 2018, 144, 343–393.
Barrett, L.F. The theory of constructed emotion: An active inference account of interoception and categorization. Soc. Cogn. Affect. Neurosci. 2017, 12, 1–23. [Google Scholar] [CrossRef]
Krosnick, J.A.; Presser, S. Question and questionnaire design. In Handbook of Survey Research, 2nd ed.; Marsden, P.V., Wright, J.D., Eds.; Emerald: Bingley, UK, 2010; pp. 263–313. [Google Scholar]
Kotsiantis, S.B.; Zaharakis, I.; Pintelas, P. Supervised machine learning: A review of classification techniques. In Emerging Artificial Intelligence Applications in Computer Engineering; Maglogiannis, I., Karpouzis, K., Wallace, M., Soldatos, J., Eds.; IOS Press: Amsterdam, The Netherlands, 2007; Volume 160, pp. 3–24. [Google Scholar]
Lynen, J.F. The Pastoral Art of Robert Frost; Yale University Press: New Haven, CT, USA, 1960. [Google Scholar]
Orr, D. The Road Not Taken: Finding America in the Poem Everyone Loves and Almost Everyone Gets Wrong; Penguin Press: New York, NY, USA, 2015. [Google Scholar]
Thompson, L.; Winnick, R.H. Robert Frost: The Later Years, 1938–1963; Holt, Rinehart and Winston: New York, NY, USA, 1966; Volume 3. [Google Scholar]
Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
Nusinovici, S.; Tham, Y.C.; Yan, M.Y.C.; Ting, D.S.W.; Li, J.; Sabanayagam, C.; Cheng, C.Y. Logistic regression was as good as machine learning for predicting major chronic diseases. J. Clin. Epidemiol. 2020, 122, 56–69. [Google Scholar] [CrossRef]
Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; Wiley: Hoboken, NJ, USA, 2013. [Google Scholar]
Wilimitis, D.; Walsh, C.G. Practical considerations and applied examples of cross-validation for model development and evaluation in health care: Tutorial. JMIR AI 2023, 2, e49023. [Google Scholar] [CrossRef]
Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R; Springer: New York, NY, USA, 2013. [Google Scholar]
Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 2010, 11, 2079–2107. [Google Scholar]
Varoquaux, G.; Raamana, P.R.; Engemann, D.A.; Hoyos-Idrobo, A.; Schwartz, Y.; Thirion, B. Assessing and tuning brain decoders: Cross-validation, caveats, and guidelines. NeuroImage 2017, 145, 166–179. [Google Scholar] [CrossRef]
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Methodol. 1995, 57, 289–300. [Google Scholar] [CrossRef]
He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar] [CrossRef]
Lundberg, S.M.; Lee, S.-I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774. [Google Scholar]
Jonassen, D.H. Instructional design models for well-structured and ill-structured problem-solving learning outcomes. Educ. Technol. Res. Dev. 1997, 45, 65–94. [Google Scholar] [CrossRef]
D’Mello, S.K.; Graesser, A. Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features. User Model. User-Adapt. Interact. 2010, 20, 147–187. [Google Scholar] [CrossRef]
D’Mello, S.; Lehman, B.; Pekrun, R.; Graesser, A. Confusion can be beneficial for learning. Learn. Instr. 2014, 29, 153–170. [Google Scholar] [CrossRef]
Lehman, B.; D’Mello, S.; Strain, A.; Mills, C.; Gross, M.; Dobbins, A.; Graesser, A. Inducing and tracking confusion with contradictions during complex learning. Int. J. Artif. Intell. Educ. 2013, 22, 85–105. [Google Scholar] [CrossRef]
Ryumina, E.; Dresvyanskiy, D.; Karpov, A. In search of a robust facial expressions recognition model: A large-scale visual cross-corpus study. Neurocomputing 2022, 514, 435–450. [Google Scholar] [CrossRef]
Collins, K.M.; Bhatt, U.; Weller, A. Eliciting and learning with soft labels from every annotator. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Virtual, 6–10 November 2022; Volume 10, pp. 40–52. [Google Scholar] [CrossRef]
Fornaciari, T.; Uma, A.; Paun, S.; Plank, B.; Hovy, D.; Poesio, M. Beyond black & white: Leveraging annotator disagreement via soft-label multi-task learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Online, 6–11 June 2021; pp. 2591–2597. [Google Scholar] [CrossRef]
Singh, A.; Tiwari, A.; Hasanbeig, H.; Gupta, P. Soft-label training preserves epistemic uncertainty. arXiv 2025, arXiv:2511.14117. [Google Scholar] [CrossRef]
Buolamwini, J.; Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. Proc. Mach. Learn. Res. 2018, 81, 1–15. [Google Scholar]
Cacioppo, J.T.; Tassinary, L.G.; Berntson, G.G. (Eds.) Handbook of Psychophysiology, 3rd ed.; Cambridge University Press: Cambridge, UK, 2007. [Google Scholar] [CrossRef]
Ryumina, E.; Ryumin, D.; Axyonov, A.; Ivanko, D.; Karpov, A. Multi-corpus emotion recognition method based on cross-modal gated attention fusion. Pattern Recognit. Lett. 2025, 190, 192–200. [Google Scholar] [CrossRef]
LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Confusion matrix showing classification performance of the CatBoost model across phases of understanding.

Figure 2. Distribution and direction of SHAP values for AU features across phases of understanding.

Table 1. Participant demographics.

Characteristic	n	%	M	SD
Gender
Female	133	67.2
Male	59	29.8
Other	6	3.0
Program of study
Social sciences	65	32.8
STEM	103	52.0
Humanities	12	6.1
Other	18	9.1
Age			23.2	7.3

N = 198.

Table 2. Distribution of observations and windows across phases of understanding.

Phase of Understanding	Observations	Windows	Median Observations	Participants with ≥1/≥2 Observations (%)
Nascent	742	29,085	4	94.8/89.6
Misunderstanding	305	13,106	2	71.5/47.7
Confusion	129	620	1	36.3/17.1
Emergent	333	13,746	2	83.9/56.0
Deep	32	—	1	12.4/3.1
Underconfidence	78	—	1	28.5/8.8

N = 193. Values in the ≥1/≥2 column indicate the percentage of participants with at least one and at least two observations of the corresponding phase, respectively.

Table 3. Representative participant responses across phases of understanding, including answers, reasoning, and certainty ratings.

Phase	Question	Answer and Justification	Brainstorming	Level of Certainty
Nascent	How does the rhythm and meter of “though as for that the passing there” differ from the rest of the poem?	Unsure. Personally, the line felt choppier and more unnatural than the rest of the poem but I cannot articulate or provide concrete evidence for why.	unsure of what ‘meter’ means in poems. the rhythm sounds choppier and the words does not flow as well. at the same time i can’t pinpoint if this specific line is really that different from the rest of the poem	1 (Uncertain)
Misunderstanding	Why does the speaker feel the road he took made all the difference?	He felt it made all the different because he is trying something new out and might favour him in the long run.	He made a choice taking the road less travelled and he believed he made all the difference because he is exploring a new path that isn’t usually taken which makes him different from others and make the path less taken better.	3 (Certain)
Confusion	What does the undergrowth represent?		what is an undergrowth -_-? like hanging bush?? or is that a overgrowth
Emergent	Why is the poem set in the woods?	The woods set the setting of mystery and a symbol of branches of opportunities.	Woods... Symbolic for wisdom. Lost. slenderman! road? lost?	3 (Certain)
Deep	How does the repetition of ‘I’ in the final stanza relate to self-deception?	The speaker is reiterating to themselves that they made the brave, noble decision to take the path that was less traveled, despite not knowing what laid ahead and how many people had experienced that path. I believe the speaker is reassuring themselves of their decision by emphasizing the road was less traveled and that they should be proud of that decision, to avoid any regret for not taking the other path. In this way, the speaker is actively deceiving themselves. Additionally, the poem suggests that both paths were, in earnest, equally worn/unworn, so there is also possibly an element of the speaker deceiving themselves that they did truly pick the less traveled path, and thus deceiving themselves that they’re special/unique in doing so.		3 (Certain)
Underconfidence	What does the undergrowth represent?	The point at which the road bends at the undergrowth represents the farthest point he can see of the path. The undergrowth can thus represent uncertainty of what lies ahead, and fear of the unknown.		1 (Uncertain)

Table 4. Overall performance metrics for CatBoost and logistic regression models.

Metric	CatBoost	Logistic
Macro F1	0.24	0.20
Macro precision	0.26	0.27
Macro recall	0.35	0.26
Balanced accuracy	0.35	0.26
Accuracy	0.30	0.49

Table 5. Per phase performance metrics for CatBoost and logistic regression models.

Metric	Phase	CatBoost	Logistic
Precision	Nascent	0.52	0.51
Recall	Nascent	0.33	0.94
F1 score	Nascent	0.40	0.66
Precision	Misunderstanding	0.25	0.17
Recall	Misunderstanding	0.32	0.02
F1 score	Misunderstanding	0.27	0.03
Precision	Confusion	0.04	0.30
Recall	Confusion	0.53	0.09
F1 score	Confusion	0.07	0.09
Precision	Emergent	0.22	0.11
Recall	Emergent	0.21	0.00
F1 score	Emergent	0.21	0.01
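
As a reading aid, the macro-averaged values in Table 4 can be approximately recovered from the per-phase values in Table 5. The sketch below assumes that "macro" means the unweighted mean over the four modelled phases; the variable and function names are illustrative and not taken from the study's code. Because the inputs are already rounded to two decimals, agreement is only expected to within rounding.

```python
# Per-phase metrics transcribed from Table 5, in the order
# Nascent, Misunderstanding, Confusion, Emergent.
per_phase = {
    "CatBoost": {
        "precision": [0.52, 0.25, 0.04, 0.22],
        "recall": [0.33, 0.32, 0.53, 0.21],
        "f1": [0.40, 0.27, 0.07, 0.21],
    },
    "Logistic": {
        "precision": [0.51, 0.17, 0.30, 0.11],
        "recall": [0.94, 0.02, 0.09, 0.00],
        "f1": [0.66, 0.03, 0.09, 0.01],
    },
}


def macro_average(values):
    """Unweighted mean over classes, so every phase counts equally."""
    return sum(values) / len(values)


# Assumption: macro metrics are plain means of the per-phase metrics.
macro = {
    model: {metric: macro_average(vals) for metric, vals in metrics.items()}
    for model, metrics in per_phase.items()
}

for model, metrics in macro.items():
    for metric, value in metrics.items():
        print(f"{model:9s} macro {metric:9s} = {value:.4f}")
```

Under this assumption, macro recall coincides with balanced accuracy, which is consistent with the identical values reported for those two rows in Table 4. Overall accuracy cannot be recovered this way, because it depends on the class frequencies rather than on an equal weighting of phases.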

Table 6. Global and class-specific mean absolute SHAP values for AUs across phases of understanding.

Action Unit	Global Value	Nascent	Misunderstanding	Confusion	Emergent
AU4_mean	0.185	0.137	0.111	0.362	0.126
AU5_prop	0.036	0.030	0.031	0.063	0.016
AU6_mean	0.019	0.008	0.020	0.036	0.016
AU7_mean	0.030	0.021	0.020	0.054	0.033
AU9_prop	0.022	0.013	0.017	0.082	0.012
AU10_mean	0.016	0.008	0.017	0.026	0.012
AU12_mean	0.011	0.011	0.014	0.011	0.009
AU14_prop	0.025	0.013	0.030	0.057	0.016
AU15_mean	0.023	0.012	0.024	0.063	0.012
AU17_mean	0.025	0.013	0.016	0.051	0.022
AU20_sd	0.010	0.007	0.008	0.011	0.007
AU23_mean	0.029	0.012	0.022	0.081	0.025
AU25_mean	0.013	0.010	0.010	0.022	0.009
AU26_mean	0.014	0.010	0.007	0.029	0.014
AU28_prop	0.000	0.000	0.000	0.000	0.000
AU45_mean	0.052	0.038	0.029	0.105	0.039

Note. For each AU, only the feature with the highest global importance is reported, to aid interpretability.

Table 7. Within-person paired comparisons of AU activity between nascent and emergent understanding.

Action Unit	t	p (adj)	d_z	p (Wilcoxon)
AU4_mean_z	0.092	0.927	0.007	0.719
AU4_sd_z	−0.846	0.798	−0.068	0.477
AU7_mean_z	−3.946	0.001	−0.324	<0.001
AU7_sd_z	−3.350	0.004	−0.275	0.003
AU12_mean_z	−0.115	1.000	−0.010	0.281
AU12_sd_z	−0.168	1.000	−0.014	0.109
AU15_mean_z	0.686	0.790	0.055	0.577
AU15_sd_z	1.460	0.390	0.117	0.109

Note. Significance was evaluated at α = 0.05.

Table 8. Distribution of within-person differences between nascent and emergent understanding.

Action Unit	M Δ	SD	% Positive	% Negative
AU4_mean_z	0.004	0.555	48.1	51.9
AU4_sd_z	−0.026	0.381	48.7	51.3
AU7_mean_z	−0.180	0.555	33.1	66.9
AU7_sd_z	−0.121	0.440	39.9	60.1
AU12_mean_z	−0.005	0.483	44.1	55.9
AU12_sd_z	−0.007	0.473	41.4	58.6
AU15_mean_z	0.032	0.591	50.6	49.4
AU15_sd_z	0.063	0.541	55.1	44.9

Note. Δ = emergent − nascent. Positive values indicate higher activity during emergent understanding.
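
Tables 7 and 8 can be read together. The effect-size column of Table 7 (Cohen's d for paired samples, usually written d_z) is conventionally the mean within-person difference divided by the standard deviation of those differences. The sketch below assumes that the SD column of Table 8 is the standard deviation of the within-person differences; that reading, and the names used in the code, are assumptions made for illustration. Small gaps are expected because the table values are rounded to three decimals.

```python
# Each entry: (mean difference, SD of differences) from Table 8,
# followed by the effect size reported in Table 7.
table_rows = {
    "AU4_mean_z": (0.004, 0.555, 0.007),
    "AU4_sd_z": (-0.026, 0.381, -0.068),
    "AU7_mean_z": (-0.180, 0.555, -0.324),
    "AU7_sd_z": (-0.121, 0.440, -0.275),
    "AU12_mean_z": (-0.005, 0.483, -0.010),
    "AU12_sd_z": (-0.007, 0.473, -0.014),
    "AU15_mean_z": (0.032, 0.591, 0.055),
    "AU15_sd_z": (0.063, 0.541, 0.117),
}


def cohens_dz(mean_diff, sd_diff):
    """Paired-samples effect size: mean difference over the SD of differences."""
    return mean_diff / sd_diff


recomputed = {}
gaps = {}
for feature, (mean_diff, sd_diff, reported_dz) in table_rows.items():
    recomputed[feature] = cohens_dz(mean_diff, sd_diff)
    # Absolute gap between the recomputed and the reported effect size.
    gaps[feature] = abs(recomputed[feature] - reported_dz)
    print(
        f"{feature:12s} recomputed = {recomputed[feature]:+.3f}  "
        f"reported = {reported_dz:+.3f}  gap = {gaps[feature]:.4f}"
    )
```

Under this assumption, the recomputed values agree with the reported ones to within rounding, which suggests the two tables describe the same paired differences from complementary angles: Table 7 standardizes the mean shift, while Table 8 shows how evenly participants split between positive and negative shifts.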

Share and Cite

MDPI and ACS Style

Lazic, M.; Woodruff, E. Boundary Conditions for AU-Based Detection of Understanding: A Literary Analysis Study. Electronics 2026, 15, 2059. https://doi.org/10.3390/electronics15102059

