Study 2 extended Study 1 by introducing mechanistic interpretability as a moderator and by formally testing the moderated-mediation prediction (H3b). It addressed H2a, H2b, H3a, and H3b.
5.1. Method
Participants and design. A power analysis for a 2 × 2 between-subjects ANOVA detecting a small-to-medium interaction (f = 0.15) with α = 0.05 and power = 0.80 yielded a minimum N of 351. To accommodate exclusions and ensure equal cell sizes, we recruited 400 participants from Credamo (excluding those who had previously taken part in the pilot or Study 1) and randomly assigned them, with stratification on sex, to one of four cells in a 2 (AI empathy, high versus low) × 2 (mechanistic interpretability, present versus absent) between-subjects design. After applying the same exclusion rules as Study 1, 355 participants remained (88.8% retention; Mage = 27.2, SD = 5.6; 57.7% female; 90.7% bachelor’s degree or higher; cell sizes ranged from 84 to 93).
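The sample-size target above can be approximated with a noncentral F calculation. The sketch below is illustrative only and is not the authors' procedure: it assumes the G*Power-style convention for a fixed-effects ANOVA effect (noncentrality λ = f² × N, numerator df = 1, denominator df = N − 4), and the function names are hypothetical.

```python
from scipy import stats


def anova_effect_power(n_total, f=0.15, alpha=0.05, df_num=1, n_groups=4):
    """Power of the F test for one effect in a fixed-effects ANOVA.

    Assumed convention: noncentrality lambda = f^2 * N,
    denominator df = N - number of cells.
    """
    df_den = n_total - n_groups
    noncentrality = f ** 2 * n_total
    f_crit = stats.f.ppf(1 - alpha, df_num, df_den)
    # Probability that a noncentral F variate exceeds the critical value.
    return stats.ncf.sf(f_crit, df_num, df_den, noncentrality)


def minimum_n(target_power=0.80, n_groups=4, **kwargs):
    """Smallest total N whose power reaches the target."""
    n_total = n_groups + 2
    while anova_effect_power(n_total, n_groups=n_groups, **kwargs) < target_power:
        n_total += 1
    return n_total


required_n = minimum_n()
print(required_n, round(anova_effect_power(required_n), 4))
```

Under these assumptions the search should land at or very near the reported minimum of 351; a small discrepancy would point to a different noncentrality convention rather than to an error in the reported figure.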
Stimuli and the mechanistic-interpretability manipulation. The four conditions were created by orthogonally crossing AI empathy (operationalized as in Study 1) with mechanistic interpretability (present versus absent). The two mechanistic-interpretability-absent conditions (cells C1 and C2) showed the same chat-only screenshots used in Study 1. The two mechanistic-interpretability-present conditions (cells C3 and C4) added a mechanistic interpretability panel directly below the AI agent’s reply, displaying the AI’s internal computation in real time. The panel comprised two stacked cards.
The upper card (purple header) was labeled “AI Self-Activation of Internal Emotion Vectors” and listed six named emotion-related feature vectors inside the language model. The vectors were Care, Empathy, Concern, Compassion, Warmth, and Resonance, each rendered as a horizontal bar with its activation magnitude (a value between 0 and 1) printed at the right end of the bar. A chip in the top right read “171 dimensions,” anchoring the visualization to Templeton et al.’s [11] identification of 171 emotion-related feature vectors inside Claude 3 Sonnet, and a footer noted that the activations were computed in real time. The lower card (green header) was labeled “AI Activated Neural Network Modules” and listed five named modules engaged in producing the reply, each tagged with a layer index. The five modules were the Emotion Recognition Network · Layer 12, the Empathic Attention Engine · Layer 14, the Emotional Comfort Generator · Layer 18, the High-Emotion Compensation Strategy · Layer 22, and the Care Response Module · Layer 24.
The MI panel content was held constant across the two mechanistic-interpretability-present cells (cells C3 and C4), so the MI manipulation isolated the presence versus absence of the panel itself rather than varying panel content alongside the AE manipulation. Holding panel content constant tested whether mechanistic transparency, on its own, raised perceived authenticity sufficiently to rehabilitate empathic AI, without confounding the test with differential panel content across the AE conditions.
The panel design drew on three sources. The visualization conventions followed Karny et al. [12], who introduced a Neural Transparency Interface for human–AI interaction. We adapted their D3.js sunburst visualization into a static two-card layout suitable for online stimulus delivery. The 171-vector representation invoked Templeton et al. [11], whose feature-extraction work on Claude 3 Sonnet identified emotion-related directions in the model’s internal activations and provided technical authority for our claim. Finally, the manipulation operationalized Mind Perception Theory [8, 35] by making the AI’s internal experiential states visible to consumers, which our theory predicted would trigger reattribution along the experience dimension of mind perception. A single annotated screenshot of cell C4 (high AI empathy plus mechanistic interpretability present), with the two manipulated regions outlined as boxes 1 and 2, appears in Figure A1.
Stimulus design and construct status. The mechanistic-interpretability panel in cells C3 and C4 was a consumer-facing stimulus designed in the style of mechanistic-interpretability research, not a live readout from a running language model with measured internal activations. The visualization conventions followed Karny et al. [12], whose Neural Transparency Interface introduced a bar-chart-plus-module-card layout for surfacing internal model activity to non-specialist viewers. The 171-vector framing invoked Templeton et al. [11], who identified 171 emotion-related feature directions inside Claude 3 Sonnet, and grounded the panel’s technical content in published interpretability research. The specific bar values and module names rendered to participants were chosen and held constant across the two MI-present cells in the manner of a standard vignette-based stimulus; they varied across cells only by the presence or absence of the panel itself. We refer to this stimulus as a “mechanistic-interpretability cue” throughout the manuscript and reserve the term “live mechanistic interpretability” for the engineering case in which the displayed activations were read out from a running model. The two are theoretically continuous (the cue invokes the same signaling and mind-perception mechanisms as a live readout would), but they differ in the cost the firm bears to produce them, which we discuss as a limitation in Section 6.3. The Study 2 manipulation check accordingly measured the visibility of the panel’s vector and module content rather than perceived signal cost or verifiability.
Procedure. The procedure mirrored Study 1, with one addition. A 3-item mechanistic-interpretability manipulation check appeared after the AI-empathy manipulation check. Items were self-developed for this research and targeted the three components of the panel (vector visibility, module visibility, and overall mechanistic understanding).
Measures. All measures from Study 1 were retained with identical wording. Reliability values in Study 2 were α = 0.96 (PETS), α = 0.96 (perceived authenticity), and α = 0.94 (brand intimacy). The mechanistic-interpretability manipulation check (α = 0.91) was added as described above.
5.2. Results
Sample, exclusions, and reliability. Of 400 participants recruited, 355 were retained (88.8% retention). Sample composition is reproduced in Table 1. Reliability values were α = 0.96 (PETS), α = 0.96 (perceived authenticity), α = 0.94 (brand intimacy), and α = 0.91 (mechanistic-interpretability manipulation check). The complete measurement-model panel is reproduced in Appendix C, Table A6.
Manipulation checks. Both manipulations succeeded. For AI empathy, Mhigh = 4.34 (SD = 0.77) versus Mlow = 3.65 (SD = 0.88), t(353) = 7.87, p < 0.001, d = 0.84. For mechanistic interpretability, Mpresent = 4.37 (SD = 0.90) versus Mabsent = 3.52 (SD = 0.86), t(353) = 9.19, p < 0.001, d = 0.98. The smaller AI-empathy effect in Study 2 relative to Study 1 reflects the increased cognitive load of the four-cell design and is consistent with the dampening typically observed in factorial replications. Both effects remain in the large-effect range.
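The reported effect sizes can be roughly recovered from the summary statistics alone. The sketch below is a back-of-the-envelope check, not the authors' computation: it assumes equal group sizes (the actual cells ranged from 84 to 93), so small deviations from the reported d values are expected from rounding and unequal n. The function name is hypothetical.

```python
import math


def cohens_d_equal_n(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d with a pooled SD, assuming equal group sizes."""
    pooled_sd = math.sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Summary statistics as reported in the manipulation-check paragraph.
d_empathy = cohens_d_equal_n(4.34, 0.77, 3.65, 0.88)  # reported d = 0.84
d_mi = cohens_d_equal_n(4.37, 0.90, 3.52, 0.86)       # reported d = 0.98

print(round(d_empathy, 2), round(d_mi, 2))
```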
H1a (replication), H2a, and H3a, the 2 × 2 ANOVA on brand intimacy. A 2 (AI empathy) × 2 (mechanistic interpretability) between-subjects ANOVA on brand intimacy (Table 2) revealed a significant AI-empathy main effect, F(1, 351) = 106.12, p < 0.001, ηp² = 0.232, replicating Study 1’s H1a. The mechanistic-interpretability main effect was substantially larger, F(1, 351) = 252.08, p < 0.001, ηp² = 0.418, supporting H2a. Critically, the interaction was also significant, F(1, 351) = 34.94, p < 0.001, ηp² = 0.091, supporting H3a.
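As a consistency check, the three effect sizes follow from the F statistics if they are read as partial eta squared, which is an assumption on our part about the statistic reported. The sketch uses the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error); the function and dictionary names are hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F statistics as reported for the 2 x 2 ANOVA on brand intimacy.
reported_f = {
    "AI empathy": 106.12,
    "mechanistic interpretability": 252.08,
    "interaction": 34.94,
}

effect_sizes = {
    name: round(partial_eta_squared(f_value, df_effect=1, df_error=351), 3)
    for name, f_value in reported_f.items()
}
print(effect_sizes)
```

The recovered values agree with the reported 0.232, 0.418, and 0.091 to three decimals.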
The interaction took the predicted attenuation form on brand intimacy. When mechanistic interpretability was absent, high AI empathy produced substantially lower brand intimacy than low AI empathy (Mhigh = 2.13, SD = 0.77; Mlow = 3.92, SD = 1.06), replicating the Study 1 backfire pattern with greater magnitude in the cleaner contrast of the four-cell design. When mechanistic interpretability was present, the gap between high and low AI empathy was substantially attenuated (Mhigh = 4.56, SD = 1.15; Mlow = 4.95, SD = 1.09), bringing high-empathy brand intimacy close to, though slightly below, low-empathy brand intimacy. Across the four cells, brand intimacy was highest in cell C3 (low AI empathy, mechanistic interpretability present; M = 4.95) and second-highest in cell C4 (high AI empathy, mechanistic interpretability present; M = 4.56); the two mechanistic-interpretability-present cells together substantially exceeded the two mechanistic-interpretability-absent cells. The four cell means on brand intimacy are visualized in Figure 3, panel (a).
The same interaction on perceived authenticity (the mediator). A parallel 2 × 2 ANOVA on perceived authenticity revealed the same overall pattern but with a critical structural difference: the simple effect of AI empathy on perceived authenticity reversed sign across the levels of mechanistic interpretability. When mechanistic interpretability was absent, high AI empathy produced lower perceived authenticity than low AI empathy (Mhigh = 2.40, SD = 0.77; Mlow = 3.82, SD = 0.80; difference = −1.42 scale points). When mechanistic interpretability was present, high AI empathy produced slightly higher perceived authenticity than low AI empathy (Mhigh = 4.50, SD = 0.78; Mlow = 4.30, SD = 0.89; difference = +0.20 scale points). The same shift appeared in the a path of the moderated mediation, reported in the next paragraph, where the simple slope of AI empathy on perceived authenticity equaled a1 = −1.415 when mechanistic interpretability was absent and a1 + a3 = +0.197 when mechanistic interpretability was present. The four cell means on perceived authenticity are visualized in Figure 3, panel (b).
H3b, moderated mediation. The central test asked whether the AI-empathy slope on perceived authenticity reversed sign across mechanistic interpretability and whether the negative conditional indirect effect of AI empathy on brand intimacy documented in Study 1 was neutralized when mechanistic interpretability was visible. We tested this prediction using PROCESS Model 7 (first-stage moderated mediation) with 5000 bias-corrected bootstrap resamples. The full model is reported in Table 3.
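The logic of a first-stage moderated mediation test can be sketched outside PROCESS. The example below is a minimal illustration on synthetic data, not a reproduction of the reported analysis: the data-generating coefficients are invented placeholders, the function names are hypothetical, and it uses a plain percentile bootstrap rather than the bias-corrected interval used in the study. The mediator model regresses M on X, W, and X × W; the outcome model regresses Y on X and M; the index is a3 × b.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n=355):
    """Synthetic 2 x 2 data; all coefficients are illustrative placeholders."""
    x = rng.integers(0, 2, n)  # AI empathy (0 = low, 1 = high)
    w = rng.integers(0, 2, n)  # MI cue (0 = absent, 1 = present)
    m = 3.8 - 1.4 * x + 0.5 * w + 1.6 * x * w + rng.normal(0, 0.8, n)
    y = 0.3 + 0.1 * x + 0.95 * m + rng.normal(0, 0.8, n)
    return x, w, m, y


def ols(design, outcome):
    """Ordinary least squares coefficients."""
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef


def index_of_moderated_mediation(x, w, m, y):
    ones = np.ones_like(m)
    a = ols(np.column_stack([ones, x, w, x * w]), m)  # a[3] is a3
    b = ols(np.column_stack([ones, x, m]), y)         # b[2] is b
    return a[3] * b[2]


def percentile_bootstrap(x, w, m, y, n_boot=2000, level=0.95):
    """Percentile (not bias-corrected) bootstrap CI for the index."""
    n = len(y)
    estimates = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample rows with replacement
        estimates[i] = index_of_moderated_mediation(x[idx], w[idx], m[idx], y[idx])
    tail = (1 - level) / 2 * 100
    return np.percentile(estimates, [tail, 100 - tail])


x, w, m, y = simulate()
point = index_of_moderated_mediation(x, w, m, y)
ci_low, ci_high = percentile_bootstrap(x, w, m, y)
print(round(point, 3), round(ci_low, 3), round(ci_high, 3))
```

An interval that excludes zero indicates that the indirect effect differs between the two moderator levels, which is the inferential criterion applied to H3b.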
Our interpretation separates three claims that a single summary statement would collapse into one. First, the simple slope of AI empathy on perceived authenticity reversed sign across the levels of mechanistic interpretability (a1 = −1.415 with MI absent vs. a1 + a3 = +0.197 with MI present), which was statistically supported and which we describe as a structural slope reversal on the mediator. Second, the Index of Moderated Mediation (a3 × b) was +1.544, 95% BC CI [+1.220, +1.869], with the interval well clear of zero, supporting H3b. Third, the AI-empathy by mechanistic-interpretability interaction on perceived authenticity was strong (a3 = +1.612, p < 0.001), and the mediator-outcome path remained large (b = +0.957, p < 0.001). The conditional indirect effect was substantially negative when mechanistic interpretability was absent (indirect effect = −1.355, 95% BC CI [−1.599, −1.108]), replicating the Study 1 backfire and amplifying it, owing to the cleaner contrast within the 2 × 2 design. When mechanistic interpretability was present, the conditional indirect effect was small and positive in point estimate (indirect effect = +0.189), but its 95% BC CI overlapped zero ([−0.044, +0.415]). The defensible substantive conclusion is therefore that the negative indirect effect of AI empathy on brand intimacy through perceived authenticity, present when mechanistic interpretability is absent, is neutralized when mechanistic interpretability is present. A reliable positive indirect effect with MI present was not established, although the data were consistent with one and a sufficiently powered replication may reveal it. The conditional indirect effects are visualized in Figure 4.
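The reported quantities are linked by simple products of path coefficients, which the following sketch makes explicit. It assumes the standard first-stage formulas (conditional indirect effect = (a1 + a3 × W) × b, index = a3 × b) and uses only the rounded coefficients reported above, so agreement with the reported effects holds only up to rounding and bootstrap averaging. The function name is hypothetical.

```python
# Path coefficients as reported in the text (rounded to three decimals).
a1 = -1.415  # slope of AI empathy on authenticity with MI absent
a3 = 1.612   # AI empathy x MI interaction on authenticity
b = 0.957    # authenticity -> brand intimacy


def conditional_indirect(w):
    """Indirect effect of AI empathy at moderator level w (0 or 1)."""
    return (a1 + a3 * w) * b


indirect_absent = conditional_indirect(0)   # reported: -1.355
indirect_present = conditional_indirect(1)  # reported: +0.189
index_mod_med = a3 * b                      # reported: +1.544

# The index equals the difference between the two conditional effects.
print(round(indirect_absent, 3), round(indirect_present, 3), round(index_mod_med, 3))
```

Because the moderator is binary, the index is exactly the gap between the two conditional indirect effects, which is why a nonzero index supports moderation even though the MI-present effect is itself not distinguishable from zero.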
Robustness. Under a tighter completion-time window (5th to 95th percentile; in Study 2, 335–1056 s, n = 319), the Index of Moderated Mediation was +1.623, 95% BC CI [+1.257, +1.982], substantively identical to the headline +1.544 [+1.220, +1.869]; the conditional indirect effects retained their signs and inferential status (with MI absent, indirect effect = −1.383 [−1.645, −1.120]; with MI present, indirect effect = +0.240 [−0.024, +0.502]). The demand-characteristics caveat and the correlational status of the mediator-outcome path are discussed in Section 6.3.