Review Reports - Active Inference Modeling of Socially Shared Cognition in Virtual Reality

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

1 The study employs a fixed session order where the "Bot Pair" condition always precedes the "Human Pair" condition. Since the bot was programmed to provide consistently incorrect feedback on "Hard" items in Session 3, this phase essentially functions as a training period that induces a specific state of cognitive uncertainty. How can the authors distinguish between genuine "collaborative convergence" in Session 4 and a simple carry-over effect or pre-adaptation to ambiguity from Session 3? Without counterbalancing, the claim that the model captures a real-time social sharing process rather than a sequential learning effect is highly questionable.

2 The model incorporates the DTW-based gaze synchrony index as a weight for the reward/risk term in the free energy functional. However, in social cognition, gaze synchrony is typically regarded as a consequence of joint attention or a by-product of successful interaction, rather than an explicit motivational driver for decision-making. The model assumes agents act "to stay in sync." Is this a biologically plausible mechanism for concept sharing, or is it a circular logic where the model uses the outcome of coordination to explain the cause of coordination?

3 The proposed dual-layer architecture relies on an Exponential Moving Average (EMA) for the "Double Bayes" metacognitive correction. Mathematically, this resembles a simple smoothing filter rather than a formal second-order inference over the partner’s internal model. What specific non-linear dynamics or representational advantages does the EMA layer provide that a properly parameterized Single Bayes layer cannot? Can the authors prove that the improved prediction accuracy isn't merely a result of basic recency bias in the data rather than the sophisticated "metacognitive reasoning" claimed?

4 The model’s performance is evaluated using data from only 14 participants. Achieving a self-prediction accuracy of 1.0 and partner prediction of 0.93 by tuning multiple hyperparameters on such a limited dataset strongly suggests overfitting. Did the authors perform any out-of-sample cross-validation? How can we be certain that the model reflects a generalized human cognitive process rather than an ex post facto curve-fitting of the idiosyncratic behaviors of these specific seven pairs?

Author Response

RESPONSE TO REVIEWER #1

We sincerely thank Reviewer #1 for the thorough and insightful comments. We have carefully addressed each concern and made substantial revisions to the manuscript accordingly. Below, we provide point-by-point responses to each comment.

================================================================================

Comment 1: Order Effects and Session Design

The study employs a fixed session order where the "Bot Pair" condition always precedes the "Human Pair" condition. Since the bot was programmed to provide consistently incorrect feedback on "Hard" items in Session 3, this phase essentially functions as a training period that induces a specific state of cognitive uncertainty. How can the authors distinguish between genuine "collaborative convergence" in Session 4 and a simple carry-over effect or pre-adaptation to ambiguity from Session 3? Without counterbalancing, the claim that the model captures a real-time social sharing process rather than a sequential learning effect is highly questionable.

Location: Section 1 (Introduction) and Section 4 (Discussion - Limitations)

Response:

We agree with the reviewer that the fixed session order raises important concerns regarding carry-over effects. Session 3 was not intended as a neutral control condition but was deliberately designed to establish a shared state of category ambiguity by pairing participants with a bot that consistently provides incorrect feedback on predefined items.

To address this point, we have revised the Introduction to clarify that the primary objective of this study is not to demonstrate a de novo emergence of shared cognition in Session 4, but to model how human–human pairs reconstruct and negotiate shared category concepts under pre-existing ambiguity within an active inference framework. We also explicitly state that the session order was fixed by design, because Session 3 serves to control the ambiguity context prior to Session 4.

In addition, we have added a limitation in the Discussion noting that, because counterbalancing was not implemented, potential order effects such as task familiarity cannot be fully excluded. Accordingly, comparisons across sessions—particularly regarding gaze synchrony—should be interpreted as exploratory.

Manuscript Changes:
- Section 1 (Introduction): Added explicit statement of study objectives and design rationale (end of Introduction section)
- Section 4 (Discussion/Limitations): Added acknowledgment of potential order effects (fourth paragraph of Limitations)

================================================================================

Comment 2: Role of Gaze Synchrony - Circular Logic Concern

The model incorporates the DTW-based gaze synchrony index as a weight for the reward/risk term in the free energy functional. However, in social cognition, gaze synchrony is typically regarded as a consequence of joint attention or a by-product of successful interaction, rather than an explicit motivational driver for decision-making. The model assumes agents act "to stay in sync." Is this a biologically plausible mechanism for concept sharing, or is it a circular logic where the model uses the outcome of coordination to explain the cause of coordination?

Location: Section 3.3.1 (Results - Active Inference Model Architecture) and Section 4 (Discussion)

Response:

We thank the reviewer for this important conceptual clarification. We agree that gaze synchrony should not be interpreted as an explicit motivational driver or as an outcome variable indicating successful agreement.

We have revised the manuscript to clarify that, in our model, gaze synchrony reflects largely unconscious processes of mutual attention and is used as a weighting parameter for the cooperative agreement term—that is, how strongly aligning one's action with the partner is emphasized during inference. Gaze synchrony therefore does not represent the result of agreement itself, but modulates the extent to which social information is incorporated into the inferential process leading toward agreement.

From this perspective, synchrony is not an optimized goal of the agents but a contextual factor reflecting an orientation toward coordination. We believe that this interpretation avoids circular reasoning and is consistent with accounts of social cognition in which synchrony emerges as a by-product of shared attention rather than as a consciously optimized objective.

Manuscript Changes:
- Section 3.3.1 (Results/Model Architecture): Added two paragraphs clarifying gaze synchrony's role as a weighting parameter (after Equation 6)
- Section 4 (Discussion): Added explanation that synchrony is a contextual factor, not an optimization target (second paragraph of Discussion)

================================================================================

Comment 3: EMA Layer and Metacognitive Inference

The proposed dual-layer architecture relies on an Exponential Moving Average (EMA) for the "Double Bayes" metacognitive correction. Mathematically, this resembles a simple smoothing filter rather than a formal second-order inference over the partner's internal model. What specific non-linear dynamics or representational advantages does the EMA layer provide that a properly parameterized Single Bayes layer cannot? Can the authors prove that the improved prediction accuracy isn't merely a result of basic recency bias in the data rather than the sophisticated "metacognitive reasoning" claimed?

Location: Section 3.3.1 (Results - Active Inference Model Architecture)

Response:

We thank the reviewer for this valuable comment. Upon reconsideration, we agreed that framing part of the model in terms of metacognitive inference was unnecessarily strong and potentially misleading.

We therefore removed the EMA-related description from the manuscript and revised the text to consistently describe the relevant processes in terms of cognitive belief updating, without invoking explicit metacognitive or formal second-order Bayesian inference. This revision clarifies the intended scope of the model and avoids overinterpretation of its theoretical implications.

Manuscript Changes:
- Section 3.3.1 (Results/Model Architecture): Replaced all instances of "metacognitive" with "iterative belief refinement"
- Section 3.3.1: Changed "metacognitive corrections" to "iterative belief refinement" in model description
- Section 3.3.1: Updated terminology in Equations 5 description and surrounding text

================================================================================

Comment 4: Sample Size, Overfitting, and Generalization

The model's performance is evaluated using data from only 14 participants. Achieving a self-prediction accuracy of 1.0 and partner prediction of 0.93 by tuning multiple hyperparameters on such a limited dataset strongly suggests overfitting. Did the authors perform any out-of-sample cross-validation? How can we be certain that the model reflects a generalized human cognitive process rather than an ex post facto curve-fitting of the idiosyncratic behaviors of these specific seven pairs?

Location: Section 4 (Discussion - Limitations)

Response:

We thank the reviewer for this important comment. We agree that the dataset used in the present study is relatively small and that parts of the analysis should be regarded as exploratory.

We have revised the Discussion section to explicitly acknowledge the potential risk of overfitting and to clarify the intended scope of the study. The primary aim of this work is not to establish definitive generalization performance, but to provide a proof of concept illustrating how collaborative interaction in virtual reality can be modeled within an active inference framework. Future work using larger datasets and independent samples will be necessary to further evaluate the robustness and generalizability of the proposed approach.

Manuscript Changes:
- Section 4 (Discussion/Limitations): Added new sixth paragraph acknowledging sample size limitations and potential overfitting
- Section 4 (Discussion/Limitations): Clarified that the study is a proof of concept rather than a definitive validation

================================================================================

We hope that these revisions adequately address the reviewer's concerns and strengthen the manuscript. We are grateful for the thoughtful feedback that has helped us clarify the scope and interpretation of our work.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present results of their study on modeling of socially shared cognition utilizing virtual reality. The aim was to deepen the understanding of how socially shared cognition is achieved.

They performed an experiment with 14 participants - students. However, demographic information is missing - age, gender, and also information about previous experience with VR.

4 different sessions were designed. The participants had to recognize shown objects and touch according to previous instruction. From the description, some details are not clear.

How many objects were altogether in each category (Kitchen, Garage) and difficulty level (easy, hard)? I suggest to present a table with all used objects.
How were the items selected in each session?
Did all participants see the same items in the same session?
How many items were shown in each session?
In session 3, were there 14 pairs (human-avatar)?
Was the appearance of the avatar the same for all participants?
How many pairs were in the human-human session? How were they selected?
Lines 166 and 167 describe more complex objects. Please, be more specific in the description. Explain what you mean by collaborative judgment in this experiment.

Author Response

RESPONSE TO REVIEWER #2

We sincerely thank Reviewer #2 for the constructive feedback and helpful suggestions. We have carefully addressed each comment and made the requested clarifications and additions to the manuscript. Below, we provide point-by-point responses to each comment.

================================================================================

Comment 1: Missing Demographic Information

They performed an experiment with 14 participants - students. However, demographic information is missing - age, gender, and also information about previous experience with VR.

Location: Section 2.1 (Method - Participants)

Response:

We thank the reviewer for pointing out the missing demographic information. We agree that such information is essential for transparency and reproducibility. Accordingly, we have revised the Participants section to include participants' age, gender distribution, and prior experience with virtual reality experiments.

Manuscript Changes:
- Section 2.1 (Participants): Added the following information in the second sentence:
"Participants consisted of 14 university students (mean age = 20.43 years, SD = .50), including 9 females and 5 males. All participants had normal or corrected-to-normal vision. Participants' prior experience with virtual reality was not formally controlled. However, the recruitment criteria specified that participants should have no prior experience participating in VR experiments."

================================================================================

Comment 2: Stimulus Set Description

How many objects were altogether in each category (Kitchen, Garage) and difficulty level (easy, hard)? I suggest to present a table with all used objects.

Location: Section 2.3 (Method - Apparatus and Stimuli)

Response:

We thank the reviewer for this helpful suggestion. We agree that a clearer description of the stimulus set improves the transparency and reproducibility of the study. Accordingly, we have revised the Methods section to explicitly describe the composition of the stimulus set, including the number of objects, their category (Kitchen / Garage), difficulty level (Easy / Hard), and presentation conditions.

Rather than presenting the stimuli in a separate table, all stimulus objects are explicitly shown in the corresponding figure, which allows direct visual inspection of the complete stimulus set.

Manuscript Changes:
- Section 2.3 (Apparatus and Stimuli): Added the following description in the paragraph describing reaction times and objects:
"The objects used for the category judgment task consisted of 12 stimulus items, with three objects assigned to each combination of category (Kitchen / Garage) and difficulty level (Easy / Hard) (see Fig. 2). Each object was presented with the handle positioned either on the left or the right side, resulting in a total of 24 target stimuli that were randomly presented during the experiment."

================================================================================

Comment 3: Item Selection and Presentation Procedure

How were the items selected in each session? Did all participants see the same items in the same session? How many items were shown in each session?

Location: Section 2.4 (Method - Experimental Design)

Response:

We thank the reviewer for these important questions. We have revised the Methods section to clarify the stimulus selection and presentation procedure. All target objects selected in the preliminary experiment were used in the main experiment. The presentation order was randomized while controlling for equal occurrences of each category and difficulty level. In addition, all participants were presented with the same set of target objects within each session, and the number of items presented per session is now explicitly reported.

Manuscript Changes:
- Section 2.4 (Experimental Design): Added new paragraph after the Design Rationale paragraph:
"All target objects selected in the preliminary experiment were used in the main experiment. The presentation order of the objects was randomized; however, the sequence was controlled such that each category (Kitchen and Garage) and difficulty level (Easy and Hard) appeared an equal number of times. Within each session, all participants were presented with the same set of target objects. Each session consisted of 24 trials corresponding to the 24 target stimuli (12 unique objects × 2 handle orientations)."

================================================================================

Comment 4: Session 3 Pairing and Avatar Appearance

In session 3, were there 14 pairs (human-avatar)? Was the appearance of the avatar the same for all participants?

Location: Section 2.1 (Method - Participants)

Response:

Yes. In Session 3, each participant interacted with a single avatar, resulting in a total of 14 human–avatar pairs. The avatar used in the experiment was a neutral human-like character selected from VRoid models permitted for research use, and both the avatar's appearance and action policy were identical across all participants. In the revised manuscript, we also clarify that the box-shaped avatar shown in the figure is provided as an illustrative example and does not depict the exact appearance used in the experiment.

Manuscript Changes:
- Section 2.1 (Participants): Added the following information in the paragraph describing Sessions 3 and 4:
"In Session 3, each of the 14 participants was paired with a bot avatar, creating 14 human-avatar pairs. All bot avatars used a neutral VRoid humanoid model with identical appearance and behavioral policies across all participants."

================================================================================

Comment 5: Session 4 Pairing Method

How many pairs were in the human-human session? How were they selected?

Location: Section 2.1 (Method - Participants)

Response:

In the human-only session (Session 4), the fourteen participants were paired into seven human–human dyads. Pairing was conducted according to a predefined experimental protocol, without any prior coordination or consideration of existing relationships between participants. This information has been clarified in the revised manuscript.

Manuscript Changes:
- Section 2.1 (Participants): Added the following information in the paragraph describing Sessions 3 and 4:
"In Session 4 (Human condition), participants were paired into 7 human-human pairs according to a predefined protocol based on consecutive participant IDs (e.g., IDs 1 and 2, 3 and 4). Pre-existing relationships between participants were not considered in the pairing protocol."

================================================================================

Comment 6: Operational Definition of Complexity and Collaborative Judgment

Lines 166 and 167 describe more complex objects. Please, be more specific in the description. Explain what you mean by collaborative judgment in this experiment.

Location: Section 2.3 (Method - Apparatus and Stimuli)

Response:

We appreciate the reviewer's request for clarification. In the revised manuscript, we now explicitly define object difficulty based on classification accuracy obtained in a preliminary experiment. Objects with higher accuracy were categorized as easy, whereas those with lower accuracy were categorized as hard.

In addition, we define collaborative judgment as the process by which participants resolve ambiguous category decisions through interaction with a partner rather than relying solely on individual judgment. This definition has been added to the manuscript to clarify the meaning of this term in the context of our experiment.

Manuscript Changes:
- Section 2.3 (Apparatus and Stimuli): Added operational definitions in the paragraph describing difficulty levels:
"Objects were operationally defined as 'more complex,' 'Hard,' or 'ambiguous' when they achieved categorization accuracy rates below .80 in preliminary testing, indicating substantial uncertainty in category membership. These high-difficulty objects included ambiguous items such as camping equipment that required collaborative judgment for accurate categorization. 'Collaborative judgment' refers to trials where both participants needed to interact to resolve category uncertainty, as the objects' ambiguous nature made individual categorization unreliable."

================================================================================

We hope that these revisions adequately address all of the reviewer's concerns and improve the clarity and completeness of the manuscript. We are grateful for the constructive feedback that has helped us strengthen the Methods section and improve the transparency of our experimental procedures.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Reviewer 2 Report

Comments and Suggestions for Authors

The authors responded satisfactorily all comments and modified the text accordingly.