Article
Peer-Review Record

Proximity Loses: Real-Time Resolution of Ambiguous Wh-Questions in Japanese

Languages 2025, 10(12), 288; https://doi.org/10.3390/languages10120288
by Chie Nakamura, Suzanne Flynn, Yoichi Miyamoto and Noriaki Yusa
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 12 August 2025 / Revised: 12 November 2025 / Accepted: 12 November 2025 / Published: 26 November 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

see attached

Comments for author File: Comments.pdf

Author Response

We thank the reviewer for the detailed and constructive comments, which have helped us clarify our assumptions and strengthen our empirical support. Below, we address each point in turn.

Comment 1:

The stipulated linking hypothesis assumes that readers actively look at the image associated with the wh-element, but there is no independent support for this assumption. Eye movements might instead simply follow the mention of events (e.g., “tell” → schoolyard, “catch” → park), and the same patterns could be explained without reference to wh-interpretation.

Response:
We agree that it is crucial to evaluate whether gaze patterns reflect syntactic interpretation or simply track mentioned events. In the revised manuscript, we have addressed this alternative explanation explicitly. We added a new subsection (Section 3.2.4) reporting eye-tracking results from the unambiguous condition, which serves as a baseline. If eye movements only followed event mention, then participants should have shifted their gaze toward the EC picture whenever the EC event was described, even in the unambiguous condition. However, the new analyses show that participants consistently maintained a preference for the MC picture throughout the unambiguous sentences, with no temporary increases in EC fixations when EC events were mentioned. This provides independent support for the linking hypothesis that gaze patterns reflect incremental wh-interpretation rather than simple referential alignment.

Comment 2:

The absence of eye-tracking data for the unambiguous condition is surprising, since this was part of the design. The unambiguous condition is needed as a baseline for interpreting the ambiguous condition.

Response:
We thank the reviewer for pointing this out. In the revised manuscript, we now report the full permutation analyses for the unambiguous condition (Figures 5 and 6). These analyses revealed a robust overall preference for the MC picture, with significant clusters at several time windows (e.g., around 1000 ms, 2000 ms, and at the sentence-final region). Including this baseline allows us to directly compare ambiguous and unambiguous conditions. The contrast confirms that the temporary EC fixations observed in the ambiguous condition reflect interpretive evaluation of structural alternatives, not simply the mention of events.

Comment 3:
Without a baseline condition, the observed patterns could be explained entirely by participants looking at the picture of the event being described (e.g., schoolyard after “tell,” park after “catch”).

Response:
We have revised the Results text and figure captions to clarify how the ambiguous and unambiguous conditions diverge. The ambiguous condition shows temporary EC looks at points where the embedded event is described, whereas the unambiguous condition does not. This difference rules out a purely event-mention account. We highlight this contrast in the new Section 3.2.4 of the Results.

Comment 4:
Figures 3 and 4 could be interpreted as consistent with the event-mention account.

Response:
The addition of the unambiguous analyses addresses this concern directly. By showing that gaze patterns in the unambiguous condition remain anchored to the MC picture, even when EC events are mentioned, we demonstrate that the ambiguous condition involves additional interpretive processes. This strengthens our claim that eye movements provide evidence for incremental wh-interpretation.

Other minor comment:
No data from the distraction picture was reported in the figure or in the text.

Response:
We thank the reviewer for pointing this out. We have added a clarifying footnote in the Results section. As noted there, the distractor picture attracted on average about 11% of looks, peaking at ~23% early in the sentence and decreasing to below 10% thereafter. Because this rate is well below chance level (33% for three pictures), we considered these data not theoretically informative and therefore did not include them in the reported results.

We are grateful for this set of comments, which led us to include the unambiguous baseline analysis. This addition clarifies the interpretation of our results and directly rules out the possibility that gaze patterns simply reflect event mentions. We believe the revised manuscript now provides a stronger empirical foundation for our linking hypothesis.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a visual-world eye-tracking experiment, showing evidence that Japanese speakers do not resolve wh-dependencies at the (linearly) earliest possible position, but rather prefer resolving the attachment at a structurally more prominent position.

My biggest concern is with the within-subject question-condition manipulation (ambiguous vs. unambiguous). The unambiguous condition forces a main clause attachment parse. These constitute essentially half of the critical items, more than enough to prime the participants for a main clause attachment parse for the ambiguous condition. Therefore, the main finding of the paper could just be an experimental artifact and does not actually reflect the default attachment resolution strategy of Japanese speakers. Ideally the experimenters should run a version of this that does not contain the unambiguous condition. Less ideally, I imagine one could add presentation order as a factor and test if the offline responses change throughout the course of the experiment (if the rate of MC responses increases over time, that’s a pretty good indicator of priming happening).

Relatedly, towards the end of the discussion section, the author(s) mentioned the difference between Omaki et al.’s (2014) results and the current findings, and suggested that the previous designs could not capture the fine-grained time-course of filler-gap dependency formation. But the offline judgment of the current study (figure 2) still shows the opposite pattern of what Omaki et al. found in essentially the same offline task. More explanation should be provided for the difference between Omaki et al. (2014) and the results in figure 2.

Some more minor points:

  • The participants were explicitly told to “indicate their final interpretation of the wh-phrase’s referent by selecting a picture”. Depending on how the instructions are worded, the participants might already know that the sentences are true questions (as opposed to sentences with wh-indefinites that do not have clear referents) and don’t need to wait for the sentence-final question particles, contrary to the author’s interpretation of the results. It would be helpful to provide the actual instructions given to the participants.
  • The glosses in (7) are not aligned, making them really hard to read.
  • Some texts in the paper are randomly italicized for no apparent reason.

Author Response

We thank the reviewer for raising these important points. We respond to each in turn.

Comment 1:
The unambiguous condition may have primed participants for main-clause (MC) interpretations. Since these constitute half of the critical items, the main finding could be an artifact of priming. Ideally, the experiment should be re-run without the unambiguous condition; less ideally, presentation order could be tested to see if MC responses increase across the experiment.

Response:

We thank the reviewer for raising this important concern. To evaluate the possibility that the inclusion of unambiguous items might have primed participants toward a main clause interpretation in the ambiguous condition, we reanalyzed the offline key-press responses with trial order included as a fixed effect in a mixed-effects linear model (random intercepts for participants and items). The results showed a significant bias toward selecting the MC picture (β = 0.52, SE = 0.08, t = 6.82, p < .001), but crucially no effect of trial order (β = 0.0005, SE = 0.002, p = 0.81). This indicates that participants’ preference for the MC interpretation remained stable throughout the experiment and was not driven by cumulative exposure to unambiguous items. We have added this analysis to Section 3.1 of the revised manuscript and clarified in the Discussion that the results are unlikely to be an artifact of priming from the unambiguous condition.
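For transparency, a minimal R sketch of this model (using lme4 with lmerTest for p-values on fixed effects); the data frame and column names here are illustrative placeholders, not the variable names in our actual analysis scripts:

```r
library(lmerTest)  # lme4 plus Satterthwaite p-values for lmer fixed effects

# Illustrative reanalysis with trial order as a fixed effect.
# mc_choice: 1 if the MC picture was selected, 0 otherwise (placeholder names).
m <- lmer(mc_choice ~ trial + (1 | participant) + (1 | item), data = dat)
summary(m)  # intercept = overall MC bias; trial = change across the session
```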

Comment 2:
The offline results (Figure 2) differ from Omaki et al. (2014), who found the opposite pattern in a similar task. More explanation should be provided for this discrepancy.

Response:

We agree with the reviewer that the divergence between our offline responses and those of Omaki et al. (2014) requires clarification. We have expanded the Discussion to address this point directly. In particular, our participants showed a preference for the MC interpretation, whereas Omaki et al. reported an EC preference. We argue that this discrepancy is likely due to differences in task design. Omaki et al.’s experiment involved long narrative contexts introducing four different locations sequentially, with the EC location mentioned most recently before the target wh-question. This design makes the EC location more accessible at the point of answering, and the additional memory demands of four locations may have further promoted reliance on the most recently mentioned location. Moreover, their study included only 16 participants and eight items, raising concerns about statistical power. We suggest that these factors may have contributed to the EC preference in their offline task, rather than reflecting a robust default bias. In contrast, our design minimized narrative load, balanced the visual referents, and tracked interpretation online, revealing that participants ultimately resolved the dependency at the MC site. These revisions now appear at the end of the Discussion section (Section 4).

Other minor Comment:
The participants were explicitly told to “indicate their final interpretation of the wh-phrase’s referent by selecting a picture”. Depending on how the instructions are worded, the participants might already know that the sentences are true questions and don’t need to wait for the sentence-final question particles, contrary to the author’s interpretation of the results. It would be helpful to provide the actual instructions given to the participants.

Response:
We thank the reviewer for pointing this out. We agree that it is important to clarify the exact instructions. The instructions given to participants were: “In this experiment, you will hear a short sentence twice, and a question following it. Your task is to choose one of the pictures for your answer by pressing 1, 2, or 3.” These instructions have now been included in the Appendix of the manuscript for transparency.

This instruction indeed made clear that participants were answering a question. Importantly, however, the eye-tracking results show that despite this knowledge, participants still waited until the appearance of the clause-final particle -ka to finalize their interpretation. This supports our claim that Japanese comprehenders adopt a structurally cautious parsing strategy, withholding commitment until reliable syntactic cues become available, contrary to what would be expected under an active gap-filling account.

Reviewer 3 Report

Comments and Suggestions for Authors

Overall Evaluation

As this study explains, most prior evidence for the Active Filler Strategy comes from head-initial languages such as English and German, where readers consistently show a preference for main-clause interpretations in ambiguous wh-questions due to an early gap-resolution bias. For example, in a sentence like Where did Lizzie tell someone that she was going to catch butterflies? (p.2), comprehenders tend to resolve the dependency at the main clause site. Two main explanations have been proposed for this tendency. The structural account suggests that comprehenders prefer syntactically higher positions (e.g., the matrix clause) because of structural simplicity and grammatical prominence. The locality account, on the other hand, proposes that comprehenders favor gap sites that are linearly closer to the filler, in line with Dependency Locality Theory. A limitation of the English evidence is that the matrix gap is both structurally higher and linearly closer, making it hard to tell which factor is responsible.

Japanese provides an important test case because the embedded clause comes before the main clause. In this configuration, the structural account predicts attachment to the higher main verb, while the locality account predicts attachment to the linearly closer embedded verb. The present study made effective use of this property of Japanese to ask whether comprehenders, when resolving long-distance dependencies in real time, link the filler to a gap as early as possible (an active resolution strategy), or whether they instead wait for more structural information (a wait-and-see strategy). This is a carefully designed and theoretically important investigation that uses Japanese word order to distinguish between these two competing explanations.

The study combines key-press tasks with eye-tracking and cluster-based permutation analysis to trace the moment-by-moment course of interpretation. The eye-tracking results show early predictive fixations to the main clause (MC) picture, temporary shifts to the embedded clause (EC) picture when the embedded event was mentioned, and a return to the MC picture once the clause-final particle -ka appeared. Crucially, these shifts occurred before the appearance of explicit EC cues (e.g., -to), suggesting that temporary EC looks reflect referential alignment rather than syntactic reanalysis. Both key-press and eye-movement data show that participants did not commit to the first grammatically available EC interpretation. Instead, they engaged in predictive but structurally cautious parsing, waiting until clause-final evidence confirmed the interrogative use of the wh-phrase and ultimately favoring the structurally higher MC site.

Taken together, the findings support structural rather than linear locality in Japanese filler-gap resolution. They also challenge previous claims that Japanese processing is primarily guided by linear proximity, and they highlight the importance of language-specific morpho-syntactic cues such as clause-final particles. By combining offline judgments with fine-grained temporal eye-movement data, this manuscript makes a clear contribution to the literature. Overall, it is a well-executed study that advances our understanding of filler-gap resolution across languages. While the paper is clearly written and well argued, I believe that adding some clarifications would make the contribution even stronger.

Page 6 — In section b. MC-EC event order condition, the text is indented, but in a. EC-MC event order condition it is not. Please align the formatting of these two subsections. In b. MC-EC event order condition, ami-o appears as am-o. This should be corrected.


Procedure (p. 6-7) ---- In many previous studies on this topic, the stimulus sentences were often quite complex and difficult to interpret. By contrast, this study carefully selected stimuli that were much easier to understand, which I find very good. However, I do have some concerns about the experimental design. The combination of context and wh-question manipulations produced a 2 × 2 factorial design, resulting in four experimental conditions, with each participant completing 24 target items and 36 fillers. With this setup, it seems difficult to fully counterbalance the materials, and participants may encounter the same or very similar structures multiple times. Could the authors clarify how the stimuli were actually presented? In particular, how was repetition managed, and how do the authors assess the possible influence of repeated exposure on the results?


Analysis for Accuracy (p. 7) — I find the accuracy results very interesting. As the authors noted, the effect was statistically significant, but rather ‘subtle’. However, the analysis method is not clearly described, making it difficult to know exactly how the data were evaluated. I wonder whether the results would hold under a more rigorous statistical approach. In particular, it may be important to consider both individual and item variability by using a linear mixed-effects logistic regression (e.g., glmer in R), and to include trial order as a predictor. Such an analysis would provide a more robust assessment of the effect, and I would be very interested to see whether the pattern observed here is confirmed.


Permutation analysis (p. 8) ---- The description of the cluster-based permutation analysis is appropriate and follows standard practice (e.g., Chang et al., 2018). The method appears statistically sound. It might be helpful if the authors could briefly justify choices such as the 20 ms bin size and the number of permutations to make the rationale for the analysis more transparent.


Figure and Results Presentation (p. 8-10) ---- I found Figure 3 to be very clear and effective in presenting the results. This visualization makes the subsequent whole-sentence analysis much easier to follow. Likewise, Table 1 presents the results in a simple and accessible way. Importantly, the interpretation following the appearance of the EC marker -to is also convincing: it provides the first unambiguous syntactic cue for an EC structure, making a structural reanalysis from an MC to EC interpretation necessary. The transition to the final analysis of gaze behavior in the sentence-final region is logical and well motivated. Overall, the results and interpretations are presented in a very clear and coherent manner.


Formatting (pp. 10–11) ---- The text on pages 10 to 11 appears to be entirely italicized. This seems unintentional. Please revert these pages to regular type.


EC-to-sentence-final region analysis (pp. 10–11) ---- Figure 4 clearly presented the eye-movement patterns recorded after the sentence presentation. The analysis appeared to be based on eye movements reflecting regression patterns across phrase-level regions throughout the entire sentence. The authors interpret the late shift in gaze toward the MC picture as indicating the resolution of the wh-attachment, consistent with participants’ final behavioral choices. As participants encountered the MC verb followed by the question particle (iimashita-ka, told-Q), they appeared to attach the fronted wh-phrase to the matrix predicate, adopting the MC interpretation. Table 2 also provides a clear and concise summary of these results. Overall, this section presents the data and interpretation in a clear, coherent, and convincing manner.


Discussion ---- In this section, the portions that were previously italicized no longer appear in italics. Please check the formatting and ensure consistency throughout the manuscript.


Line 479 ---- … such as the EC marker -to and the clause0final question particle -ka. This part should be ‘the clause-final.’


The present study shares some similarity with Tamaoka and Mansbridge (2019, Gengo kenkyu, 155, 35-63), who used eye-tracking to examine the processing of scrambled Japanese sentences. Their results showed that readers often made regressions from the verb back to the object noun phrase, especially when both the subject and object were animate, suggesting that readers revisited earlier constituents to complete sentence interpretation. Likewise, in the current study, regressions play a crucial role in identifying the specific location referred to by doko ‘where’, indicating that readers actively re-inspect relevant regions to establish structural and referential coherence during comprehension.


The Discussion and Conclusion sections provide a comprehensive summary of the results. While there is some repetition, the content is closely aligned with the actual sentence-processing evidence, making the argumentation clear and accessible.

Author Response

We thank the reviewer for the very thoughtful and constructive comments, which highlight both the strengths of the study and areas where clarification and additional detail would improve the manuscript. We respond point by point below.

Comment 1:
Formatting issues: (i) indentation inconsistent between subsections, (ii) “ami-o” appears as “am-o”, (iii) pages 10–11 appear entirely italicized, (iv) “clause0final” should be “clause-final.”

Response:

We thank the reviewer for catching these formatting and typographical issues. In the revised manuscript, we have aligned the formatting of subsections and corrected the typos.

Comment 2:
Clarification on stimuli presentation and repetition. With a 2×2 design and 24 critical items, participants may have encountered similar structures multiple times. Please clarify how this was managed and assess possible influence of repetition.

Response:

We appreciate this concern and have added clarification to the Materials and Methods section. Each participant completed 24 target items distributed across the four experimental conditions, with items counterbalanced across lists. No participant saw the same item in both ambiguous and unambiguous versions; instead, each participant encountered each item in only one condition. Moreover, to test whether repetition or cumulative exposure influenced responses, we included trial order as a fixed effect in our key-press response analysis. As reported in Section 3.1, trial order did not significantly affect participants’ responses (β = 0.0005, SE = 0.002, p = .81), suggesting that repeated exposure did not drive the observed preference for the MC interpretation.
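As an illustration of the counterbalancing scheme, a minimal Latin-square sketch in R; the condition labels below are placeholders for our four 2 x 2 cells, not the labels used in our materials:

```r
# Illustrative Latin-square list construction: 24 items x 4 conditions
# -> 4 lists; each list presents every item in exactly one condition,
# with each condition appearing 6 times per list.
items      <- 1:24
conditions <- c("EC-MC/ambiguous", "EC-MC/unambiguous",
                "MC-EC/ambiguous", "MC-EC/unambiguous")
lists <- lapply(1:4, function(l) {
  data.frame(item      = items,
             condition = conditions[((items + l - 2) %% 4) + 1])
})
table(lists[[1]]$condition)  # 6 items per condition on each list
```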

Comment 3:
Accuracy analysis: method not clearly described; suggest reanalyzing using mixed-effects logistic regression (e.g., glmer in R), including trial order as a predictor.

Response:

We thank the reviewer for this helpful suggestion. In the revised manuscript, we have clarified our analysis procedure and reanalyzed the data using mixed-effects models with random intercepts for participants and items, and with trial order included as a fixed effect.

For the filler and unambiguous questions, accuracy rates were high (90.6% and 85.1%, respectively), and the models confirmed these were significantly above chance. For the ambiguous wh-questions, we focused on participants’ offline picture choices, coding whether they selected the main clause (MC) picture out of the three response options (MC, EC, distractor). Because our interest was in the proportion of MC selections relative to all available responses, we analyzed the data using a linear mixed-effects model (lmer), which is well suited for modeling proportion data. This approach is also consistent with the accuracy analyses reported for filler and unambiguous conditions.

The results showed a significant preference for the MC picture (β = 0.52, SE = 0.08, t = 6.82, p < .001), with no effect of trial order (β = 0.0005, SE = 0.002, p = .81). Thus, participants’ bias toward the MC interpretation was robust, stable across trials, and not driven by repeated exposure. We have revised Section 3.1 to include this analysis and explanation.

Comment 4:
Permutation analysis: choices such as 20 ms bin size and number of permutations should be justified.

Response:

We thank the reviewer for this suggestion. In the revised manuscript, we have clarified the rationale for our analysis parameters. Following Chang et al. (2018), who note that permutation analyses should be conducted at the sampling rate of the eye-movement coding (every 33 ms in their study), we adopted the same principle. Our eye-tracking data was recorded at a higher temporal resolution (every 20 ms), and therefore we used 20 ms time bins to match the sampling rate of our data. For the number of permutations, we followed Chang et al.’s (2018) practice of using 1,000 permutations, which is sufficient to obtain stable estimates while keeping computational demands manageable.
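For readers interested in the mechanics, a schematic R version of such a cluster-based permutation test, using sign-flip permutations over participants; all object names here are illustrative rather than taken from our analysis code:

```r
# Schematic cluster-based permutation test, assuming a matrix `looks` of
# by-participant MC-minus-EC fixation-proportion differences:
# rows = participants, columns = 20 ms time bins; 1,000 permutations,
# as in Chang et al. (2018).
set.seed(1)
t_obs <- apply(looks, 2, function(x) t.test(x)$statistic)
crit  <- qt(0.975, df = nrow(looks) - 1)  # two-tailed threshold at alpha = .05

# Mass (summed |t|) of the largest run of adjacent supra-threshold bins
cluster_mass <- function(tvals) {
  runs   <- rle(abs(tvals) > crit)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  masses <- mapply(function(s, e, sig) if (sig) sum(abs(tvals[s:e])) else 0,
                   starts, ends, runs$values)
  max(masses)
}

# Null distribution: randomly flip each participant's sign and recompute
null_mass <- replicate(1000, {
  flips <- sample(c(-1, 1), nrow(looks), replace = TRUE)
  cluster_mass(apply(looks * flips, 2, function(x) t.test(x)$statistic))
})
p_cluster <- mean(null_mass >= cluster_mass(t_obs))  # p for the largest cluster
```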

Comment 5:

The reviewer highlights the relevance of findings from Mansbridge & Tamaoka (2019) on regressions in Japanese sentence processing, suggesting that regressions play a crucial role in establishing structural and referential coherence.

Response:
We thank the reviewer for pointing us to this relevant study. We have now incorporated a reference to Mansbridge & Tamaoka (2019) in the Discussion. While their focus was on scrambled sentences rather than wh-dependencies, their findings indicate that Japanese readers often revisit earlier material to achieve structural coherence. This pattern, consistent with our results, underscores that Japanese parsing cannot be reduced to simple linear locality, but instead reflects structure-sensitive processes. 

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors addressed the issues I brought up and I think the paper can be published. 

Author Response

We thank the reviewer for the positive evaluation and for recommending the paper for publication. We appreciate the reviewer’s constructive feedback, which helped us improve the clarity and presentation of the manuscript.

Reviewer 2 Report

Comments and Suggestions for Authors

Thanks for addressing my comments from the previous round. Many of my questions are addressed in the revised manuscript. That said, I do still have some serious concerns about the statistical analysis for key-press responses:

p.7, the opening sentence of section 3.1: the current text makes it seem like trial order was the only fixed effect included, which shouldn't be the case: the section goes on to report responses by condition (filler, ambiguous, unambiguous), so condition is necessarily a predictor (or perhaps the authors got those numbers through subset analysis, in which case it should be made clear). 

It also isn't clear what the beta estimates are for: for example, on line 305-307, it says "Participants’ results indicated high accuracy (90.6%) for filler items ... (β = 0.95, SE = 0.06, t = 16.83, p < .001)". What is this beta estimate for? The proportion of correct responses for the filler items? Please be explicit.

Also, the choice of linear regression (as opposed to logistic regression) is problematic in this case. The response of key-press is a discrete variable (three choices), rather than a continuous numeric value. One could calculate the proportion of MC choices for each item/participant and treat that as the dependent variable, but then it wouldn't be possible to include by item/participant random intercepts (since there would only be one data point for each item/participant, depending on how the aggregation is done). The author(s) either didn't present the regression model correctly in 3.1, or the analysis was not statistically valid. 


Author Response

We thank the reviewer for these detailed comments, which helped us clarify the description of the behavioral analysis. We have revised Section 3.1 to make the model structure, the meaning of the reported β estimates, and the rationale for the mixed-effects modeling approach fully explicit.

The most substantial change is that we have reanalyzed the key-press response data using a generalized linear mixed-effects model (GLMM) with a binomial link, following the reviewer’s suggestion. The overall response pattern remained consistent with that reported in the previous version; however, for ambiguous wh-questions, the GLMM did not reveal a significant bias toward the main clause (MC) interpretation, whereas a complementary aggregated t-test confirmed a reliable group-level bias. We believe that this difference arises because the GLMM, while statistically conservative and valuable for controlling random variability, may not fully capture subtle but consistent group-level trends when the effect size is small and response proportions are near chance (Jaeger, 2008; Barr et al., 2013; Luke, 2017). In such cases, an aggregated t-test can provide a complementary perspective that reflects meaningful overall tendencies in the data. Accordingly, we report both analyses and interpret the results in light of their respective sensitivities and implications.

We have updated the corresponding Results section and the accompanying discussion to reflect these changes. Please see below for our detailed, comment-by-comment responses.

Comment 1:

p.7, the opening sentence of section 3.1: the current text makes it seem like trial order was the only fixed effect included, which shouldn't be the case: the section goes on to report responses by condition (filler, ambiguous, unambiguous), so condition is necessarily a predictor (or perhaps the authors got those numbers through subset analysis, in which case it should be made clear).

Response:

We thank the reviewer for pointing this out. In the revised manuscript, we clarified that the key-press responses were analyzed separately for each condition (filler, unambiguous, and ambiguous) using the same model structure: trial order as a fixed effect and random intercepts for participants and items in a GLMM. We now explicitly state that subset analyses were conducted for each condition type, as the dependent variable differed in interpretability across them (accuracy for filler and unambiguous items, interpretive preference for ambiguous items). These revisions should make the analysis structure and rationale clearer at the beginning of Section 3.1.
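Schematically, the per-condition subset analyses take the following form in R; the data frame and column names are placeholders, not those of our actual scripts:

```r
library(lme4)

# Sketch of the per-condition subset models. The binary outcome is
# accuracy for filler/unambiguous items and MC-picture choice (1 = MC)
# for ambiguous items (illustrative names throughout).
fit_subset <- function(d) {
  glmer(response ~ trial + (1 | participant) + (1 | item),
        data = d, family = binomial)
}
m_filler  <- fit_subset(subset(dat, type == "filler"))
m_unambig <- fit_subset(subset(dat, type == "unambiguous"))
m_ambig   <- fit_subset(subset(dat, type == "ambiguous"))
```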

Comment 2:

It also isn't clear what the beta estimates are for: for example, on line 305-307, it says "Participants’ results indicated high accuracy (90.6%) for filler items ... (β = 0.95, SE = 0.06, t = 16.83, p < .001)". What is this beta estimate for? Please be explicit.

Response:

We thank the reviewer for this valuable comment. In the previous version of the manuscript, the β estimates reflected the model-estimated mean proportions of accuracy or MC choices (e.g., 0.5 for 50%), as the 0-1 data were analyzed as continuous proportions using linear mixed-effects models. However, following the reviewer’s advice, we have now reanalyzed the key-press response data using GLMMs with a binomial link. As a result, in the revised analyses, the β coefficients represent log-odds estimates of correct or MC responses rather than proportions.
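For readers less familiar with the logit scale, the inverse-logit function maps these coefficients back to probabilities; for example, applying it to the ambiguous-condition estimate reported in our response to Comment 3 below:

```r
# Inverse-logit conversion of a GLMM log-odds estimate to a probability
plogis(0.29)  # ~0.57: a model-estimated ~57% probability of an MC choice
```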

Comment 3:

Also, the choice of linear regression (as opposed to logistic regression) is problematic in this case. The response of key-press is a discrete variable (three choices), rather than a continuous numeric value. One could calculate the proportion of MC choices for each item/participant and treat that as the dependent variable, but then it wouldn't be possible to include by item/participant random intercepts (since there would only be one data point for each item/participant, depending on how the aggregation is done). The author(s) either didn't present the regression model correctly in 3.1, or the analysis was not statistically valid.

Response:

We thank the reviewer for this valuable comment concerning the modeling approach for the key-press data. We fully agree that a linear regression approach is not optimal for categorical key-press responses. Following the reviewer’s suggestion, we have reanalyzed the data using GLMMs with a binomial link, which can more directly model the probability of a binary outcome while accounting for participant- and item-level variability (Jaeger, 2008).

As described in the revised analysis, this approach showed that accuracy was significantly greater than chance for filler and unambiguous questions. However, for ambiguous wh-questions, although there were numerically more MC responses than EC responses (54.9% vs. 43.7%), the GLMM did not reveal a significant preference for the MC over the EC picture (β = 0.29, SE = 0.28, z = 1.05, p = 0.296). Given that the observed difference between MC and EC choices was small, it is likely that once participant- and item-level variability were taken into account, the overall bias toward MC interpretations did not reach significance. Previous methodological discussions (e.g., Jaeger, 2008; Barr et al., 2013; Luke, 2017) have noted that when response proportions are close to chance (around 50%), a binary logit model has limited sensitivity to small graded differences, particularly when random effects are included.

Thus, given the limited sensitivity of mixed-effects models to small effects near chance level, we decided to conduct a complementary analysis at the aggregated level to assess whether a group-level bias was nevertheless present. For this purpose, responses were aggregated by participant and item, and the proportion of MC choices was compared against chance (.50) using a one-sample t-test. This analysis revealed that the MC picture was chosen significantly more often than the EC picture (M = 0.55, SD = 0.27, t(418) = 2.01, p = 0.045), indicating a small but reliable interpretive bias toward the MC interpretation.
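A minimal R sketch of this aggregated analysis (object names illustrative):

```r
# Aggregate MC-choice proportions over participant-item cells and test
# against chance (.50); `amb` holds the ambiguous-condition trials.
agg <- aggregate(mc_choice ~ participant + item, data = amb, FUN = mean)
t.test(agg$mc_choice, mu = 0.5)  # one-sample t-test against chance
```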

The discrepancy between the non-significant GLMM result and the significant t-test likely reflects the conservative nature of mixed-effects models when accounting for random variability. While the GLMM provides a stringent test for participant- and item-level effects, the aggregated t-test offers a clearer view of the overall pattern at the group level. We therefore report both analyses to provide a more complete picture of the data. Together, these results suggest that while the bias toward the MC interpretation was not as robust as the clear accuracy observed for filler and unambiguous questions, the overall tendency was statistically reliable when responses were considered at the group level (i.e., across all participants and items combined).
