Article

Evaluating Large Language Models on Chinese Zero Anaphora: A Symmetric Winograd-Style Minimal-Pair Benchmark

1 Institute of Modern Languages and Linguistics, Fudan University, Shanghai 200433, China
2 School of Data Science, Fudan University, Shanghai 200433, China
* Author to whom correspondence should be addressed.
Symmetry 2026, 18(1), 47; https://doi.org/10.3390/sym18010047
Submission received: 17 November 2025 / Revised: 23 December 2025 / Accepted: 24 December 2025 / Published: 26 December 2025
(This article belongs to the Special Issue Symmetry and Asymmetry in Natural Language Processing)

Abstract

This study investigates how large language models (LLMs) handle Chinese zero anaphora under symmetric minimal-pair conditions designed to neutralize shallow syntactic cues. We construct a Winograd-style benchmark of carefully controlled sentence pairs that require semantic interpretation, pragmatic inference, discourse tracking, and commonsense reasoning rather than structural heuristics. Using GPT-4, ChatGLM-4, and LLaMA-3 under zero-shot, one-shot, and few-shot prompting, we assess both accuracy and the reasoning traces generated through a standardized Chain-of-Thought diagnostic. Results show that all models perform consistently on items solvable through local cues but display systematic asymmetric errors on 19 universally misinterpreted sentences that demand deeper discourse reasoning. Analysis of these failures reveals weaknesses in semantic role differentiation, topic-chain maintenance, logical-relation interpretation, pragmatic inference, and long-distance dependency tracking. These findings suggest that while LLMs perform well on simpler tasks, they still face challenges in interpreting contextually omitted arguments in Chinese. The study provides a new controlled evaluation resource, an interpretable error analysis framework, and evidence of differences in symmetric versus asymmetric reasoning behaviors in LLMs. Future research could expand the current benchmark to longer discourse contexts, incorporate multi-modal or knowledge-grounded cues, and explore fine-tuning LLMs on discourse data, helping clarify whether asymmetric patterns stem from deeper reasoning challenges or from interactions between models and the evaluation format.

1. Introduction

Zero anaphora refers to a linguistic phenomenon in which an obligatory argument is not overtly expressed, creating a pronoun-like gap that relies on a preceding discourse entity for its interpretation [1,2]. Because the missing referent provides no overt morphological cues, resolving zero anaphora requires integrating discourse structure, semantic relations, and commonsense reasoning [3]. These demands make it one of the most challenging forms of coreference for both traditional NLP systems and modern large language models (LLMs) [4].
The challenge becomes even more pronounced in Chinese. Among pro-drop languages, in which zero anaphora is pervasive, Chinese is particularly difficult because it lacks the morphological cues that facilitate antecedent identification in other languages. Unlike Spanish or Italian, where verb inflections constrain possible antecedents [5,6], or Japanese and Korean, where topic-marking particles support reference tracking [7,8], Chinese relies heavily on contextual information such as topic chains [9], semantic inference, and commonsense knowledge [10]. This leads to an inherent asymmetry between surface form and meaning, creating additional pressure on computational models [11]. Although progress has been made using resources such as the Chinese Treebank and OntoNotes [12,13,14], these corpora contain uncontrolled syntactic biases, overrepresent subject-position zeros, and typically feature single, unambiguous antecedents. As a result, models often exploit structural heuristics rather than performing genuine semantic or pragmatic reasoning [15,16], making high reported performance an unreliable indicator of true reasoning ability [17,18].
This limitation motivates the search for evaluation methods that directly test discourse and commonsense reasoning. The Winograd Schema Challenge (WSC) offers a compelling paradigm because it uses carefully controlled minimal pairs that neutralize syntactic form and create structural symmetry between alternatives [18,19]. Importantly, this structural symmetry functions as an evaluation control rather than implying semantic equivalence: even when surface forms are matched, the underlying discourse cues may diverge, requiring models to engage in deliberately asymmetric reasoning [11]. Asymmetric reasoning leverages asymmetry in training and inference to guide models toward deeper, more invariant abstractions. It requires models to resolve tasks where surface symmetry masks underlying discourse or semantic differences, driving higher-level reasoning. This enhances robustness and generalization, especially in complex, real-world tasks like Winograd-style schemas [11,20]. However, existing WSC datasets focus exclusively on overt pronouns and do not address zero pronouns, despite their prevalence in Chinese. While Chinese adaptations of the WSC, such as Mandarinograd [21] and CLUEWSC [22], have been proposed, they are largely translation-based and are not designed to systematically evaluate zero anaphora under symmetric minimal-pair conditions. Consequently, there remains no established benchmark capable of assessing whether computational models can truly resolve Chinese zero anaphora, particularly under conditions designed to reveal asymmetric weaknesses in reasoning.
To bridge this gap, the present study has three main goals. First, it introduces a Winograd-style benchmark for Chinese zero anaphora using symmetric minimal-pair design to isolate semantic, discourse, and commonsense factors. Second, it establishes a prompting-based evaluation framework that employs zero-shot, one-shot, and few-shot settings to assess LLMs without dependence on supervised training or syntactic annotation. Third, it analyzes LLM reasoning using structured explanations and contrastive prompts to identify where symmetric inputs elicit asymmetric interpretations, thereby revealing the strengths, limitations, and systematic error patterns that characterize model behavior in Chinese discourse understanding.
The remainder of the paper is structured as follows. Section 2 reviews prior work on Chinese zero anaphora datasets and computational approaches. Section 3 introduces the benchmark design and human upper-bound estimation. Section 4 describes the experimental setup. Section 5 reports model performance and error overlap analysis. Section 6 discusses broader implications, as well as systematic error patterns, and Section 7 concludes with contributions, limitations, and future directions.

2. Related Work

2.1. Datasets for Chinese Zero Anaphora Resolution

Research on Chinese zero anaphora has relied primarily on treebank-based resources and, more recently, discourse-oriented corpora. Early work extended the Chinese Treebank (CTB) with manual zero-pronoun annotations [12,13], but these corpora are small, news-domain focused, and dominated by subject-position zeros. The Chinese portion of OntoNotes (https://catalog.ldc.upenn.edu/LDC2013T19 (accessed on 12 November 2025)) remains the most widely used benchmark, offering syntactic trees, semantic roles, and coreference chains [16]. Although zero pronouns can be reconstructed from empty categories, they are not manually annotated in context, and many instances exhibit low ambiguity, limiting their usefulness for evaluating deeper discourse and pragmatic reasoning [14,16].
Recent efforts target discourse-level phenomena, such as CDAMR [4] and narrative-focused linguistic corpora [23], but these resources are typically small and not standardized for NLP evaluation. Existing datasets support either a single-resolution paradigm based on gold syntactic cues or a more realistic recovery-and-resolution setting [24], yet few address the latter, and none adopt controlled minimal-pair designs. This is particularly important under conditions that aim to reveal asymmetric weaknesses in reasoning. As a result, current resources do not adequately assess semantic, pragmatic, or commonsense inference, nor do they test the reasoning capabilities of LLMs.
Importantly, zero anaphora in OntoNotes-Chinese typically involves a single salient antecedent and therefore requires recovery rather than deliberate reasoning: zero pronouns arise in naturally occurring discourse, and most instances have only one plausible referent [25]. Antecedents are generally resolved through topic-chain continuity or other discourse-salience cues [23]. In contrast, the Winograd-style benchmark for Chinese zero anaphora (WSC-ZA) that we construct provides symmetric minimal pairs in which multiple antecedents remain equally accessible and pragmatic reasoning must resolve the zero pronoun. See the examples in (1) and (2).
(1)
游客们可以在这里了解香港的电影史,Ø也可以在这里近距离接触自己心目中的明星。
Visitors can explore the history of Hong Kong cinema here and even get close to the movie stars they admire.
(OntoNotes example)
(2)
a 市议员拒绝给示威者颁发游行许可证,因为Ø害怕暴力。
The city councilmen refused the demonstrators a permit because they feared violence.
b 市议员拒绝给示威者颁发游行许可证,因为Ø提倡暴力。
The city councilmen refused the demonstrators a permit because they advocated violence.
(WSC-ZA minimal pairs)
In (1), the omitted subject is pragmatically recoverable as “游客们” (visitors) through topic continuity, with no competing antecedents. In contrast, in (2), two discourse entities (“市议员” city councilmen and “示威者” demonstrators) remain equally accessible, and the zero pronoun must be resolved solely via pragmatic inference triggered by a minimal lexical alternation (“害怕” fear vs. “提倡” advocate), while all other surface cues are kept symmetric. Thus, WSC-ZA operationalizes controlled reasoning difficulty that naturalistic OntoNotes data cannot isolate, motivating the need for such a benchmark. Table 1 summarizes these distinctions, showing how WSC-ZA systematically increases reasoning difficulty through controlled ambiguity.

2.2. Methods for Chinese Zero Anaphora Resolution

Early computational approaches were rule-based or feature-driven, leveraging syntactic paths, binding constraints, and salience heuristics [12,13]. Neural models introduced more flexible context representations, including feed-forward encoders [14], memory-based and hierarchical architectures capturing long-distance cues [15,16,26], and attention-based models jointly representing zero pronouns and candidate antecedents [27]. Another strand treated zero anaphora as a machine reading comprehension task, using cloze-style augmentation [28] or semantic-dependency guidance [24].
Joint or multi-task frameworks further integrated zero-pronoun detection and resolution [29,30], while models encoding inter-sentence context sought to better capture discourse-level relations [31]. More recently, LLMs have renewed interest in the task, with surveys noting its relevance to discourse reasoning [32] and studies revealing persistent limitations in omitted-argument interpretation [33]. Attempts to use LLMs for zero pronoun completion in dialog [34] remain preliminary, and no controlled evaluation of Chinese zero anaphora has been established.
Overall, existing methods rely heavily on sentence-level cues and gold syntactic structures, making it difficult to assess whether systems truly perform semantic or commonsense reasoning. These limitations underscore the need for benchmarks explicitly designed to test reasoning-based zero anaphora understanding, especially those that neutralize superficial syntactic cues and reveal how models handle discourse-level complexity and asymmetric reasoning.

3. Winograd-Style Benchmark Construction

3.1. The Adaptation of WSC to Chinese Zero Anaphora

The WSC evaluates a system’s ability to perform human-like reasoning through sentence pairs that differ minimally in one or two words, creating a controlled ambiguity that alters the antecedent interpretation [35]. This minimal-pair design establishes a structural and lexical symmetry between alternatives, requiring models to rely on semantic interpretation, pragmatic inference, and commonsense knowledge rather than surface heuristics. Prior work shows that such symmetric stimuli are effective for isolating deeper semantic and pragmatic reasoning processes in LLMs [11,36]. Building on this principle, the Winograd paradigm provides an appropriate foundation for constructing a zero-anaphora benchmark that tests whether models can maintain stable reasoning when surface-level symmetry is held constant but discourse-level cues introduce inherent asymmetry.
To adapt WSC to Chinese zero anaphora, we used The New Chinese Collection of Winograd Schemas [37], consisting of 284 sentences (142 schema pairs). All third-person pronouns, including 他 (he), 她 (she), 它 (it), 他们 (they-masc.), 她们 (they-fem.), 它们 (they-neut.) were systematically replaced with zero forms (Ø). Each transformed sentence was then manually reviewed to ensure that the resulting zero anaphora were grammatically acceptable and naturally licensed in Chinese. The schema pair in (3) provides a concrete demonstration of the zero-pronoun transformation.
(3)
a. 奖杯无法放进棕色的箱子里,因为Ø太大了。
The trophy doesn’t fit into the brown suitcase because it’s too large.
b. 奖杯无法放进棕色的箱子里,因为Ø太小了。
The trophy doesn’t fit into the brown suitcase because it’s too small.
As shown in (3), both “奖杯” (trophy) and “箱子” (suitcase) remain grammatically accessible and semantically compatible with the predicate “太大/太小” (“too large”/”too small”), so Ø can naturally refer to either entity. The interpretation depends solely on commonsense reasoning about spatial relations, whether the failure to fit results from the object being too large or the container too small, rather than any syntactic cue introduced by pronoun deletion. This demonstrates that the zero-pronoun transformation preserves the symmetry and reasoning structure of the original Winograd schema.
Sentence Exclusion Criteria
Sentences were removed when the omitted pronoun occurred in positions where zero anaphora is not permitted in Chinese. These included:
(i)
Objects of the ba-construction (e.g., (4)): The NP/pronoun following “ba” must be definite and receives the patient role from the verb-complement complex. Without an overt NP/pronoun, the construction collapses because the verb lacks an argument to which its dispositional semantics can apply [38].
(ii)
Prenominal possessors (e.g., (5)): Chinese lacks independent possessive pronouns. Instead, possessors must be linked to the noun via “de”, forming an inseparable internal modifier. Most prenominal possessors function as obligatory modifiers rather than recoverable arguments [39] (Although kinship terms allow de-drop (e.g., “我妈妈” my mom), these forms remain obligatory attributive modifiers rather than recoverable arguments; therefore they do not license zero anaphora and were excluded).
(iii)
Direct-object pronouns of transitive verbs (e.g., (6)): Because zero objects in Chinese are rare and permitted only when discourse-recoverable [40], direct-object pronouns of transitive verbs, which typically require a definite, interpretable object, cannot be converted to zero anaphora.
(iv)
Other cases that do not fall into the above categories but violate basic licensing conditions for zero anaphora.
(4)
a. 我试着用钥匙开锁,但有人用口香糖堵住了锁眼,我无法把它插进去。
I was trying to open the lock with the key, but someone had filled the keyhole with chewing gum, and I couldn’t get it in.
b. *我试着用钥匙开锁,但有人用口香糖堵住了锁眼,我无法把Ø插进去。
*I was trying to open the lock with the key, but someone had filled the keyhole with chewing gum, and I couldn’t get it in.
(5)
a. 那个男人把男孩放到了他的肩膀上。
The man lifted the boy onto his shoulders.
b. *那个男人把男孩放到了Ø的肩膀上。
*The man lifted the boy onto his shoulders.
(6)
a. 小明雇佣张峰来照顾他。
John hired Bill to take care of him.
b. *小明雇佣张峰来照顾Ø。
*John hired Bill to take care of him.
In total, 44 sentences fell into these categories and were excluded, leaving 240 well-formed sentences (120 schema pairs) in the final benchmark.
In addition to deletion-based filtering, minor adjustments were made to certain sentences to maintain naturalness and preserve minimal-pair symmetry. These adjustments included adding stance adverbials (e.g., 真是 really) to complete evaluative predicates (e.g., (7)); inserting copular elements (e.g., 是 is) to prevent unnatural clause structures (e.g., (8)); rephrasing predicates to ensure that the zero argument remained discourse-recoverable (e.g., (9)); and revising the expression of a few prenominal possessors to make the sentence more natural and the zero anaphora identifiable (e.g., (10)). All modifications were minimal and did not alter the underlying ambiguity structure of the schema. The final dataset (WSC-ZA) consists of 240 well-formed sentences (120 schema pairs), and the number and types of deleted and modified sentences in WSC-ZA are summarized in Table 2.
(7)
a. ?狐狸们晚上钻进来吃小鸡,Ø越来越大胆。
The foxes are getting in at night and attacking the chickens. They have gotten very bold.
b.狐狸们晚上钻进来吃小鸡,Ø真是越来越大胆。
The foxes are getting in at night and attacking the chickens. They have gotten very bold.
(8)
a. ?小明把椅子拉到钢琴前,但Ø是坏的,所以他只好站着。
Sam pulled up a chair to the piano, but it was broken, so he had to stand instead.
b. 小明把椅子拉到钢琴旁,但Ø是坏的,所以他只好站着。
Sam pulled up a chair to the piano, but it was broken, so he had to stand instead.
(9)
a. *老生欺负新生,所以我们惩罚了Ø。
The older students were bullying the younger ones, so they got our punishment.
b. 老生欺负新生,所以Ø得到了我们的惩罚。
The older students were bullying the younger ones, so they got our punishment.
(10)
a. *父亲把熟睡的男孩放进了Ø的摇篮。
The father carried the sleeping boy in his bassinet.
b. 父亲把熟睡的男孩放进了Ø摇篮。
The father carried the sleeping boy in his bassinet.

3.2. Validation of WSC-ZA

To ensure linguistic and discourse validity, we conducted a two-stage validation with five native Chinese speakers. Native-speaker acceptability judgment is a standard methodology in linguistics and NLP evaluation [41,42]. All raters had graduate-level training in linguistics and were not involved in dataset construction. The naturalness-validation task required fine-grained judgments of grammaticality and discourse coherence; therefore, linguistics-trained raters were recruited, following standard practice in acceptability-judgment and experimental syntax research [41,42].
Stage 1: Independent Evaluation
Raters completed an online survey assessing whether deleting pronouns resulted in sentences that remained natural and interpretable. Each item included one version of a Winograd schema with two candidate antecedents. Participants selected the most plausible referent for the omitted element (Ø) and rated the sentence’s naturalness on a five-point Likert scale, following common practice in acceptability-rating studies [43]. Before the main evaluation, raters completed a brief familiarization session with ten practice items. Each participant then reviewed a randomized subset of the 240 sentences to reduce fatigue and avoid repetition; minimal-pair versions were never shown to the same rater, consistent with experimental design principles for ambiguity judgments.
The difficulty levels of the sentences were categorized based on their naturalness ratings, as assessed by human participants during the evaluation. The naturalness-based difficulty levels are as follows:
  • Easy Items: Naturalness rating ≥ 4.0;
  • Moderate Items: 3.0 ≤ Naturalness rating < 4.0;
  • Hard Items: Naturalness rating < 3.0.
The distribution of naturalness ratings across difficulty levels is shown in Table 3, including the mean and standard deviation for each group.
Additionally, a Kruskal–Wallis test was conducted to determine if there were significant differences between difficulty levels. This non-parametric method was chosen as it is suitable for ordinal data with non-normal distribution [44]. The results showed a significant difference (H = 140.23, p = 3.55 × 10⁻³¹), indicating that sentence complexity impacted the perceived naturalness ratings.
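As an illustration of how such a comparison can be computed (not the authors’ analysis script), the following minimal sketch buckets per-item mean naturalness ratings by the thresholds above and runs a Kruskal–Wallis test with SciPy; the flat-list data layout is an assumption.

```python
# Minimal sketch: bucket per-item mean naturalness ratings by the thresholds
# above and run a Kruskal-Wallis test across the resulting difficulty groups.
# Assumes `ratings` is a flat list of per-item mean Likert ratings (1-5);
# illustrative only, not the authors' analysis script.
from scipy.stats import kruskal

def difficulty(rating: float) -> str:
    if rating >= 4.0:
        return "easy"
    if rating >= 3.0:
        return "moderate"
    return "hard"

def kruskal_by_difficulty(ratings):
    groups = {"easy": [], "moderate": [], "hard": []}
    for r in ratings:
        groups[difficulty(r)].append(r)
    samples = [g for g in groups.values() if g]  # drop empty groups, if any
    h_stat, p_value = kruskal(*samples)
    return groups, h_stat, p_value
```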
Stage 2: Agreement and Adjudication
After all responses were collected, inter-rater reliability was assessed using Cohen’s κ, a widely used measure for categorical annotation [45]. The resulting κ = 0.87 reflects near-perfect agreement among raters in identifying the intended referent of zero anaphora, indicating that the task elicits stable human interpretations and is not confounded by uncontrolled ambiguity. Although overall agreement was high, items receiving low naturalness ratings (below 3) or inconsistent referential interpretations were flagged for further review. These were re-examined in a consensus meeting, following standard adjudication procedures in corpus annotation [46], and minor adjustments were made to improve syntactic clarity or discourse coherence.
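Cohen’s κ is defined for pairs of raters; with five raters who each judged overlapping subsets, one common strategy, assumed here because the paper does not specify the aggregation, is to average pairwise κ over the items each rater pair shared. A minimal sketch with scikit-learn:

```python
# Sketch: average pairwise Cohen's kappa over rater pairs, restricted to the
# items both raters judged. The aggregation over five raters is an assumption;
# the paper reports a single kappa without detailing how it was pooled.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(annotations):
    """annotations: dict rater_id -> dict item_id -> chosen referent ('A' or 'B')."""
    kappas = []
    for r1, r2 in combinations(annotations, 2):
        shared = sorted(set(annotations[r1]) & set(annotations[r2]))
        if len(shared) < 2:
            continue  # too few shared items for a stable pairwise estimate
        y1 = [annotations[r1][item] for item in shared]
        y2 = [annotations[r2][item] for item in shared]
        kappas.append(cohen_kappa_score(y1, y2))
    return sum(kappas) / len(kappas)
```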
Through this iterative validation, the final dataset was confirmed to be grammatical, natural, and balanced in difficulty, providing a reliable benchmark for evaluating large language models’ reasoning and reference resolution in Chinese discourse.

3.3. Human Upper Bound

Building on the validation stage, which establishes the linguistic stability of the stimulus set, the next step was to determine how well human speakers perform on the finalized dataset. The human upper-bound evaluation was therefore conducted only after the dataset was confirmed to be natural and interpretable, allowing the resulting accuracy measures to serve as a reliable performance ceiling grounded in a fully validated benchmark.
To establish this performance ceiling for comparison with LLMs, a human evaluation was conducted following widely adopted protocols for Winograd-style and referential ambiguity tasks [47]. Twenty-four native Chinese speakers (aged 19–32) participated voluntarily (while a sample of 24 participants may seem small given the 240 items, similar studies have successfully used comparable sample sizes in Winograd-style ambiguity tasks [21,48]; moreover, our design ensures that each of the 240 sentences was evaluated multiple times by the 24 participants, which maximizes the reliability of the data). The participant pool included both linguistics and non-linguistics majors so that the upper bound reflects general native-speaker interpretability. Participants were assigned to four counterbalanced lists, each containing 60 Winograd-style sentences (240 unique sentences in total), following practices common in psycholinguistic experiments [42]. Each sentence appeared in only one version (A or B) per participant to avoid learning or memory effects, and item order was pseudo-randomized. All participants completed the task individually under identical conditions.
Participants were instructed to select which of the two candidate noun phrases the zero anaphor (Ø) most plausibly referred to, consistent with standard procedures in human evaluation of coreference and anaphora [49]. Accuracy was computed against author-defined gold labels. Human performance was assessed through subject-level accuracy, item-level accuracy, and overall micro accuracy, following common evaluation frameworks in WSC and coreference research [50].
To further assess the reliability of human performance, Fleiss’ Kappa was calculated to evaluate the consistency among participants [51]. The resulting κ = 0.803, with a 95% confidence interval (CI) of [0.758, 0.846], indicates good agreement among the participants. This suggests that the evaluation task was stable and that the naturalness of the stimuli was not confounded by uncontrolled ambiguity. The p-value of 0.493 further supports the stability of the observed agreement. This high consistency ensures that the human upper-bound measures provide a reliable ceiling for model performance.
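The sketch below shows how the three accuracy measures and Fleiss’ κ can be derived from a long-format response table with pandas and statsmodels; the column names are hypothetical, and the code assumes the counterbalanced design above, in which every item receives the same number of ratings.

```python
# Sketch: subject-, item-, and micro-level accuracy plus Fleiss' kappa from a
# long-format response table. Column names (participant, item, choice, gold)
# are hypothetical; requires pandas and statsmodels.
import pandas as pd
from statsmodels.stats.inter_rater import fleiss_kappa

def human_upper_bound(df: pd.DataFrame):
    df = df.assign(correct=(df["choice"] == df["gold"]).astype(int))
    subject_acc = df.groupby("participant")["correct"].mean().mean()  # mean per-participant accuracy
    item_acc = df.groupby("item")["correct"].mean().mean()            # mean per-item accuracy
    micro_acc = df["correct"].mean()                                  # pooled over all responses

    # Fleiss' kappa expects an items x categories count table; each item must
    # have the same number of ratings, as in the counterbalanced design.
    counts = df.groupby(["item", "choice"]).size().unstack(fill_value=0)
    kappa = fleiss_kappa(counts.to_numpy(), method="fleiss")
    return subject_acc, item_acc, micro_acc, kappa
```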
As shown in Table 4, the mean subject-level accuracy across participants was 0.926, with narrow confidence intervals, indicating stable performance and low inter-individual variability. Item-level accuracy also averaged 0.926, but with greater variability, reflecting meaningful differences in discourse complexity and commonsense reasoning demands, as noted in prior WSC analyses.
Aggregating all 1440 responses yielded a micro accuracy of 0.926, representing the overall human upper bound for this benchmark. The consistency of these results confirms both the naturalness of the stimuli and the balanced difficulty across the four experimental lists. The derived human upper bound provides an interpretable and empirically grounded reference point for evaluating symmetry and asymmetry in LLMs’ discourse-level reasoning and zero anaphora resolution performance in Chinese.

4. Experimental Setup

This study comprises two complementary experiments based on the WSC-ZA benchmark. The first experiment evaluates the accuracy of LLMs in resolving zero anaphora in Chinese Winograd-style sentences. The second experiment provides a qualitative chain-of-thought (CoT) error analysis to diagnose the linguistic and reasoning factors behind incorrect predictions, revealing where model behavior becomes asymmetric despite symmetric input design.

4.1. First Experiment: Quantitative Evaluation

4.1.1. Models Evaluated

Three representative LLM families were selected to reflect international, Chinese-adapted, and open-source systems: GPT-4 (OpenAI), ChatGLM-4 (ZhipuAI), and LLaMA-3 (Meta). GPT-4 is a general-purpose, international model known for its strong cross-lingual reasoning performance, with access provided via the OpenAI API. ChatGLM-4 is optimized for Chinese syntax and discourse patterns, accessible both through an API and locally [52]. LLaMA-3, an open-source model, is recognized for its transparent, research-friendly architecture, with local access [53]. Deterministic decoding (temperature = 0) was used for all models, following standard practice in controlled LLM evaluation [54,55]. Full model specifications are provided in Appendix A.
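As an illustration of the inference setup rather than the authors’ exact evaluation harness, the following sketch issues a single forced-choice query to GPT-4 with deterministic decoding via the OpenAI Python client; the simple answer parsing is an assumption, and ChatGLM-4 and LLaMA-3 were queried through their own API or local interfaces with the same decoding settings.

```python
# Sketch: one forced-choice query with deterministic decoding (temperature = 0)
# through the OpenAI Python client. The prompt is built from the Appendix B
# templates; the minimal answer parsing below is an illustrative assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic decoding
        max_tokens=128,
    )
    text = response.choices[0].message.content.strip()
    # Expected output format: "Answer: A" or "Answer: B"
    if "Answer: A" in text:
        return "A"
    if "Answer: B" in text:
        return "B"
    return text  # fall back to the raw output for manual inspection
```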

4.1.2. Prompting Paradigms

Three prompting paradigms were tested: zero-shot, one-shot, and few-shot prompting. These settings follow well-established methods in LLM evaluation regarding anaphora resolution [54,56]. Zero-shot prompting assesses a model’s intrinsic reasoning ability without exposure to any examples [55,57]. One-shot prompting evaluates whether a single demonstration can guide the model toward the correct referential pattern [58]. Few-shot prompting tests the degree to which multiple examples support in-context learning and improve generalization to new zero-anaphora cases [59]. Full templates and examples are provided in Appendix B.
  • Zero-shot prompting: Test cases were presented as prompts without providing the model with any prior examples.
  • One-shot prompting: Test cases were preceded by one fixed example illustrating a zero anaphora instance and its correct resolution.
  • Few-shot prompting: Test cases were preceded by three input–output examples to show the model how the task works.
To address potential concerns about template length and context-window limits, we estimated token counts for all prompting templates using a standard tokenizer. For a representative 40-character Chinese target sentence, the approximate prompt lengths were: zero-shot ≈190 tokens, one-shot ≈280 tokens, and few-shot ≈400 tokens. All values fall well below the context-window capacities of GPT-4, ChatGLM-4.6, and LLaMA-3, and no truncation or overflow occurred during inference. Importantly, all templates, including the demonstrations in the one-shot and few-shot conditions, were held constant across items, ensuring that differences across prompting modes arise solely from the number of in-context examples rather than unintended prompt-length asymmetries. These three prompting modes allow us to compare model performance symmetrically across identical benchmark items, varying only the information provided in the prompt.
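Such token counts can be estimated as in the sketch below, which uses the tiktoken tokenizer as one possible stand-in for “a standard tokenizer”; exact figures depend on the tokenizer and on the Chinese target sentence, so the numbers above should be read as approximations.

```python
# Sketch: estimating prompt length in tokens for a filled template. The template
# placeholders ({sentence}, {optionA}, {optionB}) follow Appendix B; tiktoken is
# used here as an example tokenizer, so counts are approximate.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def prompt_tokens(template: str, sentence: str, option_a: str, option_b: str) -> int:
    prompt = template.format(sentence=sentence, optionA=option_a, optionB=option_b)
    return len(enc.encode(prompt))
```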

4.1.3. Evaluation Metrics

The dataset WSC-ZA used in this study consists of 240 sentences, with a balanced distribution of gold labels: 120 sentences (50%) have the correct answer as A, and 120 sentences (50%) have the correct answer as B. This balanced distribution ensures a fair evaluation of the model’s performance. Accuracy-based metrics were employed as the primary evaluation measures because each WSC-ZA sentence contains one zero anaphor and two candidate antecedents, mirroring standard WSC evaluation practice [47].
  • Micro Accuracy
Definition: The proportion of correct predictions out of 240.
Purpose: Overall model performance.
  • Item-level Accuracy
Definition: The proportion of correct predictions for each item, aggregated across models/modes.
Purpose: Detect hard/ambiguous sentences.
  • Mode-level Accuracy
Definition: Accuracy under zero-shot, one-shot, few-shot prompting conditions.
Purpose: Measure benefits of in-context learning.
In addition to accuracy-based metrics, we also report macro-averaged Precision, Recall, and F1-Score to provide a more complete evaluation, since macro-averaging weights classes (or predefined item subsets) equally and supports symmetric comparisons across subsets (e.g., commonsense types) [60]. These metrics are defined as follows, and a short computation sketch follows the definitions:
  • Precision
Definition: Precision is computed separately for each candidate label (A and B) as the proportion of correctly predicted instances of that label out of all instances predicted as that label. The macro-averaged Precision is the average of these Precision scores across both labels.
$\mathrm{Precision}_{\mathrm{macro}} = \frac{1}{l}\sum_{i=1}^{l}\frac{TP_i}{TP_i + FP_i}$
Purpose: Measures how accurately the model selects antecedents for each candidate label, preventing bias toward either option in a forced-choice setting.
  • Recall
Definition: Recall is computed separately for each candidate label (A and B) as the proportion of correctly predicted instances of that label out of all true instances of that label. The macro-averaged Recall is the average of these Recall scores across both labels.
$\mathrm{Recall}_{\mathrm{macro}} = \frac{1}{l}\sum_{i=1}^{l}\frac{TP_i}{TP_i + FN_i}$
Purpose: Indicates how well the model identifies the correct antecedent for each candidate option, even if it selects incorrect ones.
  • F1-Score
Definition: The F1-Score is the harmonic mean of Precision and Recall for each candidate label. The macro-averaged F1-Score is the average of these F1-Scores across both labels.
$F1_{\mathrm{macro}} = \frac{2 \cdot \mathrm{Precision}_{\mathrm{macro}} \cdot \mathrm{Recall}_{\mathrm{macro}}}{\mathrm{Precision}_{\mathrm{macro}} + \mathrm{Recall}_{\mathrm{macro}}}$
Purpose: Provides a balanced measure of the model’s performance, accounting for both precision and recall, making it suitable for evaluating forced-choice disambiguation tasks.
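A minimal sketch of these computations for the two-label forced-choice setting is given below, implementing the macro-averaged formulas exactly as defined; it is illustrative rather than the authors’ evaluation code.

```python
# Sketch: accuracy plus macro-averaged Precision, Recall, and F1 for the
# two-label (A/B) forced-choice setting, implemented from the formulas above.
def macro_metrics(gold, pred, labels=("A", "B")):
    """gold, pred: equal-length sequences of 'A'/'B' labels."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    precisions, recalls = [], []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if p == lab and g == lab)
        fp = sum(1 for g, p in zip(gold, pred) if p == lab and g != lab)
        fn = sum(1 for g, p in zip(gold, pred) if p != lab and g == lab)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision_macro = sum(precisions) / len(labels)
    recall_macro = sum(recalls) / len(labels)
    denom = precision_macro + recall_macro
    f1_macro = 2 * precision_macro * recall_macro / denom if denom else 0.0
    return {"accuracy": accuracy, "precision_macro": precision_macro,
            "recall_macro": recall_macro, "f1_macro": f1_macro}
```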

4.2. Second Experiment: Chain-of-Thought Error Analysis

Chain-of-Thought (CoT) prompting is a widely used method for interpreting LLM reasoning processes [61,62]. By re-prompting the model to articulate its step-by-step reasoning, this method makes it possible to diagnose whether an incorrect prediction stems from grammatical misanalysis, missing commonsense knowledge, or failures in discourse tracking.
The workflow is summarized below; a code sketch of the trace-collection step follows the list:
  • Identify errors: Extract all items where the model’s prediction differed from the gold answer.
  • Apply CoT diagnostic prompt: Re-run each misclassified item using a standardized CoT template, which instructs the model to re-evaluate the sentence, compare the antecedent candidates, and explain why the previous answer was incorrect. The full CoT template is provided in Appendix C.
  • Generate explanations: Obtain step-by-step reasoning traces from all three models.
  • Classify error types: Manually code each explanation to determine the source of error.
  • Analyze patterns: Summarize error distributions and examine representative cases.
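The trace-collection step can be automated as in the sketch below; query_model is a hypothetical wrapper around a model API (for example, the GPT-4 sketch in Section 4.1.1), the embedded template mirrors Appendix C, and the subsequent error-type coding was performed manually.

```python
# Sketch: re-prompt every misclassified item with the Appendix C CoT template
# and collect the reasoning traces. `query_model` is a hypothetical callable
# that sends a prompt to one of the evaluated models and returns its output.
COT_TEMPLATE = (
    "You are a reasoning expert in Chinese discourse analysis.\n"
    "The model made an incorrect prediction before. Now re-think the problem "
    "carefully step by step.\n"
    "Sentence: {sentence}\n"
    "Candidates:\nA. {optionA}\nB. {optionB}\n"
    "Please reason step-by-step to decide who the zero pronoun (Ø) refers to "
    "and briefly explain why the previous model prediction was wrong.\n"
    "Output format:\n"
    "1. Step-by-step reasoning\n"
    "2. Correct answer (A/B)\n"
    "3. Short explanation in English"
)

def collect_cot_traces(items, predictions, gold, query_model):
    """items: dict item_id -> {'sentence': ..., 'optionA': ..., 'optionB': ...}."""
    traces = {}
    for item_id, fields in items.items():
        if predictions[item_id] == gold[item_id]:
            continue  # only misclassified items are re-prompted
        traces[item_id] = query_model(COT_TEMPLATE.format(**fields))
    return traces
```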

5. Results

5.1. Model Performance Comparison

We evaluated three large language models (GPT-4, ChatGLM-4, and LLaMA-3) on 240 Chinese zero-anaphora resolution items across three prompting conditions: zero-shot, one-shot, and few-shot. The results are summarized in Table 5 and Table 6. Table 5 provides the overall accuracy for each model, while Table 6 breaks down the models’ performance in terms of macro-averaged Precision, Recall, and F1-Score.
ChatGLM-4 shows the highest overall accuracy across all conditions, with 0.921 in the zero-shot condition and 0.929 in both one-shot and few-shot conditions (Table 5). This strong and consistent performance is further supported by its high macro-averaged Precision (0.913–0.933), Recall (0.913–0.929), and F1-Score (0.912–0.929) in Table 6, indicating that ChatGLM-4 is highly responsive to in-context cues and well-adapted to handling Chinese zero anaphora.
GPT-4 achieves moderate accuracy with 0.883 in zero-shot, slightly improving to 0.892 in one-shot, but returning to 0.883 in few-shot (Table 5). Despite this stable accuracy, its macro-averaged Precision (0.883–0.892) and Recall (0.883–0.892) in Table 6 reveal limited improvement across conditions, suggesting that GPT-4 is already well-aligned with the structural properties of Chinese zero anaphora resolution. Additional examples do not significantly enhance its performance, reflecting a relatively stable but not fully symmetric sensitivity to prompting.
LLaMA-3 shows the most variation in performance. Its accuracy starts at 0.849 in zero-shot, drops to 0.792 in one-shot, and partially recovers to 0.824 in few-shot (Table 5). This drop is reflected in its lower macro-averaged Precision (0.779–0.850) and Recall (0.779–0.846) values in Table 6, suggesting that LLaMA-3 is more sensitive to prompt formulation and may rely more on surface-level heuristics, leading to inconsistent responses.
In summary, ChatGLM-4 outperforms GPT-4 and LLaMA-3 in both overall accuracy and detailed performance metrics, showing the highest stability and reasoning ability across all conditions. GPT-4 performs reasonably well but shows minimal improvement with additional examples, while LLaMA-3 performs the weakest, especially in one-shot and few-shot conditions, where its performance declines further.

5.2. Comparison with Human Upper Bound

The Human Upper Bound was established through native speaker evaluation, yielding an accuracy of 0.926 for overall performance, as detailed in Section 3.3. When compared to this upper bound, ChatGLM-4 performed the closest, with its best result of 0.929 under both the one-shot and few-shot conditions. GPT-4 achieved a peak accuracy of 0.892, while LLaMA-3 lagged with a maximum accuracy of 0.824.
This comparison highlights the substantial gap between the best-performing model, ChatGLM-4, and human reasoning capabilities. Despite some progress, LLMs still face significant challenges in fully capturing the semantic and discourse-level reasoning required for accurate interpretation of zero anaphora, particularly in more complex contexts that require deeper commonsense knowledge.

5.3. Item-Level Accuracy Results

To assess the overall robustness of the models, we analyzed their performance across all 240 test items, considering both item-level accuracy and overall consistency. We categorized the items into easy, moderate, and hard difficulty levels based on the accuracy achieved by human participants in the human upper bound evaluation. Performance-based difficulty (accuracy) is used here instead of the naturalness-based difficulty described in Section 3.2, because it more reliably assesses model performance once sentences have passed a naturalness threshold [63]. This transition ensures that only validated sentences proceed to performance testing, in line with standard NLP evaluation practices [64]. The performance-based difficulty levels are as follows:
  • Easy Items: Accuracy ≥ 0.85;
  • Moderate Items: 0.60 ≤ Accuracy < 0.85;
  • Hard Items: Accuracy < 0.60.

5.3.1. Performance by Difficulty Level

The bar chart in Figure 1 aggregates the models’ performance across these three difficulty levels. It shows the average accuracy for each model within each category, illustrating how well they performed on tasks of varying complexity:
  • GPT-4 shows strong performance on easy items (0.956), with a slight decline to 0.809 on moderate items. However, it experiences a sharp decline in performance on hard items, with accuracy dropping to 0.139. This significant drop indicates that GPT-4 struggles the most with more complex cases, particularly on hard items.
  • ChatGLM-4 performs best on easy items (0.969), with a slight drop to 0.887 on moderate items. Similarly to GPT-4, ChatGLM-4 also shows a sharp decline on hard items, with accuracy dropping to 0.444. While the decline from easy to hard is steep, it is less pronounced than GPT-4’s drop on hard items.
  • LLaMA-3 demonstrates the highest accuracy on both moderate items (0.681) and hard items (0.528) compared to the other models, suggesting that it handles more complex cases better. However, its performance on easy items (0.878) is lower than that of GPT-4 and ChatGLM-4, indicating that it is less effective on simpler tasks.
Figure 1. Distribution of item-level accuracy across models.
Overall, all three models displayed a decline from easy to hard items, but in different asymmetric patterns, indicating that item difficulty interacts with model-specific reasoning strategies. However, a Kruskal–Wallis test revealed that the performance differences across the three difficulty levels (easy, moderate, and hard) did not reach statistical significance (p = 0.202), suggesting that, although performance visibly varies, the variation likely reflects the inherent challenges of more complex tasks. The 95% confidence intervals for each difficulty level further confirm the stability of the models’ performance, with no significant shifts in accuracy across difficulty levels for the models tested.

5.3.2. Item-Level Performance Variations

The heatmap in Figure 2 illustrates individual item-level accuracy across all 240 test items. The blue-white-red color gradient in the color bar represents “Item-Level Accuracy” from 0 to 1, with blue shades indicating lower accuracy (close to 0.0) and red shades indicating higher accuracy (close to 1.0). Overall, the overlap of blue patches suggests that the models make similar errors on specific items, particularly those requiring advanced reasoning or greater ambiguity.
  • ChatGLM-4 is the most consistent model, demonstrating high accuracy across most items, though it still encounters occasional challenges on particularly difficult items, as reflected in sporadic blue patches.
  • GPT-4 performs well on easy items but exhibits variations on harder items. Its error distribution is similar to LLaMA-3, suggesting both models struggle with more complex cases.
  • LLaMA-3 displays more variability, particularly on hard items, where it struggles the most, as indicated by frequent blue patches. It has the lowest overall accuracy among the models, with more mistakes on challenging items compared to ChatGLM-4 and GPT-4.
To assess the statistical significance of the performance differences between the models, we conducted paired t-tests across the three models; the results are summarized in Table 7. With p-values below 0.05, the tests indicate statistically significant differences, especially between LLaMA-3 and the other two models (GPT-4 and ChatGLM-4), confirming that the observed variations in accuracy are meaningful rather than due to random chance. These differences also highlight that LLaMA-3 outperforms the other models on items requiring more complex reasoning, suggesting that it is better equipped to engage in asymmetric reasoning.
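For reference, a paired comparison of this kind can be computed as in the sketch below, which runs scipy.stats.ttest_rel on aligned item-level accuracy vectors for each model pair; the exact aggregation behind the per-item values (e.g., averaging over prompting modes) is an assumption.

```python
# Sketch: pairwise paired t-tests on item-level accuracy between models.
# `item_acc` maps each model name to a list of per-item accuracies over the
# same 240 items in the same order; aggregation choices are assumptions.
from itertools import combinations
from scipy.stats import ttest_rel

def pairwise_item_ttests(item_acc):
    results = {}
    for m1, m2 in combinations(item_acc, 2):
        t_stat, p_value = ttest_rel(item_acc[m1], item_acc[m2])
        results[(m1, m2)] = (t_stat, p_value)
    return results
```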
In summary, while all models show strong performance on easy items, they experience a decline in accuracy as task difficulty increases. LLaMA-3 demonstrates the highest robustness on moderate and hard items, indicating it is better suited for complex tasks. However, it performs less effectively on easy items. These findings underscore the effectiveness of the WSC-ZA benchmark in testing the models’ ability to resolve zero anaphora across both easy and difficult contexts, providing insight into their relative strengths and weaknesses.

5.4. Error Overlap Analysis

The Venn diagram in Figure 3 illustrates the overlap in errors made by the three models, showing both the total number of misinterpreted sentences for each model and the extent to which these errors are shared.
The center of the diagram shows 19 sentences misinterpreted by all three models, representing a consistently difficult subset of items that posed challenges regardless of model architecture or prompting condition. The intersections between two circles correspond to sentences incorrectly classified by exactly two models. For example, GPT-4 and LLaMA-3 share 37 errors, indicating that these models converge on similar failure points for a substantial portion of the data. Such overlaps suggest that certain discourse-level ambiguities or structural complexities systematically challenge multiple models. The non-overlapping regions represent unique errors, such as those made exclusively by GPT-4 (7 sentences). These idiosyncratic errors may reflect model-specific reasoning tendencies or sensitivity to particular linguistic cues.
Overall, this analysis highlights both the shared weaknesses and model-specific limitations that are not apparent from aggregate accuracy metrics alone. A detailed qualitative analysis of these 19 universally misinterpreted sentences is provided in Section 6.4, where we discuss the error types and the underlying discourse and pragmatic factors contributing to model failures.
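The region counts in Figure 3 can be reproduced from per-model sets of misinterpreted item IDs with simple set operations, as in the following sketch; the item-ID representation is assumed.

```python
# Sketch: Venn-style error overlap from per-model error sets. `errors` maps a
# model name to the set of item IDs it misinterpreted; the triple intersection
# corresponds to the 19 universally misinterpreted sentences in Figure 3.
def error_overlap(errors):
    gpt, glm, llama = errors["GPT-4"], errors["ChatGLM-4"], errors["LLaMA-3"]
    return {
        "all_three": gpt & glm & llama,
        "gpt_and_glm_only": (gpt & glm) - llama,
        "gpt_and_llama_only": (gpt & llama) - glm,
        "glm_and_llama_only": (glm & llama) - gpt,
        "gpt_only": gpt - glm - llama,
        "glm_only": glm - gpt - llama,
        "llama_only": llama - gpt - glm,
    }
```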

6. Discussion

6.1. Model-Level Insights and Implications

The comparative performance patterns across three models reveal important characteristics of LLM sensitivity to Chinese zero anaphora, highlighting distinct patterns in their reasoning abilities and performance under varying task complexities.
ChatGLM-4 consistently outperforms the other models in both overall accuracy and detailed metrics such as macro-averaged Precision, Recall, and F1-Score. This aligns with previous findings showing that models trained with Chinese-specific corpora often capture topic-driven patterns more effectively [65]. However, its sharp decline on moderate and hard items indicates persistent weaknesses in discourse reasoning. This reflects an asymmetry between strong performance under symmetric, surface-aligned conditions and weaker performance when deeper discourse cues are required. Such findings are consistent with prior work showing that zero anaphora resolution requires the deep integration of syntactic, semantic, and discourse-level cues, an ability that remains limited in current LLMs [66]. Current LLMs, while capable of leveraging surface-level lexical and topic patterns, still exhibit fundamental limitations in tracking discourse entities and modeling their accessibility across complex syntactic and semantic contexts [67].
GPT-4 showed stable performance across prompting conditions, with only marginal improvement from in-context learning. This limited improvement may reflect insensitivity to demonstrations rather than simple stability, suggesting that GPT-4 may already be near its performance ceiling under zero-shot or one-shot prompting. This observation aligns with recent studies reporting that demonstration-based prompting yields diminishing returns when the model already possesses strong pretrained representations [54,56]. Its seven unique errors also align with observations that general-purpose LLMs may apply broader semantic or world-knowledge priors that sometimes override context-anchored interpretations [68], leading to asymmetric interpretations even though the input conditions were controlled for structural symmetry.
LLaMA-3 demonstrated the highest variability and sensitivity to prompt format, consistent with reports that open-source models display greater instability across prompting conditions [69]. Its stronger performance on moderate and hard items highlights its potential in handling more complex tasks, despite weaker performance on straightforward syntactic cues. Similar patterns have been noted in multilingual LLMs where deeper embeddings sometimes support harder inference cases but introduce noise for simpler ones [32]. This again shows model-dependent asymmetry in how structurally symmetric WSC-style stimuli reveal differences in reasoning behavior.

6.2. Interpreting the Gap Between LLMs and the Human Upper Bound

Despite achieving human-level accuracy on easy items, all three models fell well below the human upper bound on more complex items, echoing long-standing findings that zero anaphora uniquely tests topic continuity, discourse coherence, and pragmatic inference [9,40]. Humans maintain discourse topics through global coherence relations, while LLMs rely primarily on surface distributional cues, a limitation widely discussed in recent NLP research [70].
Zero anaphora resolution often relies on non-syntactic factors, including pragmatic inference, discourse coherence, semantic role persistence, and commonsense knowledge [18]. These factors introduce asymmetric reasoning demands that extend beyond structural cues alone. The substantial human–model gap observed here suggests that LLM improvements in syntax and semantics alone are insufficient without explicit modeling of implicit cues such as pragmatic inference and commonsense reasoning [71]. This gap also aligns with broader findings in multilingual and contrastive learning research: even when inputs are structurally symmetric, LLMs may still exhibit markedly asymmetric reasoning [11,36]. This reinforces the value of symmetric minimal-pair benchmarks, which reveal reasoning weaknesses that remain hidden in conventional corpora and help explain why LLM–human gaps persist on complex discourse phenomena.

6.3. Effects of Item Difficulty on Model Performance

The sharp performance declines across moderate and hard items confirm prior findings that discourse-level reasoning is a core limiting factor for LLMs [4]. Zero anaphora requires maintaining referential links across clause boundaries, a feature emphasized in functional descriptions of Chinese as a topic-prominent language [2,40]. These results align with work showing that LLMs over-rely on local dependency heuristics and struggle when these heuristics misalign with pragmatic and discourse-level interpretations [37].
LLaMA-3’s relative strength on hard items mirrors the observation that larger contextual embeddings may help track broader discourse arcs, though inconsistently [53]. This performance pattern suggests that LLaMA-3 adopts a more cautious approach when faced with higher uncertainty, particularly on more difficult tasks. Unlike GPT-4 and ChatGLM-4, which show a sharp decline in accuracy on harder items, LLaMA-3 maintains relatively stable performance on moderate and hard tasks. This may reflect its tendency to make more conservative choices in ambiguous situations, which enables it to perform better on complex tasks, but at the cost of lower accuracy on simpler ones [53]. Meanwhile, ChatGLM-4’s superior performance on easy items but inconsistent performance on more complex cases illustrates a form of reasoning asymmetry, where models excel under symmetric, surface-based conditions yet falter when deeper inferential reasoning is required [72].

6.4. Systematic Error Patterns

To explain why all three models failed on the same 19 sentences, we developed a seven-dimensional error taxonomy based on linguistic and computational research [2,66,73,74]. The taxonomy was derived from systematic coding of CoT explanations generated in Experiment 2, pinpointing specific linguistic and reasoning failures behind each error.
As shown in Table 8, the most frequent error type was surface-cue misguidance. The models were often misled by linear proximity, lexical salience, or subject biases, indicating an overreliance on shallow heuristics [18]. Semantic role conflict was the second most common error, with models struggling to distinguish roles such as agent vs. experiencer and container vs. content. This echoes long-standing challenges in Chinese anaphora resolution [75,76]. Pragmatic and commonsense reasoning errors also played a significant role, requiring inference in areas like spatial, temporal, and psychological reasoning [68,77]. Additionally, discourse-structural issues, including topic chain maintenance and logical connective misinterpretation, contributed to errors, consistent with previous discourse relation processing challenges [37,78]. Long-distance dependencies and syntactic ambiguity also led to misinterpretations, highlighting persistent challenges for LLMs [79].
To illustrate how the seven-dimensional taxonomy aligns with model reasoning, we provide a representative example in which the CoT trace reveals a clear misinterpretation of a logical connective. Consider the sentence in (11).
(11)
建国得到了免费的歌剧票,但他给了李广,因为Ø特别想看。
George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it. Who was particularly eager to see it?
A 建国 George
B 李广 Eric
Correct Answer: B 李广 Eric
All three models predicted the wrong antecedent A. The CoT trace produced by GPT-4 illustrates how this error arises:
(1) Step-by-step reasoning:
“建国得到了票” is the initial event.
“他给了李广” is interpreted as a generous act by 建国.
The model incorrectly assumes that “因为Ø特别想看” must refer to 建国’s motivation for giving away the ticket.
It thus concludes that Ø = 建国.
(2) Model output:
Answer: A
(3) Short explanation:
The model overlooks the scope of the causal connective 因为 and fails to link it to the recipient 李广 rather than the giver 建国.
In effect, the CoT reconstructs the causal relation as “建国得到了票,因为建国想看,” failing to recognize that the clause actually encodes 李广’s desire as the motivation for transferring the ticket. This reflects a recurring pattern among the 19 hard-error items: when but–because structures appear in sequence, LLMs often assign causal scope to the wrong argument. This example also shows how a specific error type (misinterpretation of logical connectives) maps directly onto identifiable failures in the model’s step-by-step reasoning. The correspondence between the seven error dimensions and the CoT traces indicates that the taxonomy captures not only observable errors but also the underlying inference mechanisms driving these failures.
These error patterns demonstrate that the 19 misinterpreted sentences intersect syntax, semantics, and pragmatics, which are all critical to Chinese zero-anaphora resolution. In these cases, LLMs struggle most on items where deeper discourse cues are required, despite the benchmark’s structurally symmetric design. Taken together, these error types confirm that WSC-ZA’s symmetric construction does not trivialize the task; rather, it makes it possible to pinpoint the loci of asymmetric reasoning failures in LLMs, in line with recent symmetry/asymmetry analyses in representation learning and relation extraction [11,20,36]. This underscores the need for models with more robust discourse reasoning capabilities.

7. Conclusions

This study introduced WSC-ZA, a benchmark that adapts Winograd schemas to zero anaphora based on the linguistic characteristics of Chinese. The key innovation of this benchmark lies in its integration of several components: the systematic use of a symmetric minimal-pair design, the adaptation of Winograd schemas to Chinese zero pronouns, native-speaker validation, and the establishment of a robust human upper bound. Using this resource, which isolates discourse-level reasoning from superficial syntactic cues, we evaluated three representative LLMs. While all models performed well on straightforward cases, none approach human reliability on items that require integrating topic continuity, semantic-role interpretation, and pragmatic inference. Although these patterns suggest challenges for current models in maintaining stable discourse-level reasoning in a topic-prominent language such as Chinese, we acknowledge that factors such as constructional rarity, variation in naturalness, or sensitivity to prompt design may also contribute to the observed variability.
Beyond providing a new evaluation resource, the study contributes a seven-dimensional error taxonomy that captures core linguistic and cognitive factors driving model failures. The taxonomy reveals that misleading surface cues, semantic-role ambiguity, and deficits in commonsense-based pragmatic reasoning are the dominant sources of error, alongside challenges involving topic-chain maintenance, logical connectives, long-distance dependencies, and syntactic ambiguity. These findings broadly support the view that next-token prediction alone may be insufficient for robust zero-anaphora resolution, which typically requires some degree of discourse tracking and pragmatic inference, though further analysis is needed to determine whether the observed limitations reflect genuine reasoning challenges or artifacts of task design or prompting.
Several limitations warrant future investigation. The benchmark reflects controlled sentence-level contexts and a binary-choice format; broader discourse settings and richer anaphoric phenomena remain to be explored. The evaluation also focused on three model families and prompting-based inference, leaving open how training-time interventions, such as fine-tuning on discourse data, or discourse-aware architectures might alter performance. Given the potential influence of stimulus design and input factors, the present results should be interpreted with caution. Future work may expand WSC-ZA to longer discourse contexts and incorporate multimodal or knowledge-grounded cues, helping clarify whether the asymmetric patterns observed here stem from deeper reasoning challenges or from interactions between models and the evaluation format. Similarly, fine-tuning LLMs on discourse data may improve performance on discourse reasoning tasks, offering a promising direction for enhancing model sensitivity to complex discourse structures. Additionally, integrating neuro-symbolic or discourse-structured modeling approaches may help capture topic structure, event coherence, and pragmatic inference more reliably.
Overall, the findings highlight both the promise and the current limitations of LLMs for Chinese zero anaphora. While progress in surface-aligned settings is evident, achieving human-level stability under conditions designed to test symmetric and asymmetric reasoning remains an open challenge. Nevertheless, WSC-ZA provides a foundation for future work toward models capable of more principled discourse reasoning in Chinese and other pro-drop languages.

Author Contributions

Conceptualization, S.C.; methodology, Y.Q. and Z.L.; validation, X.C.; data curation, Z.L.; writing—original draft preparation, Z.L. and Y.Q.; writing—review and editing, S.C.; visualization, X.C.; supervision, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shanghai Municipal Fund for Philosophy and Social Sciences under grant number 2022EYY001; and the Shanghai Pujiang Program under grant number 22PJC016.

Data Availability Statement

The data supporting the reported results can be made available upon reasonable request. Please contact the corresponding author for further details.

Acknowledgments

The authors would like to thank the reviewers for their constructive and insightful comments, which have substantially improved the clarity and quality of this manuscript. We also acknowledge the use of the GPT-5.2 tool for proofreading the entire manuscript during the preparation of this study. The authors have reviewed and edited the AI-generated output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Model Specifications Table
This appendix provides full model specifications, including parameter sizes and implementation details, to ensure reproducibility.
Table A1. Detailed model specifications.
Model | Variant/Parameters | Training Background | Inference Setup
GPT-4 | Not publicly disclosed (estimated >1T parameters, mixture-of-experts) | Multilingual, broad-domain, large-scale web + curated corpora | OpenAI API; temperature = 0; max_tokens = 128
ChatGLM-4 | 66B parameters | Chinese-centric; dialog + web text; optimized for Chinese syntax and semantics | API/local; temperature = 0; top_p = 0.9; max_tokens = 128
LLaMA-3 | 70B open-source variants | Trained on 15T multilingual tokens | Local HF or llama.cpp; temperature = 0
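
For concreteness, the following minimal Python sketch shows how the deterministic decoding settings in Table A1 could be applied through the OpenAI chat-completions client. The model identifier, environment variable, and helper name are illustrative assumptions, not the exact scripts used in this study; ChatGLM-4 and LLaMA-3 would be queried through their own APIs or a local runtime with the analogous settings.

```python
# Illustrative sketch only: querying GPT-4 with the decoding settings of Table A1.
# Assumes the OpenAI Python client (openai >= 1.0) and an OPENAI_API_KEY
# environment variable; the model identifier "gpt-4" is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt4(prompt: str) -> str:
    """Send one WSC-ZA prompt and return the raw model reply."""
    response = client.chat.completions.create(
        model="gpt-4",      # placeholder identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,      # deterministic decoding, as in Table A1
        max_tokens=128,
    )
    return response.choices[0].message.content.strip()
```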

Appendix B

Full Prompt Templates and Examples
This appendix provides the complete templates for all prompting paradigms, followed by representative examples.
  • Zero-Shot Template
You are a coreference resolution expert.
Determine who the zero pronoun (Ø) refers to in the following sentence.
Sentence: {sentence}
Options:
A. {optionA}
B. {optionB}
Answer only with the following format:
Answer: A or B
  • One-Shot Template
You are a coreference resolution expert.
First, see one labeled example. Then resolve the target sentence.
[One-shot Example]
Sentence: 老生欺负新生,所以Ø得到了我们的帮助。谁得到了我们的帮助?
Options:
A. 老生
B. 新生
Correct Answer: B. 新生
[Target Item]
Sentence: {sentence}
Options:
A. {optionA}
B. {optionB}
Answer only with the following format:
Answer: A or B
  • Few-Shot Template
You are a coreference resolution expert.
See three labeled examples. Then resolve the target sentence.
[Example 1]
Sentence: 小明把椅子拉到钢琴前,但Ø是坏的。什么是坏的?
Options:
A. 椅子
B. 钢琴
Answer: B. 钢琴
[Example 2]
Sentence: 狐狸们晚上钻进来吃小鸡,Ø越来越惊慌。谁越来越惊慌?
Options:
A. 狐狸们
B. 小鸡
Answer: B. 小鸡
[Example 3]
Sentence: 老生欺负新生,所以Ø得到了我们的帮助。谁得到了我们的帮助?
Options:
A. 老生
B. 新生
Correct Answer: B. 新生
[Target Item]
Sentence: {sentence}
Options:
A. {optionA}
B. {optionB}
Answer only with the following format:
Answer: A or B
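
To illustrate how these templates might be instantiated and scored, the sketch below fills the zero-shot template for a single item and extracts the A/B label from a model reply. The item dictionary, helper names, and regular expression are assumptions introduced for exposition rather than the evaluation code used in this study; any chat-completion function (for example, the query_gpt4 sketch in Appendix A) could supply the reply.

```python
# Illustrative sketch only: instantiating the Appendix B zero-shot template and
# parsing the required "Answer: A or B" format. Field names are hypothetical.
import re
from typing import Optional

ZERO_SHOT_TEMPLATE = (
    "You are a coreference resolution expert.\n"
    "Determine who the zero pronoun (Ø) refers to in the following sentence.\n"
    "Sentence: {sentence}\n"
    "Options:\n"
    "A. {optionA}\n"
    "B. {optionB}\n"
    "Answer only with the following format:\n"
    "Answer: A or B"
)

def build_prompt(item: dict) -> str:
    """Instantiate the zero-shot template for one WSC-ZA item."""
    return ZERO_SHOT_TEMPLATE.format(
        sentence=item["sentence"], optionA=item["optionA"], optionB=item["optionB"]
    )

def parse_answer(reply: str) -> Optional[str]:
    """Return 'A' or 'B' if the reply follows the required format, else None."""
    match = re.search(r"Answer:\s*([AB])", reply)
    return match.group(1) if match else None

# Example usage with the one-shot demonstration item from Appendix B.
item = {"sentence": "老生欺负新生,所以Ø得到了我们的帮助。谁得到了我们的帮助?",
        "optionA": "老生", "optionB": "新生"}
print(build_prompt(item))
print(parse_answer("Answer: B"))  # -> 'B'
```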

Appendix C

CoT Prompt Template
The following CoT prompt format was used for each misclassified sentence:
You are a reasoning expert in Chinese discourse analysis.
The model made an incorrect prediction before. Now re-think the problem carefully step by step.
Sentence: {sentence}
Candidates:
A. {optionA}
B. {optionB}
Please reason step-by-step to decide who the zero pronoun (Ø) refers to and briefly explain why the previous model prediction was wrong.
Output format:
1. Step-by-step reasoning
2. Correct answer (A/B)
3. Short explanation in English
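
A corresponding sketch, again with hypothetical field and helper names, shows how this CoT template could be instantiated for a misclassified item before re-querying the model under the settings in Table A1; it is a minimal illustration, not the diagnostic script used in the study.

```python
# Illustrative sketch only: building the Appendix C CoT re-evaluation prompt
# for one previously misclassified WSC-ZA item. Field names are hypothetical.
COT_TEMPLATE = (
    "You are a reasoning expert in Chinese discourse analysis.\n"
    "The model made an incorrect prediction before. Now re-think the problem "
    "carefully step by step.\n"
    "Sentence: {sentence}\n"
    "Candidates:\n"
    "A. {optionA}\n"
    "B. {optionB}\n"
    "Please reason step-by-step to decide who the zero pronoun (Ø) refers to and "
    "briefly explain why the previous model prediction was wrong.\n"
    "Output format:\n"
    "1. Step-by-step reasoning\n"
    "2. Correct answer (A/B)\n"
    "3. Short explanation in English"
)

def build_cot_prompt(item: dict) -> str:
    """Instantiate the CoT diagnostic prompt for one misclassified item."""
    return COT_TEMPLATE.format(
        sentence=item["sentence"], optionA=item["optionA"], optionB=item["optionB"]
    )
```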

References

  1. Hartmann, R.R.K.; Stork, F.C. Dictionary of Language and Linguistics; John Wiley & Sons: Hoboken, NJ, USA, 1972. [Google Scholar]
  2. Huang, Y. Anaphora: A Cross-Linguistic Approach; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
  3. Konno, R.; Kiyono, S.; Matsubayashi, Y.; Ouchi, H.; Inui, K. Pseudo zero pronoun resolution improves zero anaphora resolution. arXiv 2021, arXiv:2104.07425. [Google Scholar] [CrossRef]
  4. Wei, T.; Li, J.; Ye, X.; Qu, W. Hierarchical Discourse-Semantic Modeling for Zero Pronoun Resolution in Chinese. Big Data Cogn. Comput. 2025, 9, 234. [Google Scholar] [CrossRef]
  5. Di Eugenio, B. Centering theory and the Italian pronominal system. In Proceedings of the COLING 1990 Volume 2: Papers Presented to the 13th International Conference on Computational Linguistics, Helsinki, Finland, 20–25 August 1990. [Google Scholar]
  6. Ferrández, A.; Palomar, M.; Moreno, L. An empirical approach to Spanish anaphora resolution. Mach. Transl. 1999, 14, 191–216. [Google Scholar] [CrossRef]
  7. Iida, R.; Inui, K.; Matsumoto, Y. Exploiting syntactic patterns as clues in zero-anaphora resolution. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (ACL’06), Sydney, Australia, 17–21 July 2006. [Google Scholar]
  8. Kim, Y.; Ra, D.; Lim, S. Zero-anaphora resolution in Korean based on deep language representation model: BERT. ETRI J. 2021, 43, 299–312. [Google Scholar] [CrossRef]
  9. Pu, M.-M. Zero anaphora and topic chain in Chinese discourse. In The Routledge Handbook of Chinese Discourse Analysis; Routledge: London, UK, 2019; pp. 188–200. [Google Scholar]
  10. Tao, L.; Healy, A.F. Zero anaphora: Transfer of reference tracking strategies from Chinese to English. J. Psycholinguist. Res. 2005, 34, 99–131. [Google Scholar] [CrossRef]
  11. Meng, L.; Li, Y.; Wei, W.; Yang, C. Resolving Linguistic Asymmetry: Forging Symmetric Multilingual Embeddings Through Asymmetric Contrastive and Curriculum Learning. Symmetry 2025, 17, 1386. [Google Scholar] [CrossRef]
  12. Zhao, S.; Ng, H.T. Identification and resolution of Chinese zero pronouns: A machine learning approach. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; pp. 541–550. [Google Scholar]
  13. Kong, F.; Zhou, G. A tree kernel-based unified framework for Chinese zero anaphora resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 882–891. [Google Scholar]
  14. Chen, C.; Ng, V. Chinese zero pronoun resolution: Some recent advances. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1360–1365. [Google Scholar]
  15. Yin, Q.; Zhang, W.; Zhang, Y.; Liu, T. A deep neural network for Chinese zero pronoun resolution. arXiv 2016, arXiv:1604.05800. [Google Scholar] [CrossRef]
  16. Yin, Q.; Zhang, Y.; Zhang, W.; Liu, T. Chinese zero pronoun resolution with deep memory network. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 1309–1318. [Google Scholar]
  17. Webster, K.; Recasens, M.; Axelrod, V.; Baldridge, J. Mind the GAP: A balanced corpus of gendered ambiguous pronouns. Trans. Assoc. Comput. Linguist. 2018, 6, 605–617. [Google Scholar] [CrossRef]
  18. Poesio, M.; Yu, J.; Paun, S.; Aloraini, A.; Lu, P.; Haber, J.; Cokal, D. Computational models of anaphora. Annu. Rev. Linguist. 2023, 9, 561–587. [Google Scholar] [CrossRef]
  19. Levesque, H.J. The Winograd schema challenge. In Proceedings of the AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, Stanford, CA, USA, 21–23 March 2011. [Google Scholar]
  20. Tang, M.; Zhang, L.; Yu, Z.; Shi, X.; Liu, X. Symmetry-and Asymmetry-Aware Dual-Path Retrieval and In-Context Learning-Based LLM for Equipment Relation Extraction. Symmetry 2025, 17, 1647. [Google Scholar] [CrossRef]
  21. Bernard, T.; Han, T. Mandarinograd: A Chinese collection of Winograd schemas. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 21–26. [Google Scholar]
  22. Xu, L.; Hu, H.; Zhang, X.; Li, L.; Cao, C.; Li, Y.; Xu, Y.; Sun, K.; Yu, D.; Yu, C. CLUE: A Chinese language understanding evaluation benchmark. arXiv 2020, arXiv:2004.05986. [Google Scholar] [CrossRef]
  23. Yang, N.; Zhang, J.; Ma, L.; Lu, Z. A study of zero anaphora resolution in Chinese discourse: From the perspective of psycholinguistics. Front. Psychol. 2021, 12, 663168. [Google Scholar] [CrossRef]
  24. Bi, M.; Liu, X.; Zhang, Q.; Yang, Z. Machine reading comprehension combined with semantic dependency for Chinese zero pronoun resolution. Artif. Intell. Rev. 2023, 56, 7597–7612. [Google Scholar] [CrossRef]
  25. Chen, C.; Ng, V. Chinese zero pronoun resolution: A joint unsupervised discourse-aware model rivaling state-of-the-art resolvers. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; Short Papers. Volume 2, pp. 320–326. [Google Scholar]
  26. Yin, Q.; Zhang, Y.; Zhang, W.; Liu, T.; Wang, W.Y. Zero pronoun resolution with attention-based neural network. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 13–23. [Google Scholar]
  27. Lin, P.; Yang, M. Hierarchical attention network with pairwise loss for Chinese zero pronoun resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 8352–8359. [Google Scholar]
  28. Liu, T.; Cui, Y.; Yin, Q.; Zhang, W.; Wang, S.; Hu, G. Generating and exploiting large-scale pseudo training data for zero pronoun resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, 30 July–4 August 2017; Long Papers. Volume 1, pp. 102–111. [Google Scholar]
  29. Song, L.; Xu, K.; Zhang, Y.; Chen, J.; Yu, D. ZPR2: Joint zero pronoun recovery and resolution using multi-task learning and BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5429–5434. [Google Scholar]
  30. Chen, S.; Gu, B.; Qu, J.; Li, Z.; Liu, A.; Zhao, L.; Chen, Z. Tackling zero pronoun resolution and non-zero coreference resolution jointly. In Proceedings of the 25th Conference on Computational Natural Language Learning, Online, 20–24 November 2021; pp. 518–527. [Google Scholar]
  31. Yang, J.; Xu, K.; Xu, J.; Li, S.; Gao, S.; Guo, J.; Wen, J.-R.; Xue, N. Transformer-GCRF: Recovering Chinese dropped pronouns with general conditional random fields. arXiv 2020, arXiv:2010.03224. [Google Scholar] [CrossRef]
  32. Wang, L.; Liu, S.; Xu, M.; Song, L.; Shi, S.; Tu, Z. A survey on zero pronoun translation. arXiv 2023, arXiv:2305.10196. [Google Scholar] [CrossRef]
  33. Le, N.T.; Ritter, A. Are large language models robust coreference resolvers? arXiv 2023, arXiv:2305.14489. [Google Scholar] [CrossRef]
  34. Ueyama, A.; Kano, Y. Dialogue Response Generation Using Completion of Omitted Predicate Arguments Based on Zero Anaphora Resolution. In Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Prague, Czechia, 11–15 September 2023; pp. 282–296. [Google Scholar]
  35. Levesque, H.J.; Davis, E.; Morgenstern, L. The Winograd schema challenge. KR 2012, 2012, 3. [Google Scholar]
  36. Huang, Y.; Zhu, S.; Liu, W.; Wang, J.; Wei, X. Addressing Asymmetry in Contrastive Learning: LLM-Driven Sentence Embeddings with Ranking and Label Smoothing. Symmetry 2025, 17, 646. [Google Scholar] [CrossRef]
  37. Chen, S. Resolving Chinese anaphora with ChatGPT. In Proceedings of the 2024 International Conference on Asian Language Processing (IALP), Hohhot, China, 4–6 August 2024; pp. 31–36. [Google Scholar]
  38. Ernst, T.; Wang, C. Object preposing in mandarin chinese. J. East Asian Linguist. 1995, 4, 235–260. [Google Scholar] [CrossRef]
  39. Hao, Y. Explicitation of personal pronoun subject in Chinese EFL majors’ translation: A case study of translation universals based on PACCEL-W corpus. J. Lang. Teach. Res. 2015, 6, 669. [Google Scholar] [CrossRef]
  40. Li, C.N.; Thompson, S.A. Mandarin Chinese: A Functional Reference Grammar; University of California Press: Berkeley, CA, USA, 1989. [Google Scholar]
  41. Schütze, C.T. The Empirical Base of Linguistics: Grammaticality Judgements and Linguistic Methodology; University of Chicago Press: Chicago, IL, USA, 1996. [Google Scholar]
  42. Sprouse, J.; Schütze, C.T.; Almeida, D. A comparison of informal and formal acceptability judgments using a random sample from Linguistic Inquiry 2001–2010. Lingua 2013, 134, 219–248. [Google Scholar] [CrossRef]
  43. Cowart, W. Experimental Syntax: Applying Objective Methods to Sentence Judgments; Sage Publications: Thousand Oaks, CA, USA, 1997. [Google Scholar]
  44. Brezina, V. Classical monofactorial (parametric and non-parametric) tests. In A Practical Handbook of Corpus Linguistics; Springer: Berlin/Heidelberg, Germany, 2021; pp. 473–503. [Google Scholar]
  45. Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
  46. Pustejovsky, J.; Stubbs, A. Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2012. [Google Scholar]
  47. Sakaguchi, K.; Le Bras, R.; Bhagavatula, C.; Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 8732–8740. [Google Scholar]
  48. Davis, E.; Morgenstern, L.; Ortiz, C. Human Tests of Materials for the Winograd Schema Challenge; New York University: New York, NY, USA, 2016. Available online: https://cs.nyu.edu/faculty/davise/papers/WS2016SubjectTests.pdf (accessed on 21 December 2025).
  49. Kehler, A. Coherence, Reference, and the Theory of Grammar; CSLI Publications: Stanford, CA, USA, 2002; Volume 380. [Google Scholar]
  50. Emami, A.; De La Cruz, N.; Trischler, A.; Suleman, K.; Cheung, J.C.K. A knowledge hunting framework for common sense reasoning. arXiv 2018, arXiv:1810.01375. [Google Scholar] [CrossRef]
  51. Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378. [Google Scholar] [CrossRef]
  52. Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Long Papers. Volume 1, pp. 320–335. [Google Scholar]
  53. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  54. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  55. Kojima, T.; Gu, S.S.; Reid, M.; Matsuo, Y.; Iwasawa, Y. Large language models are zero-shot reasoners. Adv. Neural Inf. Process. Syst. 2022, 35, 22199–22213. [Google Scholar]
  56. Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv 2022, arXiv:2202.12837. [Google Scholar] [CrossRef]
  57. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  58. Stano, P.; Horák, A. Evaluating Prompt-Based and Fine-Tuned Approaches to Czech Anaphora Resolution. In Proceedings of the International Conference on Text, Speech, and Dialogue, Erlangen Nürnberg, Germany, 25–28 August 2025; pp. 190–202. [Google Scholar]
  59. Madotto, A.; Lin, Z.; Winata, G.I.; Fung, P. Few-shot bot: Prompt-based learning for dialogue systems. arXiv 2021, arXiv:2110.08118. [Google Scholar]
  60. Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
  61. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  62. Nye, M.; Andreassen, A.J.; Gur-Ari, G.; Michalewski, H.; Austin, J.; Bieber, D.; Dohan, D.; Lewkowycz, A.; Bosma, M.; Luan, D. Show your work: Scratchpads for intermediate computation with language models. arXiv 2021, arXiv:2112.00114. [Google Scholar] [CrossRef]
  63. Novikova, J.; Dušek, O.; Rieser, V. RankME: Reliable human ratings for natural language generation. arXiv 2018, arXiv:1803.05928. [Google Scholar] [CrossRef]
  64. Schuff, H.; Vanderlyn, L.; Adel, H.; Vu, N.T. How to do human evaluation: A brief introduction to user studies in NLP. Nat. Lang. Eng. 2023, 29, 1199–1222. [Google Scholar] [CrossRef]
  65. Wu, L.; Wei, H.-R.; Lin, H.; Li, T.; Yang, B.; Huang, F.; Lu, W. Enhancing LLM language adaption through cross-lingual in-context pre-training. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; pp. 27140–27154. [Google Scholar]
  66. Poesio, M.; Stuckardt, R.; Versley, Y. Anaphora Resolution; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  67. Zhu, X.; Zhou, Z.; Charlow, S.; Frank, R. Do LLMs Understand Anaphora Accessibility? Soc. Comput. Linguist. 2025, 8, 36. [Google Scholar]
  68. Kocijan, V.; Davis, E.; Lukasiewicz, T.; Marcus, G.; Morgenstern, L. The defeat of the Winograd schema challenge. Artif. Intell. 2023, 325, 103971. [Google Scholar] [CrossRef]
  69. Rupprecht, J.; Ahnert, G.; Strohmaier, M. Prompt perturbations reveal human-like biases in llm survey responses. arXiv 2025, arXiv:2507.07188. [Google Scholar] [CrossRef]
  70. Mondorf, P.; Plank, B. Beyond accuracy: Evaluating the reasoning behavior of large language models--A survey. arXiv 2024, arXiv:2404.01869. [Google Scholar]
  71. Jones, E.; Patrawala, A.; Steinhardt, J. Uncovering gaps in how humans and LLMs interpret subjective language. arXiv 2025, arXiv:2503.04113. [Google Scholar] [CrossRef]
  72. Sun, Y.; Zhang, C.; Wang, C.; Han, Y. MIRA-ChatGLM: A Fine-Tuned Large Language Model for Intelligent Risk Assessment in Coal Mining. Appl. Sci. 2024, 14, 12072. [Google Scholar] [CrossRef]
  73. Davis, E. Logical formalizations of commonsense reasoning: A survey. J. Artif. Intell. Res. 2017, 59, 651–723. [Google Scholar] [CrossRef]
  74. Zhang, H.; Zhao, X.; Song, Y. WinoWhy: A deep diagnosis of essential commonsense knowledge for answering Winograd schema challenge. arXiv 2020, arXiv:2005.05763. [Google Scholar] [CrossRef]
  75. Chen, H.-C.; Cheung, H.; Tang, S.L.; Wong, Y.T. Effects of antecedent order and semantic context on Chinese pronoun resolution. Mem. Cogn. 2000, 28, 427–438. [Google Scholar] [CrossRef]
  76. Prajapati, P.; Goyal, V.; Kaur, K. A detailed study on anaphora resolution system for asian languages. SN Comput. Sci. 2024, 5, 811. [Google Scholar] [CrossRef]
  77. Bian, N.; Han, X.; Sun, L.; Lin, H.; Lu, Y.; He, B.; Jiang, S.; Dong, B. Chatgpt is a knowledgeable but inexperienced solver: An investigation of commonsense problem in large language models. arXiv 2023, arXiv:2303.16421. [Google Scholar]
  78. Xu, X.; Zhou, X. Topic shift impairs pronoun resolution during sentence comprehension: Evidence from event-related potentials. Psychophysiology 2016, 53, 129–142. [Google Scholar] [CrossRef]
  79. Madge, C.; Purver, M.; Poesio, M. Referential ambiguity and clarification requests: Comparing human and LLM behaviour. In Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference, Suzhou, China, 9 November 2025; pp. 1–11. [Google Scholar]
Figure 2. Item-level accuracy heatmap.
Figure 3. Error overlap across models.
Table 1. Key distinctions between OntoNotes zero pronouns and WSC-ZA minimal pairs.
Dimension | OntoNotes Zero Pronouns | WSC-ZA Minimal Pairs
antecedent options | usually single | always two competing
ambiguity source | natural discourse | controlled pragmatic alternation
inference demand | recovery | reasoning
difficulty nature | uncontrolled, uneven | systematic, maximal
task type | resolution | inference
Table 2. A summary of deleted and modified sentences of WSC-ZA.
Sentences | Number | Type | Example
original sentences | 284 | - | -
sentences removed for grammatical reasons | 6 | objects of the ba-construction | see example (4)
 | 20 | most prenominal possessors | see example (5)
 | 8 | direct-object pronouns of transitive verbs | see example (6)
 | 10 | others | -
modified sentences | 2 | adding stance adverbials | see example (7)
 | 2 | inserting copular elements | see example (8)
 | 2 | rephrasing predicates | see example (9)
 | 8 | a few prenominal possessors | see example (10)
Table 3. The distribution of naturalness ratings across difficulty levels.
Difficulty Level | Mean | Standard Deviation
Easy | 4.73 | 0.44
Moderate | 3.05 | 0.47
Hard | 1.75 | 0.44
Table 4. Human evaluation metrics for WSC-ZA.
Metric | Mean Accuracy | SD | N | 95% CI Lower | 95% CI Upper
Subject-level Accuracy (Human Upper Bound) | 0.926 | 0.037 | 24 | 0.9106 | 0.9422
Item-level Accuracy | 0.926 | 0.165 | 240 | 0.9054 | 0.9474
Overall Micro Accuracy | 0.926 | - | 1440 | - | -
Table 5. Overall model performance.
Model | Zero-Shot | One-Shot | Few-Shot
GPT-4 | 0.883 | 0.892 | 0.883
ChatGLM-4 | 0.921 | 0.929 | 0.929
LLaMA-3 | 0.849 | 0.792 | 0.824
Table 6. Macro-averaged precision, recall, and F1-score comparison across models and prompting conditions. (Values are rounded to three decimal places for simplicity; some metrics differ only beyond the third decimal place. For example, GPT-4's one-shot metrics differ starting at the fourth decimal place, LLaMA-3's at the fifth, and ChatGLM-4's zero-shot and few-shot metrics at the fifth. Since these differences are minimal, the table does not report values beyond the third decimal place.)
Model | Prompting Condition | Macro-Precision | Macro-Recall | Macro-F1-Score
GPT-4 | Zero-shot | 0.883 | 0.883 | 0.883
GPT-4 | One-shot | 0.892 | 0.892 | 0.892
GPT-4 | Few-shot | 0.883 | 0.883 | 0.883
ChatGLM-4 | Zero-shot | 0.913 | 0.913 | 0.912
ChatGLM-4 | One-shot | 0.933 | 0.933 | 0.933
ChatGLM-4 | Few-shot | 0.929 | 0.929 | 0.929
LLaMA-3 | Zero-shot | 0.850 | 0.846 | 0.845
LLaMA-3 | One-shot | 0.779 | 0.779 | 0.779
LLaMA-3 | Few-shot | 0.826 | 0.821 | 0.820
Table 7. Paired t-test results for model comparisons.
Model Comparison | t-Statistic | p-Value | Significance
GPT-4 vs. ChatGLM-4 | −2.20 | 0.029 | Statistically Significant
GPT-4 vs. LLaMA-3 | 2.73 | 0.007 | Statistically Significant
ChatGLM-4 vs. GPT-4 | 2.20 | 0.029 | Statistically Significant
ChatGLM-4 vs. LLaMA-3 | 4.65 | 5.61 × 10⁻⁶ | Highly Statistically Significant
LLaMA-3 vs. GPT-4 | −2.73 | 0.007 | Statistically Significant
LLaMA-3 vs. ChatGLM-4 | −4.65 | 5.61 × 10⁻⁶ | Highly Statistically Significant
Table 8. Classification of the same error types across three models (percentages sum to more than 100% because a single sentence may exhibit multiple error characteristics).
Error Type | Definition | Example | Number (N = 19) and Percentage
Surface Cue Misguidance | Model is misled by linear proximity, salience bias, or shallow lexical cues. | 小明把椅子拉到钢琴旁,但是Ø是坏的,所以他只好唱歌。 Sam pulled up a chair to the piano, but it was broken, so he had to sing instead. (Ø = piano) | 16 (84.2%)
Semantic Role Conflict | Confusion of agent/theme/experiencer roles; ambiguous predicate–argument structure. | 志明对老王很生气,因为Ø买自他的吐司机是坏的。 Frank was upset with Tom because the toaster he had bought from him didn’t work. (Ø = Frank) | 14 (73.7%)
Pragmatic & Commonsense Reasoning | Requires world knowledge: spatial, temporal, emotional, property, container–content, motivation, stereotypes. | 这张桌子不能通过门口,因为Ø太窄了。 The table won’t fit through the doorway because it is too narrow. (Ø = doorway) | 13 (68.4%)
Topic-Chain Ambiguity | Weak or disrupted discourse topic continuity; unclear topic prominence. | 建国是唯一在世而且记得我父亲婴儿时样子的人。第一次见到我父亲时,Ø才十二个月大。 Fred is the only man alive who still remembers my father as an infant. When Fred first saw my father, he was twelve months old. (Ø = my father) | 6 (31.6%)
Logical Connective Misinterpretation | Causal/concessive markers (因为 because, 但是 but, 尽管 although) incorrectly parsed. | 建国得到了免费的歌剧票,但他给了李广,因为Ø特别想看。 George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it. (Ø = Eric) | 5 (26.3%)
Long-Distance Dependency | Antecedent requires maintaining reference over longer spans; missing cues. | 建国是唯一在世而且记得我父亲婴儿时样子的人。第一次见到我父亲时,Ø才十二个月大。 Fred is the only man alive who still remembers my father as an infant. When Fred first saw my father, he was twelve months old. (Ø = my father) | 3 (15.8%)
Syntactic Ambiguity | Multiple grammatically valid antecedents; structural ambiguity. | 李静很乐意用她的毛衣交换我的夹克。她觉得Ø穿在身上显得很土气。 Grace was happy to trade me her sweater for my jacket. She thinks it looks dowdy on her. (Ø = sweater) | 2 (10.5%)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
