Enhancing Commentary Strategies for Guandan: A Study of LLMs in Game Commentary Generation

Su, Jiayi; Tao, Meiling; Liang, Xuechen; He, Yangfan; Tao, Yiling; Zhang, Miao

doi:10.3390/sym17081274


by Jiayi Su ^1,†, Meiling Tao ^2,†, Xuechen Liang ^3,†, Yangfan He ^4, Yiling Tao ^5 and Miao Zhang ^6,*

^1 School of Computing and Data Science, Xiamen University Malaysia, Sepang 43900, Malaysia
^2 Department of Computer Science, Guangdong University of Technology, Guangzhou 510006, China
^3 Department of Computer Science, East China Jiaotong University, Nanchang 330013, China
^4 Department of Computer Science & Engineering, University of Minnesota, Twin Cities, Minneapolis, MN 55414, USA
^5 Department of Computer Science, Sun Yat-sen University, Zhuhai 519082, China
^6 Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
^* Author to whom correspondence should be addressed.
^† These authors contributed equally to this work.

Symmetry 2025, 17(8), 1274; https://doi.org/10.3390/sym17081274

Submission received: 19 June 2025 / Revised: 20 July 2025 / Accepted: 23 July 2025 / Published: 8 August 2025

(This article belongs to the Section Computer)


Abstract

Recent advancements in large language models (LLMs) have unlocked the potential for generating high-quality game commentary. However, producing insightful and engaging commentary for complex games with incomplete information remains a significant challenge. In this paper, we introduce a novel commentary method that combines reinforcement learning (RL) and LLMs, tailored specifically for the Chinese card game Guandan. Our system leverages RL to generate intricate card-playing scenarios and employs LLMs to generate corresponding commentary text, effectively emulating the strategic analysis and narrative prowess of professional commentators. The framework comprises a state commentary guide, a Theory of Mind (ToM)-based strategy analyzer, and a style retrieval module, which seamlessly collaborate to deliver detailed and context-relevant game commentary in the Chinese language environment. We empower LLMs with ToM capabilities and refine both retrieval and information filtering mechanisms. This facilitates the generation of personalized commentary content. Our experimental results demonstrate a significant improvement in the system’s effectiveness in generating accurate, coherent, and engaging commentary when applied to open-source LLMs, surpassing GPT-4 across multiple evaluation metrics.

Keywords: large language model; retrieval-augmented generation; Guandan commentary; commentary strategies

1. Introduction

Large language models (LLMs) have made significant advances in the field of natural language processing in recent years, particularly demonstrating exceptional capabilities in complex text generation and context understanding. This opens up new possibilities in the field of game commentary generation, enabling the production of insight-rich and contextually detailed commentary. The concept of symmetry plays a crucial role in this domain, as it pertains to the balanced and harmonious generation of commentary, mirroring the game’s strategic elements while maintaining a consistent, unbiased narrative flow. This alignment with symmetry ensures that the generated commentary remains coherent and well structured, reflecting the underlying order and dynamics of the game itself. However, generating high-quality game commentary still faces multiple challenges, including a deep understanding of game rules and strategies, the smooth generation of real-time commentary, and engaging narrative skills.

Previous work has made some progress in game commentary generation. For example, refs. [1,2] report preliminary explorations into chess commentary, but these methods are mainly based on simple rules and are limited by the scale and quality of datasets. Ref. [3] applies more advanced techniques to Shogi, generating commentary by extracting key terms from game states and integrating them with language models, marking a step towards more complex generation methods. However, game commentary generation still faces a series of challenges, especially in handling incomplete information games [4], simulating human commentators’ advanced strategy analysis, and operating in non-English environments. Moreover, existing methods still fall short in data-driven deep learning applications, limiting the naturalness and richness of the generated commentary [5,6].

To assess the performance of LLMs in cooperative and incomplete information environments, we select the popular Chinese card game Guandan (https://en.wikipedia.org/wiki/Guandan, accessed on 1 July 2025). The game’s objective is for two players on the same team to play out all their cards as quickly as possible to score higher team points. Consistent with recent work, we evaluate commercial and Chinese open-source models in a zero-training setup on the playing behavior of the state-of-the-art RL agent Danzero+ [7], assessing their ability to understand and interpret agent actions in a cooperative, incomplete information environment.

In this study, we develop a novel commentary method that combines RL and LLMs, specifically designed for the complex, incomplete information card game Guandan. By using RL to generate complex card-playing scenarios and LLMs to generate the corresponding commentary text, our system effectively simulates the strategy analysis and narrative skills of professional commentators. The system consists of several core components, including a state commentary guide, a ToM-based strategy analyzer [8], and a style retrieval module, as illustrated in Figure 1. These modules work together to provide detailed and context-relevant game commentary, while optimizing commentary effects in the Chinese environment. Experimental results show that our commentary framework significantly enhances the performance of open-source LLMs, which outperform GPT-4 on multiple evaluation metrics.

The main contributions of this work are as follows:

We introduce a novel commentary method that combines Reinforcement Learning (RL) and Large Language Models (LLMs) to enhance commentary generation for the complex card game Guandan.
We equip LLMs with Theory of Mind (ToM) capabilities, optimizing retrieval and information filtering mechanisms. This allows LLMs to effectively analyze the game-playing processes of RL agents and produce more personalized, context-aware commentary.
We design a specialized commentary agent framework tailored for Guandan, a card game characterized by incomplete information. This framework enables the agent to handle complex game situations and generate insightful commentary without requiring specific training in the game environment.
Our experimental results demonstrate that the proposed framework significantly enhances the performance of open-source LLMs, outperforming GPT-4 across multiple evaluation metrics.

In this paper, we first introduce the problem of generating game commentary for the Chinese card game Guandan, focusing on the challenges posed by incomplete information and the need for strategic analysis. In Section 2, we review related work in the field of game commentary generation, including both traditional rule-based methods and more recent approaches using large language models (LLMs). Section 3 details the methodology, where we present our proposed framework, including the key components of our system, such as reinforcement learning (RL) integration and theory of mind (ToM) for strategic analysis. The experimental setup, evaluation metrics, and results are discussed in Section 4, followed by a summary of conclusions and potential future work in Section 5.

2. Related Work

2.1. Game Commentary Generation

Research on game commentary generation has progressed significantly in recent years. Initially, due to the lack of large-scale training data, researchers primarily used rule-based methods, such as generating commentary through rules in sports and chess games to enhance the viewer experience [1,2]. With the development of deep learning technologies, neural networks began to be applied to commentary generation, for example to produce detailed commentary for soccer and baseball games [9,10], and encoder-decoder models [6] and hierarchical models were used to generate commentary for eSports [11] games like League of Legends. Additionally, the rise of LLMs brought a new dimension to game commentary, handling multiple tasks through zero-shot learning [12] and providing a superior user experience.

Researchers have also explored methods combining visual and structured data, such as in racing games, where ref. [13] generates automatic commentary by combining multiple data sources. Meanwhile, ref. [14] introduces a large-scale commentary dataset in the field of chess, improving the accuracy and fluency of the generated commentary. More recently, ref. [15] applies LLMs to the generation of commentary for fighting games, demonstrating the model’s potential in generating diverse and preferential commentary. Building on these findings, this paper explores methods that combine reinforcement learning and LLMs in Guandan, a complex card game with incomplete information, to further advance the field of game commentary generation.

2.2. Retrieval-Augmented Generation

In the field of Retrieval-Augmented Generation (RAG), significant progress has been made recently. One study [16] introduces the RAGAs evaluation framework for rapid assessment of Retrieval-Augmented Generation system performance, which is particularly suitable for the rapid deployment of LLMs.

Another study investigates the domain adaptability of RAG models and proposes the RAG-end2end approach, enabling the models to adapt to specific domain knowledge bases [17]. Additionally, researchers have proposed MuRAG, a multi-modal Retrieval-Augmented Generator. It leverages external non-parametric multi-modal memory for enhanced language generation, demonstrating outstanding cross-dataset performance in tasks involving image and text-based questions and answers [18]. Finally, another study develops the ARM-RAG system, which enhances problem-solving performance by storing and retrieving inference chains. This effectively boosts the intelligence of large language models while reducing training costs [19]. These advancements demonstrate the potential of RAG technology in enhancing the accuracy of knowledge access, the quality of generation, and the narrative ability. Our research explores the integration of RAG with domain-specific strategies to enhance the strategic depth of game commentary.

3. Method

This study introduces a modular approach (Figure 1) that enables a commentary agent based on LLMs to deliver detailed commentaries on cooperation and strategic interactions with opponent agents in Guandan, a card game of incomplete information, without the need for specialized training in Chinese text environments. This task is broken down into several core components, including a State Commentary Guider, ToM Strategy Analyzer, and Style Retriever. The inputs received by the LLM include game rules, observation rules (i.e., prompts that guide the LLM to convert low-level game state information into readable text), and historical game context. We describe in detail how to guide the LLM to use its knowledge, reasoning capabilities, and ToM to perform these module functions and to navigate the complexities inherent in incomplete information games.

The commentary generation process follows a structured pipeline: Game State + History + Rules → State Commentary Guider → $\mathrm{Obs}_r$, $\mathrm{His}_r$ → ToM Analyzer (First/Second-Order Prompted Inference) → Style Retrieval and CoT Filtering → Commentary Coordinator → Final Output. This design enables the system to progressively transform raw game inputs into strategic, stylistically aligned, and context-aware commentary through a sequence of reasoning and retrieval-enhanced modules.
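The pipeline above can be read as a chain of stages that each enrich a shared context. The following is a minimal sketch of that data flow only, not the authors' implementation: every function name and dictionary key is hypothetical, and each stage body is a string-formatting stand-in for what would be an LLM or retrieval call.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]           # shared record passed from stage to stage
Stage = Callable[[Context], Context]


def state_commentary_guider(ctx: Context) -> Context:
    # Stand-in for the LLM call that turns raw inputs into Obs_r and His_r.
    ctx["obs_r"] = f"Observed: {ctx['state']}"
    ctx["his_r"] = f"So far: {ctx['history']}"
    return ctx


def tom_analyzer(ctx: Context) -> Context:
    # Stand-in for first- and second-order prompted inference.
    ctx["tom"] = [
        f"first-order view of ({ctx['obs_r']})",
        f"second-order view of ({ctx['his_r']})",
    ]
    return ctx


def style_retrieval_and_filtering(ctx: Context) -> Context:
    # Stand-in for retrieval plus CoT filtering; here it only drops empty snippets.
    ctx["style"] = [snippet for snippet in ctx["corpus"] if snippet]
    return ctx


def commentary_coordinator(ctx: Context) -> Context:
    # Merge all intermediate outputs into the final commentary string.
    parts = [ctx["obs_r"], *ctx["tom"], *ctx["style"]]
    ctx["commentary"] = " ".join(parts)
    return ctx


PIPELINE: List[Stage] = [
    state_commentary_guider,
    tom_analyzer,
    style_retrieval_and_filtering,
    commentary_coordinator,
]


def generate_commentary(state, history, rules, corpus) -> str:
    ctx: Context = {"state": state, "history": history, "rules": rules, "corpus": corpus}
    for stage in PIPELINE:
        ctx = stage(ctx)
    return str(ctx["commentary"])
```

Keeping the stages as a plain list makes the ordering explicit and lets a single stage be replaced (for example, swapping the retrieval stand-in for a real index) without touching the others.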

3.1. State Commentary Guider

In commentary for the Guandan game, the process starts with the State Commentary Guider, which converts the low-level game state information into a preliminary commentary text. This initial text provides basic context about the game, such as which cards were played and the current situation. Next, the Theory of Mind (ToM)-Based Strategy Analyzer analyzes this commentary to identify the psychological state of the players and predict their next moves. This step is crucial in adding depth to the commentary by providing strategic insights about players’ likely actions. After this, the Style Retrieval Module applies a tree-based retrieval method to filter the commentary generated so far, selecting the text that aligns with the desired commentary style (e.g., formal or casual, detailed or high-level). Finally, the Commentary Coordinator integrates all the components to produce a coherent, contextually rich commentary. This step ensures that the final output is fluent, informative, and logically consistent with the game’s progression.

A State Commentary Guide module is proposed, aimed at assisting the language model in transforming the game state into readable commentary text. This module integrates game rules, observation transformation rules, and historical transformation rules, guiding the model to convert the current low-level state and historical information into descriptive text to support the commentary process. Please refer to Appendix A for detailed rules. We design structured prompts to help the LLM understand the game rules and current state information, including card types, scoring rules, and win/loss conditions.

An example of the game rule template is as follows:

General Rules: A brief game introduction, team position rules, all single cards, and cards ranking.
Card Type that cannot beat other types: {Description of Card Type 1, Example of Card Type 1}, {Description of Card Type 2, Example of Card Type 2}, ...;
Card Type that can beat other types: {Description of Card Type 1, Example of Card Type 1}, {Description of Card Type 2, Example of Card Type 2}, ...;
Single Win/Loss Rule: The scoring rules for four players in a single game, with different combinations of cards being played out in different order.
Whole Win/Loss Rule: The overall win/loss conditions.

We also design templates for observation rules and historical transformation rules, which include input interpretations and transformation prompts. Through these templates, the model can convert low-level game states and history into vivid commentary text, making it easier for the audience to understand the game’s progression and players’ strategies. The template is shown as follows:

Input Explanation: The input types, like dictionaries and lists, are clearly specified. A description of every component in the input is also provided.
Conversion Tips: Additional instructions guide converting the low-level game state representations into natural language descriptions.
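The observation-rule template above can be assembled into a prompt programmatically. The sketch below is illustrative only: the template wording, the function name `build_observation_prompt`, and the state fields (`hand`, `last_play`) are assumptions, since the exact prompts are given in Appendix A rather than here.

```python
import json

# Hypothetical prompt layout following the two template parts described above.
OBS_RULE_TEMPLATE = (
    "Input Explanation:\n{input_explanation}\n\n"
    "Conversion Tips:\n{conversion_tips}\n\n"
    "Game state:\n{state}\n\n"
    "Describe the state above in natural language."
)


def build_observation_prompt(state: dict, input_explanation: str, conversion_tips: str) -> str:
    """Fill the template with a low-level state serialized as JSON."""
    return OBS_RULE_TEMPLATE.format(
        input_explanation=input_explanation.strip(),
        conversion_tips=conversion_tips.strip(),
        # ensure_ascii=False keeps Chinese text readable in the prompt.
        state=json.dumps(state, ensure_ascii=False, indent=2),
    )


example_prompt = build_observation_prompt(
    state={"hand": ["3H", "3S", "KD"], "last_play": {"player": 2, "cards": ["5C", "5D"]}},
    input_explanation="The input is a dictionary; 'hand' is a list of card codes.",
    conversion_tips="Name card types (single, pair, bomb) instead of listing raw codes.",
)
```

The resulting string would be passed to the LLM together with the game rule text; the history conversion prompt can be built the same way from its own explanation and tips.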

We can effectively convert low-level game states and history records into human-readable text, denoted as $\mathrm{Obs}_r$ and $\mathrm{His}_r$, respectively, by employing the game rule, observation conversion rule, and history conversion rule. In detail, $\mathrm{Obs}_r$ refers to a sequence of observations, indexed from 1 to $N$, where $N$ is the length of the generated commentary. This process enhances the clarity and comprehensibility of the game commentary, providing more vivid and understandable content for the audience. The conditional distribution for each element $\mathrm{Obs}_r[j]$ within the generated text can be modeled using the prompts $\mathrm{Prompt}_{obs}$ as:

$$P(\mathrm{Obs}_r) \sim \prod_{j=1}^{N} \mathrm{LM}_{\theta}\left(\mathrm{Obs}_r[j] \mid \mathrm{Prompt}_{obs}, \mathrm{Rule}, \mathrm{Rule}_{obs}, \mathrm{Obs}_r[1, \dots, j-1]\right) \tag{1}$$

Here, $\mathrm{LM}_{\theta}$ represents the language model parameterized by $\theta$, and $N$ is the length of the generated text $\mathrm{Obs}_r$. This State Commentary Guide module provides crucial support for the model to commentate in games of incomplete information. A theoretical guarantee for the mechanism design is provided in the following content.
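Equation (1) is the usual autoregressive factorization: the probability of the whole text is the product of per-token conditionals, each conditioned on the fixed prompt and rules plus the tokens generated so far. A minimal sketch of that computation in log space follows. The `toy_lm` function is a hypothetical stand-in that returns a uniform distribution over a three-word vocabulary; a real model would use the context and prefix.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface: (fixed context, tokens so far) -> next-token distribution.
NextTokenDist = Callable[[str, Sequence[str]], Dict[str, float]]


def sequence_log_prob(tokens: Sequence[str], context: str, lm: NextTokenDist) -> float:
    """Sum of log LM(token_j | context, tokens[:j]), i.e. the log of the product in Eq. (1)."""
    total = 0.0
    for j, token in enumerate(tokens):
        dist = lm(context, tokens[:j])
        total += math.log(dist[token])
    return total


def toy_lm(context: str, prefix: Sequence[str]) -> Dict[str, float]:
    # Toy assumption: ignores its inputs and spreads probability uniformly.
    vocab = ["pair", "bomb", "pass"]
    return {word: 1.0 / len(vocab) for word in vocab}


log_p = sequence_log_prob(["pair", "pass"], "Prompt_obs + Rule + Rule_obs", toy_lm)
```

Working with sums of logarithms rather than the raw product avoids numerical underflow for long commentary texts.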

The theory of mechanism design is crucial for designing LLM-based frameworks, as it provides a principled way to align model outputs with desired objectives through incentive-compatible structures. Below is a detailed proof, based on mathematical mechanism design, that Compliant Commentary Generation is feasible.

Each state–rule pair $(s_t, R)$ is encoded as a vector $v_t$. It is then proven that the covered space $V \subset \mathbb{R}^d$ is compact (since both $\mathcal{S}$ and $\mathcal{R}$ are finite). By the universal approximation theorem [20], there exists a neural network $\mathrm{LM}_{\theta}$ such that:

$$\sup_{v \in V} \left\| \mathrm{LM}_{\theta}(v) - P(c \mid v) \right\| < \varepsilon \tag{2}$$

where $c$ denotes a Compliant Commentary Generation. The generation process forms a Markov chain whose stationary distribution $\pi(c)$ satisfies:

$$\pi(c) = \sum_{v} P(c \mid v) P(v) \tag{3}$$

After that, the convergence theorem is guaranteed by the Perron–Frobenius theorem [21],

$$P\left(\mathrm{LM}_{\theta^{*}}(s, \mathrm{His}_r) \text{ is compliant}\right) \geq 1 - \varepsilon. \tag{4}$$

Theorem 1.

Let $\mathcal{S}$ be a finite state space and $\mathcal{R}$ a finite rule set. Then, each state–rule pair $(s_t, R)$ can be encoded as a vector $v_t \in \mathbb{R}^d$, such that the space of all encoded vectors $V \subset \mathbb{R}^d$ is compact.

Proof. It is to be shown that the set

$$V = \left\{ v_t \in \mathbb{R}^d \mid v_t \text{ encodes the state–rule pair } (s_t, R),\ s_t \in \mathcal{S},\ R \in \mathcal{R} \right\} \tag{5}$$

is compact. Since both $\mathcal{S}$ and $\mathcal{R}$ are finite, the Cartesian product $\mathcal{S} \times \mathcal{R}$ is finite. The encoding function

$$f : \mathcal{S} \times \mathcal{R} \to \mathbb{R}^d, \quad (s_t, R) \mapsto v_t \tag{6}$$

assigns a vector $v_t$ in $\mathbb{R}^d$ to each state–rule pair. Therefore, the set $V = f(\mathcal{S} \times \mathcal{R})$ must be finite because it is the image of a finite set under the function $f$.

Any finite subset of $\mathbb{R}^d$ is bounded. In particular, there exists some $M > 0$ such that for every $v_t \in V$, the norm satisfies $\| v_t \| \leq M$.

Every convergent sequence in a finite set is eventually constant, so its limit belongs to the set. Thus, $V$ contains all of its limit points, implying that $V$ is closed.

Then, in $\mathbb{R}^d$, the Heine–Borel theorem [22] tells us that a subset is compact if, and only if, it is closed and bounded. Since we have shown that $V$ is both closed and bounded, it follows that $V$ is compact.

Theorem 2.

Let $V \subset \mathbb{R}^d$ be a compact set, and let $P(c \mid v)$ denote the conditional distribution over compliant commentary $c$ given input vector $v \in V$. Then, by the universal approximation theorem [20], there exists a neural network $\mathrm{LM}_{\theta}$ such that:

$$\sup_{v \in V} \left\| \mathrm{LM}_{\theta}(v) - P(c \mid v) \right\| < \varepsilon, \tag{7}$$

for any $\varepsilon > 0$, where $\mathrm{LM}_{\theta}(v)$ approximates the target distribution $P(c \mid v)$.

Proof. Assume that the target function

$$f(v) = P(c \mid v),$$

which returns the probability (or an appropriate representation) of a Compliant Commentary Generation $c$ based on the encoding $v$, is continuous on the compact set $V$.

The universal approximation theorem (see [20]) asserts that for any continuous function defined on a compact subset of $\mathbb{R}^d$, and for any $\varepsilon > 0$, there exists a feedforward neural network $\mathrm{LM}_{\theta}$ with appropriate architecture and parameters $\theta$ such that

$$\sup_{v \in V} \left\| \mathrm{LM}_{\theta}(v) - f(v) \right\| < \varepsilon. \tag{8}$$

Substituting $f(v) = P(c \mid v)$ yields

$$\sup_{v \in V} \left\| \mathrm{LM}_{\theta}(v) - P(c \mid v) \right\| < \varepsilon. \tag{9}$$

Theorem 3.

Let $c$ denote a compliant commentary generation. Suppose the generation process induced by the language model $\mathrm{LM}_{\theta}$ forms a Markov chain over the commentary space. Then, the stationary distribution $\pi(c)$ of this process is given by:

$$\pi(c) = \sum_{v} P(c \mid v) P(v), \tag{10}$$

where $v$ denotes the encoded state–rule representations. Furthermore, by the Perron–Frobenius theorem [21], the Markov chain converges to $\pi(c)$, and the learned model $\mathrm{LM}_{\theta^{*}}$ satisfies:

$$P\left(\mathrm{LM}_{\theta^{*}}(s, \mathrm{His}_r) \text{ is compliant}\right) \geq 1 - \varepsilon. \tag{11}$$

Proof. Assume that the commentary generation process evolves over discrete time steps $t$ such that the generated commentary $c_t$ only depends on the previous commentary $c_{t-1}$. Thus, the process forms a Markov chain with the state space being the set of all compliant commentaries. Let the transition probabilities be given by

$$P(c' \mid c) = P(c_{t+1} = c' \mid c_t = c). \tag{12}$$

Assuming the chain is irreducible and aperiodic, there exists a unique stationary distribution $\pi(c)$ that satisfies

$$\pi(c') = \sum_{c} \pi(c) P(c' \mid c). \tag{13}$$

This stationary distribution represents the long-term behavior of the chain. Then, the Perron–Frobenius theorem applied to the transition probability matrix $P$ (a non-negative matrix) guarantees that: (1) the spectral radius of $P$ is 1; (2) there is a unique positive eigenvector (up to scaling) corresponding to the eigenvalue 1.

Assume that the training of the neural network $\mathrm{LM}_{\theta^{*}}$ is designed so that the target (or true) process overwhelmingly produces compliant commentaries. In other words, the intended stationary distribution $\pi(c)$ for a compliant commentary $c$ satisfies

$$\pi(c) \geq 1 - \varepsilon. \tag{14}$$

This is a consequence of the training objective and approximation guarantees (e.g., from the universal approximation theorem), ensuring that the probability of compliance asymptotically is at least $1 - \varepsilon$.

Since the generation process is implemented via the neural network $\mathrm{LM}_{\theta^{*}}$, we have by the convergence property of the Markov chain:

$$\lim_{t \to \infty} P\left(\mathrm{LM}_{\theta^{*}}(s, \mathrm{His}_r) \text{ produces } c\right) = \pi(c). \tag{15}$$

Thus, when the system has run for a sufficiently long time, the probability that $\mathrm{LM}_{\theta^{*}}(s, \mathrm{His}_r)$ produces a compliant commentary is at least $1 - \varepsilon$:

$$P\left(\mathrm{LM}_{\theta^{*}}(s, \mathrm{His}_r) \text{ is compliant}\right) \geq 1 - \varepsilon. \tag{16}$$
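The stationary-distribution condition used in the proof of Theorem 3, $\pi(c') = \sum_{c} \pi(c) P(c' \mid c)$, can be checked numerically on a small chain by repeatedly applying the transition matrix. The sketch below is purely illustrative: the two-state abstraction ("compliant" vs. "non-compliant") and the transition probabilities are invented for the example and do not come from the paper's experiments.

```python
from typing import List

Matrix = List[List[float]]


def step(pi: List[float], P: Matrix) -> List[float]:
    """One application of pi'[j] = sum_i pi[i] * P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary_distribution(P: Matrix, tol: float = 1e-12, max_iter: int = 10_000) -> List[float]:
    """Power iteration from the uniform distribution until successive iterates agree."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = step(pi, P)
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Hypothetical chain: state 0 = compliant output, state 1 = non-compliant output.
P_TOY: Matrix = [
    [0.99, 0.01],  # a compliant output is usually followed by a compliant one
    [0.90, 0.10],  # a non-compliant output is usually corrected at the next step
]
pi_toy = stationary_distribution(P_TOY)
```

For this toy matrix the balance condition $\pi_0 \cdot 0.01 = \pi_1 \cdot 0.90$ gives $\pi_0 = 90/91 \approx 0.989$, so the long-run probability of the compliant state exceeds $1 - \varepsilon$ for any $\varepsilon \geq 1/91$. All entries of the matrix are positive, so the chain is irreducible and aperiodic and the iteration converges.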

3.2. ToM-Based Strategy Analyzer

In commentary for the card game Guandan, the psychological tactics and strategic interactions within the game are difficult to assess accurately from a single player’s perspective. Therefore, we adopt the ToM [4,8,23] approach to enhance the depth and strategy of the commentary. The commentary model processes the incomplete information of the game and infers the psychological states of players, thereby providing accurate analysis and predictions in a complex multi-player interaction environment. For further information, refer to Algorithm 1.

First-order ToM: We utilize first-order ToM for basic strategy analysis. By analyzing players’ past actions and the current situation, we infer the possible hand types each player may hold and analyze their potential strategies. This information is utilized to construct commentary content, explaining players’ actions and potential counter-strategies by opponents. For example, if a player consistently chooses to play certain cards, we can infer that they might hold strong cards and possibly intend to suppress their opponents.
Second-order ToM: We introduce second-order ToM for a deeper level of strategy analysis. At this stage, we not only consider players’ strategies and actions but also predict opponents’ cognition and reactions to these strategies. Through this approach, we can interpret the game progression and players’ strategic tendencies more comprehensively. For instance, if a player adopts a relatively adventurous strategy in a certain situation, we may speculate that they believe their opponents would not anticipate this move, deliberately selecting this strategy.

Algorithm 1 Theory of Mind (ToM) Inference Step

Input:
    obs (current game observations)
    history (game history up to the current point)
    player_id (ID of the player whose strategy is being inferred)
Output:
    commentary (generated commentary based on ToM inference)
procedure tom_inference_step (obs, history, player_id)
    first_order ← LLM.generate (“What is player player_id’s likely strategy?”, context = obs + history)
    second_order ← LLM.generate (“What does player player_id think opponents expect them to do?”, context = obs + history)
    commentary ← integrate_to_commentary (first_order, second_order)
    return commentary
end procedure
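Algorithm 1 translates almost line for line into Python. The version below is a sketch under stated assumptions: the `generate` callable is a hypothetical interface standing in for `LLM.generate`, `fake_generate` is a stub so the example runs without a model, and the body of `integrate_to_commentary` is a placeholder, since the paper does not specify how the two inferences are merged.

```python
from typing import Callable

# Hypothetical interface: (question, context) -> model answer.
Generate = Callable[[str, str], str]


def integrate_to_commentary(first_order: str, second_order: str) -> str:
    # Placeholder merge; the actual integration step is not specified in the text.
    return f"Strategy: {first_order} Expectation: {second_order}"


def tom_inference_step(obs: str, history: str, player_id: int, generate: Generate) -> str:
    """Run first- and second-order ToM prompts over the same context and merge them."""
    context = f"{obs}\n{history}"
    first_order = generate(
        f"What is player {player_id}'s likely strategy?", context
    )
    second_order = generate(
        f"What does player {player_id} think opponents expect them to do?", context
    )
    return integrate_to_commentary(first_order, second_order)


def fake_generate(question: str, context: str) -> str:
    # Stub used only for illustration; it echoes the question it was asked.
    return f"[answer to: {question}]"


demo = tom_inference_step("Player 2 holds 5 cards.", "Player 2 passed twice.", 2, fake_generate)
```

Passing the model call in as a parameter keeps the step testable and makes it straightforward to swap between the commercial and open-source models compared in Section 4.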

To ensure the logical coherence and fluency of the commentary content, after stepwise reasoning on strategies, we introduce a Commentary Coordinator to review the output of the game commentary. The Commentary Coordinator is responsible for integrating various commentary parts, ensuring the content is both accurate and easy to understand, and seamlessly blending with the game progression. Through this method, we can provide audiences with deeper game commentary, allowing them to better understand the strategic competition and psychological tactics among players.

3.3. Style Retrieval and Extraction

We introduce a Style Retrieval and Extraction module specifically designed for the Guandan game to enhance the accuracy and relevance of information in the game commentary system. The module is divided into two main stages: data retrieval and information filtering.

Data Retrieval: In the data retrieval phase of Guandan game commentary, we adopt a tree structure retrieval method to efficiently extract information most relevant to user queries from a corpus specifically designed for the Guandan game. In this process, each document is broken down into individual document nodes, each containing the content of the original document and a unique identifier, to facilitate subsequent vector indexing. The content of these document nodes is then converted into vector form and indexed using a vector space model.

During the query execution phase, the system first converts the user’s query into vector form so that it can be compared with the vectorized document nodes. Then, the system searches the constructed vector index and returns only document nodes whose similarity exceeds a preset threshold. Our threshold is set to 2, meaning that the system retains only nodes with a similarity greater than this value, ensuring that only the most relevant information is extracted.
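The retrieve-then-threshold step can be sketched as follows. This is a simplified illustration rather than the system described above: it uses a flat list of nodes instead of a tree index, bag-of-words counts instead of learned embeddings, and cosine similarity as the score. The names `DocumentNode`, `embed`, and `retrieve` are hypothetical, and the 0.5 threshold is an illustrative value on cosine similarity's scale, not the paper's setting.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DocumentNode:
    node_id: str  # unique identifier used for indexing
    text: str     # content of the original document chunk


def embed(text: str) -> Counter:
    # Toy vectorization: whitespace tokens and raw counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, nodes: List[DocumentNode], threshold: float) -> List[Tuple[float, DocumentNode]]:
    """Return (score, node) pairs whose similarity exceeds the threshold, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(node.text)), node) for node in nodes]
    hits = [(score, node) for score, node in scored if score > threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


corpus_nodes = [
    DocumentNode("n1", "bomb beats straight"),
    DocumentNode("n2", "weather is nice"),
]
hits = retrieve("bomb beats pair", corpus_nodes, threshold=0.5)
```

A production index would embed the nodes once and store the vectors; recomputing them per query, as done here, is only acceptable for a sketch of this size.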

Information Filtering: In the information filtering stage, the model meticulously filters the retrieved data. The system examines the relevance of each data item, retaining only those that meet high standards for subsequent processing. The content outputted by the Style Retrieval and Extraction module will provide curated, relevant data to the Commentary Coordinator to support the analysis and explanation of game situations and player strategies during the commentary process.

4. Experiments and Result Analysis

4.1. Implementation Details

To study the performance of an LLM-based agent in Guandan commentary under conditions of incomplete information, lack of communication, and dynamic collaboration, we choose Guandan as the experimental environment. In our tests, we use Danzero+ [7] as the game agent. Danzero+ employs a distributed framework to train the reinforcement learning model using the deep Monte Carlo method, and it has been shown to perform at a human level. All experiments are conducted in a Chinese environment. We test both open-source and closed-source language models, including OpenAI’s commercial models GPT-4 and GPT-3.5 [24] and Chinese-language open-source models such as Qwen-32B [25], Yi-34B [26], and ChatGLM-4 [27].

In the data processing stage, we first remove non-text elements and stopwords from the texts to reduce noise. Then, we normalize the text, including unifying its case to eliminate differences caused by case inconsistency. We also apply part-of-speech tagging and tokenization techniques. Finally, we use the Porter Stemmer algorithm to simplify word inflections [28].

4.2. Dataset

Our dataset includes professionally validated commentary texts covering 250 game sessions (5832 segments) across eight Guandan variants.

Professional commentary: Transcribed from match videos with two-stage validation:

Technical accuracy (≥90% compliance) verified by game experts (2+ years experience)
Strategic relevance scored by professionals (avg. 4.5/5)

These texts demonstrate high-level strategic insights and culturally authentic expression.

Generated commentary: Produced by the commentary model based on actual match data, simulating the style and content of professional commentary.

4.3. Metrics

To comprehensively evaluate the quality and applicability of the commentary texts, we use the following metrics:

Cosine Similarity: We employ the TF-IDF vectorization method [29] to convert processed texts into vector form and calculate the cosine similarity between professional and test texts. This measures the semantic closeness of professional and generated texts, reflecting the model’s capability to capture semantics.
Sentiment Analysis: Through sentiment analysis [30] tools, we assign sentiment polarity scores to the texts to compare the emotional expression differences between the two types of texts.
Lexical Diversity: We use the Type-Token Ratio (TTR) [31] to assess the lexical diversity of the texts. This metric measures the proportion of different words in the text relative to the total number of words, reflecting the richness and diversity of the language.
SnowNLP: The SnowNLP [32] score is a sentiment score ranging from 0 (most negative) to 1 (most positive), computed using a Chinese text sentiment analysis library based on Naive Bayes.
Human Evaluation: We also conduct human evaluations, including match consistency and fluency. These evaluations are completed by human reviewers to verify whether the commentary texts accurately reflect the actual card game situations and assess the readability and naturalness of the texts. Reviewers need to have a certain knowledge and experience of Guandan to accurately judge the accuracy of the text descriptions.

Match Consistency
- Key Event Identification (KEI): Assess whether the commentary texts capture key events in the match, such as major turnarounds or critical decision points [33].
- Detail Accuracy: Check if the text descriptions of card types, scores, and player strategies are precise and correct.
Fluency
- Naturalness: Assess if the text language is smooth and free of grammatical errors or unnatural expressions.
- Information Organization: Evaluate if the text’s information is properly organized and can be understood in a logical order throughout the development of the match.
- Logical Coherence: Check if the narrative of events in the text is coherent and free from logical jumps or contradictions.
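Two of the automatic metrics listed above, TF-IDF cosine similarity and the Type-Token Ratio, are simple enough to sketch directly. The code below is an illustration under assumptions: it expects texts that are already tokenized (Chinese text would first need word segmentation), and it uses a smoothed IDF weighting chosen for the example, since the exact TF-IDF variant is not stated.

```python
import math
from collections import Counter
from typing import Dict, List


def type_token_ratio(tokens: List[str]) -> float:
    """Distinct tokens divided by total tokens; 0.0 for an empty text."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Term frequency times a smoothed inverse document frequency (assumed variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1.0 for term in df}
    return [{term: count * idf[term] for term, count in Counter(doc).items()} for doc in docs]


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical tokenized texts: professional, generated (identical), and unrelated.
docs = [
    ["bomb", "beats", "straight"],
    ["bomb", "beats", "straight"],
    ["partner", "passes"],
]
vectors = tfidf_vectors(docs)
same = cosine_similarity(vectors[0], vectors[1])
different = cosine_similarity(vectors[0], vectors[2])
ttr = type_token_ratio(["a", "b", "a", "c"])
```

Because TF-IDF cosine similarity only rewards shared vocabulary, two commentaries that describe the same play in different words can score close to zero, which is worth keeping in mind when reading the similarity results.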

4.4. Results

In this experiment, we conduct a comparative analysis of the data presented in Table 1, highlighting the substantial benefits gained from utilizing the Retrieval-Augmented Generation (RAG) framework for commentary. The integration of techniques such as card memory and the retrieval mechanism in RAG results in a more authentic and professionally styled commentary, offering key improvements over standard models. Even with its larger parameter count, GPT-4 without retrieval underperforms compared to RAG-enabled models. The retrieval component enables commentators to provide more accurate analyses and predictions, resulting in deeper insights into match progress and a stronger emotional connection to the content.

For performance metrics: Accuracy refers to the correctness of the generated commentary in describing game elements (e.g., card types, scores, and player strategies), measured by our Detail Accuracy metric; Match Consistency evaluates logical progression across game states and alignment with previous moves/decisions, quantified through Key Event Identification (KEI) and Detail Accuracy metrics; Fluency assesses linguistic presentation quality, including naturalness, information organization, and narrative coherence. These metrics are deliberately separated because they measure fundamentally distinct aspects: Match Consistency focuses on factual/logical correctness in the game context, while Fluency evaluates linguistic presentation quality. This distinction follows established evaluation frameworks for game commentary systems [5,33].

When applied to card game commentary, such as Guandan, RAG utilizes Chain-of-Thought (CoT) retrieval, which further enhances the model’s ability to emulate the linguistic norms of expert commentators. Models equipped with retrieval features outperform large models without retrieval or those fine-tuned solely on commentary data.

All evaluated models exhibit neutrality in sentiment analysis. GPT-3.5, for instance, shows relatively low lexical diversity (0.09), whereas Qwen equipped with RAG and GPT-4 demonstrate higher diversity in the generated text. The near-perfect SnowNLP score (0.99) for the Yi-34B model does not indicate a clear advantage or reliable ability to mimic the commentary. Additionally, all models struggle with cosine similarity when compared to the original text, likely due to the improvisational nature of live commentary. Among them, Qwen with RAG exhibits a slight advantage in aligning with the source content.
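The two text-level metrics discussed above can be illustrated with a minimal sketch. The exact formulas and tokenizer behind the reported scores are not specified here, so this sketch assumes lexical diversity is a type-token ratio and cosine similarity is computed over term-count vectors. The function names and the whitespace tokenization are illustrative assumptions; Chinese commentary would need a word segmenter first.

```python
import math
from collections import Counter


def lexical_diversity(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences, tokenized on whitespace (an assumption).
reference = "player leads with a pair of kings".split()
generated = "player leads with a pair of aces".split()

print(lexical_diversity(generated))
print(cosine_similarity(reference, generated))
```

Under these assumptions, a text that repeats the same few words scores low on diversity, and a text that shares no vocabulary with the reference scores 0.0 on similarity, which is consistent with how improvised live commentary can diverge from a model's output.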

The theoretical guarantees are empirically validated using the following methods: The Compliance Verification method involves human evaluation of detail accuracy (measuring accuracy), which scores 0.97 (as shown in Table 2), confirming the $\varepsilon$-bound compliance. The Vector Convergence method shows that a cosine similarity of 0.7955 (compared to 0.038 for GPT-4) demonstrates the preservation of compact encoding $v_{t}$.

Human Evaluation Analysis. We recruit 20 human annotators with Guandan experience for scoring. As shown in Table 2, in terms of match consistency, our model significantly outperforms other models, particularly excelling in Key Event Identification (KEI) [34]. This demonstrates that our model accurately captures crucial moments in the game, effectively reflecting the game’s turning points and climactic sections. In Detail Accuracy, our model also performs exceptionally well, accurately describing game elements such as card types, scores, and strategic actions. Regarding fluency, our framework scores high in naturalness (0.95), information organization (0.89), and logical coherence (4.34), highlighting the model’s ability to generate commentary text that is grammatically correct, logically structured, and contextually coherent.

In contrast, other models like GPT-4, despite performing well in general language generation tasks, often fail to show the same acuity in specialized game commentary scenarios. These models are often not sensitive enough to key events, sometimes missing significant turning points in the game or failing to fully utilize game-specific terminology and expressions. Additionally, in terms of fluency, although they can generate structurally sound sentences, they sometimes lack the ability to present information in a logical and captivating manner.

Case Study. As shown in Figure 2, our method not only focuses on the types of cards played but also delves into detailed analysis of the playing patterns and potential strategies of the players and their opponents. It provides recommendations based on the current and predicted game state, advising when to play high or low cards. This includes discussions on when to play single cards or pairs to maximize gameplay advantage, helping players not only understand the game rules but also master advanced strategies to enhance their gameplay level. In contrast, the Yi-34B model’s commentary seems to focus more on the sequence of cards played without deeply exploring the reasoning or strategic implications behind these choices. It lists actions such as leading with certain cards or choosing to pass, which, while informative, do not delve into the strategic significance of these decisions. See Appendix B for more output examples.

4.5. Ablation Studies

Ablation Studies on RAG and ToM. Table 3 shows that removing RAG (e.g., “Our(w/o RAG)(Vanilla)”) results in a cosine similarity of 0.0 with the original text, despite producing the maximum SnowNLP score of 1.0 and a reasonably high lexical diversity (0.87 or 1.0). This outcome implies that, without retrieval, the generated text deviates significantly from the source content, even though it may appear stylistically diverse. In contrast, when RAG is introduced (“Our(w RAG)(1st-ToM)” or “Our(w RAG)(2nd-ToM)”), the cosine similarity rises markedly (up to 0.7955), indicating stronger alignment with the original text. However, this improvement is accompanied by a slight decrease in both lexical diversity and SnowNLP scores, suggesting that retrieval imposes some constraints on free-form generation. Overall, these results underscore the importance of RAG for achieving higher semantic fidelity in commentary.

Ablation Result Analysis. The RAG model with the retrieval component displays significant disparities from the model lacking it across various dimensions. In terms of lexical diversity, the model incorporating retrieval exhibits somewhat constrained diversity, suggesting a potential limitation imposed by the retrieval component. Sentiment analysis using the SnowNLP tool reveals that the model without retrieval yields more pronounced sentiment results, diverging notably from the original text. This deviation may arise from the model’s greater freedom in generating emotional expressions, albeit resulting in a less faithful imitation of the source text. Conversely, regarding text semantic similarity, the model integrating retrieval showcases a distinct advantage, effectively highlighting the crucial role of the retrieval component in maintaining semantic coherence and bolstering text relevance.

In addition to retrieval, second-order ToM demonstrates a clear advantage over first-order ToM in semantic alignment. Without RAG, second-order ToM improves cosine similarity from 0.0126 to 0.0380, and, with RAG, from 0.7519 to 0.7955. These gains indicate that higher-order ToM facilitates a deeper understanding of strategic interactions and player intentions, enhancing the commentary’s accuracy. In Guandan, for instance, second-order ToM captures opponents’ potential counteractions and anticipations, resulting in more precise narration of pivotal turning points. As such, second-order ToM is instrumental in producing in-depth and contextually grounded game commentary.
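The size of these gains can be checked with simple arithmetic. The sketch below uses only the four cosine similarity values quoted above; the dictionary layout and the helper name `tom_gain` are illustrative assumptions.

```python
# Cosine similarity (first-order ToM, second-order ToM), as quoted in the text.
scores = {
    "without RAG": (0.0126, 0.0380),
    "with RAG": (0.7519, 0.7955),
}


def tom_gain(first_order, second_order):
    """Return the absolute and relative gain of second-order over first-order ToM."""
    absolute = second_order - first_order
    relative = absolute / first_order
    return absolute, relative


for setting, (first, second) in scores.items():
    absolute, relative = tom_gain(first, second)
    print(f"{setting}: +{absolute:.4f} absolute, +{relative:.1%} relative")
```

The absolute gain is larger with RAG (0.0436 versus 0.0254), while the relative gain is much larger without RAG because the first-order baseline there is close to zero. In absolute terms, both ToM gains are far smaller than the gap between the RAG and non-RAG settings.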

5. Conclusions

In this study, we introduce a novel commentary framework that combines reinforcement learning and large language models to generate detailed, context-relevant commentary for the complex, incomplete information card game Guandan. Our modular approach, consisting of a State Commentary Guider, ToM Strategy Analyzer, and Style Retrieval module, enables the LLM to deliver insightful game commentary without the need for specialized training. Experimental results demonstrate that our commentary framework significantly enhances the performance of open-source LLMs, outperforming GPT-4 on multiple evaluation metrics. In the future, we aim to extend our commentary framework to other complex games and explore the integration of additional modalities, such as audio and video data, to further enrich the commentary generation process.

6. Code Availability

Our code is available at https://github.com/heimy2000/guandan, accessed on 1 July 2025.

Author Contributions

Conceptualization, J.S. and M.T.; methodology, X.L.; software, Y.H.; resources, Y.T.; writing—original draft preparation, J.S., M.T., X.L. and M.Z.; writing—review and editing, J.S., M.T., X.L. and M.Z.; supervision, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Guandan Rules

Appendix A.1. Basic Rules

English

1. Players are divided into two teams, with teammates sitting opposite each other. The game uses two decks of playing cards.

2. A complete game of Guandan consists of several rounds, each with its own trump card. The trump card is the highest card below the jokers. For example, if the trump card is 2, the order of cards from lowest to highest is 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K, A, 2. If the trump card is 5, the order is 2, 3, 4, 6, 7, 8, 9, 10, J, Q, K, A, 5.

3. Winning a round: The first team whose member finishes their cards wins the round, but the round does not end until the order of finishing is determined, meaning three players must finish their cards.

4. Trump card calculation rule 1: The initial trump card value is 2, which then increases based on the win/loss outcomes of the teams.

5. Trump card calculation rule 2: Each team has its own trump card, starting from 2.

6. Trump card calculation rule 3 (upgrades of 1–3 levels): If you are the first player to finish your hand in a round, your team wins the round, and your team’s trump card is upgraded (increased in value). The specific upgrade depends on your teammate’s finishing position. Best victory: if your teammate finishes second, upgrade by 3 levels (e.g., from 2 to 5); if your teammate finishes third, upgrade by 2 levels; if your teammate finishes fourth, upgrade by 1 level.

7. Winning the whole game of Guandan: A team that upgrades their trump card to A and achieves the best victory in that round wins the entire game of Guandan. The game only concludes completely then. If the team has a trump card of A, that round becomes a critical challenge, and the team has three chances (which can be non-consecutive rounds) to achieve the best victory. If they fail in these three rounds, the trump card resets to 2.

8. Lead team rotation: The lead team for a round is determined by whose trump card is in play. If that team wins the round (not necessarily the best victory), they retain the lead team status, and their trump card is used in the next round. Otherwise, if the opposing team wins, they become the lead team, and their trump card is used in the next round. Regardless of which team wins the round, the winning team’s trump card is upgraded according to rule 3 and applied in the next round.

9. Tribute (card handover): After the first two players have finished their cards, regardless of whether they are teammates or not, the two players who finished their cards later must surrender the highest card. If a player who needs to surrender cards has two Big Jokers, they can resist surrendering. Situation one: in the previous round, the top two players from the same team finish first, and each player from the opposing team contributes their highest card. Situation two: for other situations, follow the instructions in blue.

10. Tribute rule supplement 1: Dual jokers can refuse tribute (jokers need not be held by the same person in scenario one).

11. Tribute rule supplement 2: The first finisher can choose one of two cards offered in tribute. Whichever card they choose, that player starts the next round.

12. Tribute rule supplement 3: The player receiving tribute must return a card to the tribute giver, which can be semi-freely chosen (only a card valued 10 or lower; if all cards are above 10, the smallest must be returned).

13. Tribute rule supplement 4: The tribute must include the highest card, excluding jokers and the trump card, unless the trump card is a heart.

14. Wild card: The heart suit of the trump card (two cards) can act as any card except the jokers. For example, if the trump card is 6, then the heart 6 can be used as any other card.
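Rules 2 and 6 above can be sketched in a few lines of code. This is a minimal illustration, not the authors’ implementation: the function names are hypothetical, jokers and wild cards are ignored, and capping the trump level at A is an assumption made here for simplicity (rule 7 describes what actually happens at level A).

```python
# Natural rank order without jokers, lowest to highest.
BASE_ORDER = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]


def rank_order(trump):
    """Ranks from lowest to highest, with the trump rank moved to the top (rule 2)."""
    if trump not in BASE_ORDER:
        raise ValueError(f"unknown rank: {trump}")
    return [rank for rank in BASE_ORDER if rank != trump] + [trump]


def upgrade_levels(teammate_position):
    """Levels gained by the winning team, given the teammate's finishing position (rule 6)."""
    levels = {2: 3, 3: 2, 4: 1}
    if teammate_position not in levels:
        raise ValueError("teammate must finish second, third, or fourth")
    return levels[teammate_position]


def next_trump(current, teammate_position):
    """Trump rank after an upgrade; capped at A (an assumption of this sketch)."""
    index = BASE_ORDER.index(current) + upgrade_levels(teammate_position)
    return BASE_ORDER[min(index, len(BASE_ORDER) - 1)]


print(rank_order("5"))     # trump 5 sits above A
print(next_trump("2", 2))  # best victory: teammate finishes second
```

With a trump of 5, the order places 5 after A, matching the example in rule 2, and a best victory from level 2 moves the team to level 5, matching the example in rule 6.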

Appendix A.2. Additional Rules

English

1. Ranking order: Four jokers are the highest, followed by any five-of-a-kind bomb, then straight flush, then bombs in descending order of card number (five-of-a-kind, four-of-a-kind), then the seven basic card types (single, pair, triplet, full house, straight, three consecutive pairs, two consecutive triplets).

2. Only a full house (three cards of one rank and two cards of another rank) is allowed, not three cards with one, nor four cards with two.

3. For bombs, the ranking is based on the number of cards, not the rank of the cards. For example, 3333 < 22222.

4. The Ace (A) can be the lowest in a straight A2345 or the highest in a straight 10JQKA.

5. Four players form two teams, sitting diagonally. The first player to finish is the top winner, followed by the second, third, and bottom. For example, if you finish first, you are the top winner, and your partner is the second winner, which is called a double top. Your opponents are the third and bottom winners, called double bottom. Double top (first winner and second winner) gets a three-level promotion. First winner and third winner get a two-level promotion, and first winner and bottom winner get a one-level promotion.

6. The player in the top position gets the highest card and gives tribute. In the next game, the player with the highest tribute starts.

7. Declaring cards: When a player has 10 or fewer cards, they must declare their card count once, specifying the exact number.

8. Shuffling, cutting, and drawing cards: In the first game: The player in the east shuffles, the player in the south cuts the deck, and a card is flipped (it must not be a joker or the heart 2). The flip starts from the south, and the person who draws the flipped card starts the game. In the second game: The upper stream’s upstream player shuffles, the upper stream cuts, and the downstream player draws first. In a double bottom scenario, the upper stream’s downstream player draws first. The drawing is counterclockwise.
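Rules 3 and 4 above lend themselves to a short sketch. The code below is illustrative only: the helper names are hypothetical, wild cards and suits are ignored, a straight is assumed to contain exactly five cards, and breaking ties between bombs of equal size by rank is an assumption that the rules above do not state.

```python
# Rank order when the trump card is 2 (lowest to highest), per Appendix A.1, rule 2.
ORDER_TRUMP_2 = ["3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A", "2"]

# Natural sequence for straights, with the Ace usable at either end (rule 4).
STRAIGHT_CYCLE = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]


def bomb_key(cards, order):
    """Sort key for a bomb: card count first (rule 3), then rank as an assumed tie-break."""
    if len(cards) < 4 or len(set(cards)) != 1:
        raise ValueError("a bomb needs at least four cards of the same rank")
    return (len(cards), order.index(cards[0]))


def is_straight(ranks):
    """True if the five ranks are consecutive, with A allowed low (A2345) or high (10JQKA)."""
    if len(ranks) != 5 or len(set(ranks)) != 5:
        return False
    windows = [set(STRAIGHT_CYCLE[i:i + 5]) for i in range(len(STRAIGHT_CYCLE) - 4)]
    return set(ranks) in windows


# Five 2s beat four 3s because the bomb has more cards.
print(bomb_key(["3"] * 4, ORDER_TRUMP_2) < bomb_key(["2"] * 5, ORDER_TRUMP_2))
print(is_straight(["A", "2", "3", "4", "5"]), is_straight(["J", "Q", "K", "A", "2"]))
```

Comparing tuples keeps the card count as the dominant criterion, so a larger bomb always wins regardless of rank, which is the behavior rule 3 describes. A sequence such as JQKA2 is rejected because the Ace does not wrap around.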

Appendix B. Sample Commentary for Output

Appendix B.1. GPT-4

English

1. Lead with the nine of hearts; the opening lead of the nine of hearts is a probing move, aimed at observing opponents’ reactions to lower-ranking cards, thus inferring the potential distribution of high-ranking cards and trump cards.

2. Follow with the queen of clubs; the player chooses to follow suit with the queen of clubs, indicating a sequence of play and suggesting a coherent hand without the ten or jack of clubs.

3. Follow with the king of clubs; continuing to play in sequence indicates that the player’s clubs are relatively orderly, without major gaps.

4. Pass; the player passing indicates a possible lack of high-ranking cards such as aces or threes, implying a weakness in their hand.

5. Follow with the ace of clubs; after assessing the strength of the board, choosing to follow suit with the ace of clubs indicates the player’s attempt to control the tempo of the game.

6. Follow with the Small Joker; using the Small Joker to follow suit is a strong suppressing move, aimed at testing whether the opponent holds the Big Joker.

7. Follow with the Big Joker; by playing the Big Joker, the player clearly expresses a strong intent to seize control of the board, while also possibly intending to block certain combinations of cards by opponents.

8. Lead with the four of hearts; by playing a low single card, the player may be attempting to clear out scattered cards from their hand, while also possibly hinting at holding higher-ranking cards or trumps.

9. Follow with the ten of clubs; this sequence indicates that the player’s clubs are relatively orderly, and they are now attempting to build a larger combination.

10. Follow with the jack of diamonds; continuing to play smoothly suggests that the player is likely maintaining a balanced hand to facilitate better play in subsequent moves.

11. Follow with the queen of spades; this sequence of play continues to demonstrate the coherence of the player’s hand.

12. Follow with the king of hearts; by playing the king of hearts, the player may be attempting to clear a particular suit, reducing the complexity of their hand.

13. Lead with the four of diamonds; playing a low card at this point may be in preparation to clear smaller cards from the hand, maintaining an advantage in high-ranking cards.

14. Follow with the nine of diamonds; continuing to play in sequence emphasizes the coherence of the hand and reduces gaps in the board.

15. Follow with the three of diamonds; by playing the three of diamonds, the player further confirms the presence of higher-ranking cards and maintains control through low cards.

16. Lead with the six of spades; by playing the six of spades, the player attempts to control the board and reduce scattered cards in their hand, maintaining a strategic advantage.

17. Follow with the king of spades; by following suit with the king of spades, the player indicates that the previous player may not have larger single cards, continuing to maintain control of the board.

18. Follow with the ace of spades; by following suit with the ace of spades, the player may be attempting to further seize control of the board and take the initiative in the game.

19. Lead with three nines; by playing three nines, the player may be attempting to deduce the opponents’ hand and suppress them through combinations of card types.

20. Follow with three queens; continuing to play large cards may be an attempt to build a larger combination, while also trying to suppress opponents.

Appendix B.2. GPT-3.5

English

1. Lead with the nine of hearts: The player starts with the nine of hearts, which is a relatively weak hand, usually used to probe opponents’ hands and determine the presence of the Big Joker.

2. Follow with the queen of clubs: Subsequently, the player follows with the queen of clubs, indicating they do not have singles of 10 or jack of clubs.

3. Follow with the king of clubs: The player continues to follow with the king of clubs, maintaining the normal sequence of play.

4. Pass: The player passes, indicating they may not have aces or threes, and the presence of the Big Joker is still uncertain.

5. Follow with the ace of clubs: The player follows with the ace of clubs, further confirming the presence of the ace of clubs in their hand.

6. Press: The player presses, indicating they do not have single-ranking cards.

7. Lead with the four of diamonds: The player leads with the four of diamonds, guessing they still have level cards in their hand, possibly more than one.

8. Follow with the nine of diamonds: The player follows with the nine of diamonds, indicating they may not have single cards from 5 to 8.

9. Pass: The player passes, further confirming the absence of single cards.

10. Follow with the three of diamonds: The player follows with the three of diamonds, further confirming the presence of level cards in their hand, possibly more than one.

11. Lead with the six of spades: The player leads with the six of spades, guessing they still have level cards in their hand, aiming to control the board and reduce the number of cards in their hand.

12. Follow with the king of spades: The player follows with the king of spades, indicating they may not have single cards, as the highest single card, the king, has already been played.

13. Play four jokers with the declaration of ten cards: The player plays four jokers with the declaration of ten cards, indicating they still have multiple level cards in their hand.

14. Follow with four jokers: Subsequently, the player follows with four jokers, indicating they are also competing for control of the board.

15. Play three sixes with a pair of fours: The player plays three sixes with a pair of fours, indicating they still have two jokers and two other sets of cards.

16. Follow with three sevens with a pair of twos: The player follows with three sevens with a pair of twos, forming a suppression.

17. Play four twos: The player plays four twos, attempting to charge forward.

18. Follow with five fives: Subsequently, the player follows with five fives, indicating they are vying for control.

19. Lead with a pair of fours: The player leads with a pair of fours, guessing they are giving cards to the previous player.

20. Follow with a pair of eights, declaring eight cards: The player follows with a pair of eights, declaring eight cards.

Appendix B.3. Qwen-32B

English

1. Firstly, playing the 9 of hearts is a common opening strategy to test the distribution of the opponents’ level cards and kings.

2. Opponent calls the queen of clubs, which is a normal straight action, and could mean that he does not have the 10 and the jack singleton, or that they have been used as part of a bomb.

3. Following with the king of clubs also fits the sequence logic, indicating that the 10 and Jack of clubs are either retained or no longer in their hand.

4. Passing the turn, possibly due to having no better card in hand or wanting to observe the situation.

5. Passing again, as the teammate’s cards do not significantly impact the current decision-making, hence no need to focus on them.

6. Following with the ace of clubs means either there is an A-K pair or just a single ace, also attempting to take control of the hand.

7. Continuing to pass, possibly because there are no single level cards, or the single level cards are already part of other hands.

8. Still passing, as the teammate’s hand does not affect one’s own strategy.

9. Following with the Small Joker might be because the player holds the Big Joker, aiming to control the situation.

10. Passing, with no indication of the Big Joker.

11. Following with the Big Joker, clearly to suppress the opponents and vie for control of the hand.

12. Continuing to pass, because it is believed there are no better opportunities to play.

13. Leading with the 4 of hearts, because there are level cards or a king in hand, attempting to take the initiative.

14. Following with the 10 of clubs indicates there might be a 10–K pair in hand, or all single cards below 10 have been played.

15. Passing, possibly because there are no single cards bigger than the 10 of clubs.

16. Playing the Jack of diamonds follows the rule of sequence.

17. Playing the queen of spades also maintains the sequential rhythm.

18. Following with the king of hearts indicates there might be a K–Q pair or higher single cards in hand.

19. Continuing to pass, because it is thought there are no better opportunities to play.

20. Playing the Big Joker is an obvious suppression move, attempting to control the hand.

Appendix B.4. Yi-34B

English

1. Player A leads with the 9 of Hearts, typically a probing play to gauge whether opponents hold high cards or specific combinations like sequences or pairs.

2. Player B follows with the Queen of Clubs, suggesting B might have a sequence or is testing A’s hand.

3. Player C plays the King of Clubs, possibly to maintain a sequence or guess the hands of A and B.

4. Player D passes, likely indicating no suitable cards to play or waiting for a better opportunity.

5. Player A passes again, possibly to keep high cards in hand or waiting for the right moment.

6. Player B plays the Ace of Clubs, possibly to maintain a sequence or demonstrate strength.

7. Player C passes again, perhaps to conserve strong cards or waiting for an opportune moment.

8. Player D passes again, likely to hold onto high cards or wait for a better time.

9. Player A plays the Small Joker, possibly to maintain a sequence or show strength.

10. Player B passes again, possibly conserving strong cards or waiting for the right moment.

11. Player C passes again, likely keeping strong cards or waiting for an opportune time.

12. Player D plays the Big Joker, possibly to maintain a sequence or demonstrate strength.

13. Player A overcalls, indicating confidence that their hand is stronger than the others’.

14. Player B passes again, possibly to hold onto strong cards or wait for the right moment.

15. Player C passes again, likely keeping strong cards or waiting for an opportune time.

16. Player D overcalls, showing confidence that their hand is stronger than the others’.

17. Player A passes again, possibly to keep strong cards in hand or waiting for the right moment.

18. Player B passes again, possibly conserving strong cards or waiting for the right moment.

19. Player C passes again, likely keeping strong cards or waiting for an opportune time.

20. Player D leads with the 4 of Hearts, possibly to test if other players have a sequence or pair of Hearts.

Appendix B.5. GLM-4

English

1. Leading with the 9 of Hearts, testing the distribution of trump cards and kings.

2. Following suit with the Queen of Clubs, a normal sequence play, likely not holding singleton 10 or Jack.

3. Following with the King of Clubs, a natural sequential play.

4. Passing, possible absence of the Ace and 3, high likelihood of not having a singleton Ace, kings are uncertain.

5. Passing, unnecessary for teammate to take action.

6. Following with the Ace of Clubs, guessing no singleton trump cards, trumps may form other combinations or no trumps at all, kings remain uncertain.

7. Passing, unnecessary for teammate to take action.

8. Following with the Small Joker, guessing no Big Joker.

9. Playing the Big Joker, asserting, seizing control.

10. Leading with the 4 of Hearts, guessing holding either a singleton trump or kings.

11. Following with the 10 of Clubs, guessing no cards larger than 10, uncertain if there is a singleton trump or king.

12. Following with the Jack of Diamonds, a normal sequential play.

13. Following with the Queen of Spades, a natural sequence play.

14. Following with the King of Hearts, a normal sequence play.

15. Passing, no need for teammate to take action.

16. Over-calling, no other choice. It can be inferred that there might not be a singleton trump card.

17. Passing, unnecessary for teammate to take action.

18. Over-calling, seizing control of the round. Currently, the highest card in play is the trump 3, confirming neither opponent nor partner have the singleton trump 3, and since I lack trumps, there is a higher likelihood that an opponent still has it.

19. Leading with the 4 of Diamonds, guessing holding more than one trump card in hand.

20. Followed by the 9 of diamonds, a normal straight, and one can guess that there is no singleton from 5 to 8.

Appendix B.6. Our-Part1

English

1. Pay attention to the card-playing patterns of the players to your right and left, especially whether they tend to play pairs or sequences. This could help you infer the possible card combinations they might have. Do not rush to play high single cards, particularly when you are unsure if your opponents hold them. Often, the last few cards are the most crucial as they might include key combinations that determine the outcome. Given the current scenario and your cards, I suggest initially testing your opponents’ hands. You might consider playing some small pairs (like 3, 4) or medium-sized singles (like the 7 of hearts). This way, even if the opponents have high cards, they might pause in front of your small pair. Then, based on their responses, gradually increase the strength and value of your cards until the final sprint.

2. Now, there are still three 2s unplayed, and our opposite player might have strong cards. If they do, we could try playing some high cards to extend our advantage. However, if they do not have strong cards, we need to be more cautious with our plays, ensuring each card maximizes our scoring potential while closely monitoring our opponents’ moves to adjust our strategy accordingly.

3. Because there are still likely 2s and Aces that can change the situation on the table. After several rounds of passing, the players found that their opponents were not eager to play high cards, suggesting a weak hand. The player cautiously chose to play a cross shape with a pair of Kings, hoping for a response from the following player. However, the next player continued to pass. After observing the previous player play three 3s with a heart, the player decided to challenge with a combination of four Queens to pressure the opponents and to gauge their probable hand types. The opposite player chose not to ’fire back,’ further confirming the earlier speculation about their weak cards.

4. You played the 10 of hearts, indicating you might have higher cards. The opponents responded with a pair of Aces, showing they were also actively aiming for victory. A 9 of spades was then played, to which you responded with a Jack of spades, indicating your strong hand. They quickly followed with a Queen of hearts, applying more pressure. You countered with a King of hearts, displaying confidence in this round. Subsequently, they played a pair consisting of the 2 of diamonds and the 3 of clubs to expand their lead, but you suppressed them with your strong Ace of spades. Unyielding, they attempted to find an opportunity with a 6 of clubs, but you remained composed, deciding not to play yet, waiting for a better moment to continue the game.

5. The previous player started with a 4 of hearts, and the next played the Ace of clubs. From my side, I played the 7, Queen, and Ace of spades, attempting to regain control, but the opposite player responded with the 10, Jack, Queen, and Ace of diamonds, showing they were strong contenders. The previous player then laid out a sequence of hearts: 2, 5, 6, 7, and King, indicating they might have a flush or straight. The next player followed with a sequence of clubs: 3, 4, 8, 9, and Jack, suggesting they might also hold strong clubs. I played the 2 and 4 of spades, trying to confuse the situation and defend with the 10 or Queen of clubs if necessary.

6. In this game, the player observed their hand containing several pairs, such as the 2 and 4 of diamonds, a pair of 5s, and two Kings, along with scattered cards like the 8 of clubs, King, various hearts, and the 4, 5, 8 of spades. In the first round, other players passed without taking the lead. In the second round of playing, the next player chose a pair of Jacks, and after considering, the player decided to follow with their single King of hearts.

7. The previous player played a pair of diamond 3s, and you chose to pass. The player opposite also passed. The next player played a single spade Ace, and you still passed. The opposite player then played a pair of heart 10s, and you chose to suppress it with a Small Joker, due to the lack of direct combating combinations in your hand. From my current hand, I need to form pairs, sequences, or flushes to counter the opponents.

Appendix B.7. Our-Part2

English

8. Currently, our hand includes a diamond 2 with an extra card, pairs of 7s and Kings, with single cards of club 8, club King, and heart 2. There is still one 2 left outside, three Aces, and some cards that might form a sequence or flush. The previous player puts out a pair of 7s which we still cannot follow, waiting for an opportunity or to disrupt the card types. Next, we need to observe the situation on the table, looking for the right opportunity to make a move.

9. In this round of Guandan, the players’ hands are highly diverse. They hold a diamond sequence from 2 to Ace, with two Kings, a pair of 7s, and a pair of Queens. The clubs also have several consecutive pairs and single cards like 8, King, 2, 5, 6 (duplicate), 7, and Queen. Hearts include 2, 3, 4, 5, 6, 7, duplicated 10s, Jack, and two Kings. Lastly, there are spades 4, duplicated 5s, 8, 9, Jack, King, and two Aces. The previous player continues to play single cards diamond 4 and diamond 9, but the player does not have the appropriate cards to follow. As the current hand cannot form effective sequences or pairs, the player opts for a passing strategy, waiting for a better chance to make a move. This series of actions shows the patience and tactical considerations in the Guandan game, not rushing for immediate gains but aiming for steady victory.

10. My hand still holds a diamond Ace, club Queen, club King, and spade Queen, spade King. I decide to play using a three with a pair approach: club Queen, club King, spade Queen, spade King, with the diamond Ace as a single card. This method in Guandan usually effectively clears the field and applies pressure on the opponents. Considering the possible counter strategies of the opponents, they might choose to follow or pass. If the opponents have the corresponding pairs or consecutive flushes, they might follow; if not, they might choose to pass, waiting for a better opportunity to play.

Figure 1. Overview of the commentary process for a Guandan game. The inputs are the game state, the game history, and the corresponding game rules and observation rules. (1) A State Commentary Guider first transforms these inputs into preliminary commentary text. (2) The ToM Strategy Analyzer receives this text and uses ToM to analyze players’ strategies and behaviors, predicting opponents’ potential psychological states and reactions. (3) A Style Retriever using CoT prompts employs a tree-based retrieval method and an information filtering system to extract statements that match a specific commentary style. (4) The Commentary Coordinator integrates all the commentary text to produce the final game commentary.
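The four stages in the Figure 1 caption can be read as a simple chain of text transformations. The following is a minimal sketch of that data flow only, not the authors' implementation: the names `GameContext`, `coordinate`, and `generate_commentary` are hypothetical, the stages are stubbed with placeholder functions where the real system would call an LLM, and passing the preliminary draft to the style retriever is an assumption, since the caption does not state what the retriever is queried with.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GameContext:
    """Inputs listed in the Figure 1 caption; the field names are hypothetical."""
    state: str
    history: List[str]
    rules: str


def coordinate(parts: List[str]) -> str:
    """Stage (4): merge the non-empty text fragments into one commentary string."""
    return " ".join(p.strip() for p in parts if p.strip())


def generate_commentary(
    ctx: GameContext,
    guider: Callable[[GameContext], str],
    analyzer: Callable[[str], str],
    retriever: Callable[[str], List[str]],
) -> str:
    draft = guider(ctx)          # (1) preliminary commentary from state, history, rules
    analysis = analyzer(draft)   # (2) ToM-style analysis of the draft
    styled = retriever(draft)    # (3) style statements; querying with the draft is an assumption
    return coordinate([draft, analysis, *styled])  # (4) integrate all text


# Stub stages standing in for LLM calls (placeholders, not the paper's prompts).
ctx = GameContext(
    state="a pair of 7s in hand",
    history=["previous player: pair of 3s"],
    rules="Guandan",
)
result = generate_commentary(
    ctx,
    guider=lambda c: f"Current state: {c.state}.",
    analyzer=lambda draft: "The opponent may be saving a bomb.",
    retriever=lambda draft: ["A patient pass here."],
)
print(result)
```

Keeping each stage as a plain callable means any one of them can be replaced or removed independently, which is the kind of substitution an ablation over ToM order or retrieval requires.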

Figure 2. Simulation of multi-round outputs for different methods of game commentary.

Table 1. Evaluation results of commentary for open-source and closed-source models.

Model	Sentiment (Neg)	Sentiment (Neu)	Sentiment (Pos)	Sentiment (Compound)	Cosine Similarity	Lexical Diversity	SNOWNLP
GPT-3.5	0.0	1.0	0.0	0.0	0.0032	0.09	0.0
GPT-4	0.0	1.0	0.0	0.0	0.0380	1.0	0.0
Yi-34B	0.0	1.0	0.0	0.0	0.0250	0.55	0.99
GLM-4	0.0	1.0	0.0	0.0	0.0050	0.86	0.98
Our	0.0	1.0	0.0	0.0	0.7955	0.95	0.0
Original	0.0	1.0	0.0	0.0	-	1.0	0.0

Table 2. Human evaluation of the baseline models and our model.

	Match Consistency		Fluency
	KEI	Detail Accuracy	Naturalness	Information Organization	Logical Coherence
GPT-3.5	0.37	0.95	0.86	0.69	3.75
GPT-4	0.46	0.91	0.90	0.80	4.46
Yi-34B	0.23	0.87	0.82	0.62	3.52
GLM-4	0.32	0.83	0.84	0.58	2.72
Our	0.81	0.97	0.95	0.89	4.34

Table 3. Ablation study on the effect of the Tree-Based Retrieval Method (RAG) in commentary generation. The table compares different configurations of the system, including models with and without RAG and varying levels of Theory of Mind (ToM). The evaluation includes sentiment analysis (negative, neutral, positive, and compound scores), cosine similarity (measuring alignment with the original commentary), lexical diversity (indicating vocabulary richness), and SNOWNLP score (sentiment analysis score). Rows represent the following configurations: (1) Our(w/o RAG)(Vanilla): base model without RAG; (2) Our(w/o RAG)(1st-ToM): base model with first-order ToM; (3) Our(w/o RAG)(2nd-ToM): base model with second-order ToM; (4) Our(w RAG)(1st-ToM): model with RAG and first-order ToM; (5) Our(w RAG)(2nd-ToM): model with RAG and second-order ToM; and (6) Original: the reference commentary. Higher cosine similarity and lexical diversity indicate better performance in mimicking human-like commentary.

Model	Sentiment (Neg)	Sentiment (Neu)	Sentiment (Pos)	Sentiment (Compound)	Cosine Similarity	Lexical Diversity	SNOWNLP
Our(w/o RAG)(Vanilla)	0.0	0.0	1.0	0.0	0.0	0.87	1.0
Our(w/o RAG)(1st-ToM)	0.0	0.0	1.0	0.0	0.0126	1.0	1.0
Our(w/o RAG)(2nd-ToM)	0.0	1.0	0.0	0.0	0.0380	1.0	1.0
Our(w RAG)(1st-ToM)	0.0	1.0	0.0	0.0	0.7519	0.92	0.0
Our(w RAG)(2nd-ToM)	0.0	1.0	0.0	0.0	0.7955	0.95	0.0
Original	0.0	1.0	0.0	0.0	-	1.0	0.0
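Two of the automatic metrics named in the Table 3 caption, cosine similarity and lexical diversity, can be illustrated with a short sketch. This is one common way to compute them, not the authors' evaluation code: it assumes raw term-frequency vectors (the paper may use TF-IDF weighting instead), a type/token ratio for lexical diversity, and a whitespace tokenizer that is a simplification, since Chinese commentary would first need a word segmenter. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (assumption; Chinese text needs a segmenter)."""
    return text.lower().split()


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(freq_a[t] * freq_b[t] for t in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def lexical_diversity(text: str) -> float:
    """Type/token ratio: number of distinct tokens divided by total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


generated = "the player passes and waits for a better chance"
reference = "the player passes and waits for the right moment"
print(cosine_similarity(generated, reference))
print(lexical_diversity(generated))
```

Under this definition, a text with no repeated tokens has a lexical diversity of exactly 1.0, and two texts with no tokens in common have a cosine similarity of 0.0.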

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Su, J.; Tao, M.; Liang, X.; He, Y.; Tao, Y.; Zhang, M. Enhancing Commentary Strategies for Guandan: A Study of LLMs in Game Commentary Generation. Symmetry 2025, 17, 1274. https://doi.org/10.3390/sym17081274
