Toward Sustainable Education: Generative AI-Powered Argument Mining in Student Writing

Ren, Yupei; Zhang, Ning; Li, Xiaoyu; Zhang, Yadong; Chen, Yuqing; Lan, Man

doi:10.3390/su18073338


by Yupei Ren ¹,², Ning Zhang ³, Xiaoyu Li ⁴, Yadong Zhang ², Yuqing Chen ¹,² and Man Lan ¹,²,*

¹ Shanghai Institute of Artificial Intelligence for Education, East China Normal University, Shanghai 200062, China
² School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
³ College of Education, Zhejiang University, Hangzhou 310058, China
⁴ School of Education, Yangzhou University, Yangzhou 225009, China
* Author to whom correspondence should be addressed.

Sustainability 2026, 18(7), 3338; https://doi.org/10.3390/su18073338 (registering DOI)

Submission received: 18 January 2026 / Revised: 6 March 2026 / Accepted: 20 March 2026 / Published: 30 March 2026

Abstract

As critical elements in argumentative writing, argument components and strategies significantly influence argument quality. However, the existing research lacks an in-depth exploration of how students construct and utilize these elements in argumentative writing. This study first evaluates the performance of leading large language models (LLMs) in identifying argument components and strategies using three approaches: single-task learning (STL), chain-of-thought (CoT), and multi-task learning (MTL). With the aid of learning analytics methods (Epistemic Network Analysis (ENA) and two-mode network), the study further reveals the intrinsic mechanisms linking argument components, strategies, and writing quality. Specifically, the research trains and evaluates LLMs on 226 argumentative essays, encompassing 4726 components and 4837 strategies. Compared to basic STL, the CoT and MTL methods significantly improve LLMs’ performance in both tasks. Moreover, learning analytics indicate that high-quality essays possess rich and complex logical relations, presenting multidimensional and multi-layered reasoning structures, whereas low-quality essays predominantly rely on simple and repetitive connections, lacking deeper logical support. These findings have significant implications for the automated analysis of argumentative writing and the sustainable development of education, not only providing valuable insights for educators in argumentation instruction but also contributing to the systematic enhancement of students’ argumentative abilities and critical thinking.

Keywords:

sustainable development in education; AI for education; educational application; argument mining; learning analytics

1. Introduction

Argumentation is a complex cognitive skill that involves analyzing claims or evidence and employing reasoning to make decisions or solve problems [1]. It is not only a fundamental part of daily communication and thought processes [2] but also a core competence for contemporary students and a key educational objective [3]. In the fields of science and social studies, students are often required to respond to prompts through argumentative essays that use evidence and factual support to demonstrate understanding of specific topics [1,4]. Argumentative writing serves as a comprehensive reflection of learning outcomes, fostering language proficiency and written expression while developing higher-order thinking, such as logical reasoning, critical thinking, and creative cognition [5].

Despite its recognized importance, numerous studies indicate that students experience difficulties with argumentative writing [6]. One major reason is the lack of accurate assessment and targeted guidance, which makes it difficult for students to master the complex argument elements and strategies required for effective argumentation [7]. Several studies have developed scaffolds (e.g., scripts or templates) based on Toulmin Argumentation Theory [2] to enhance students’ argumentation skills [8]. However, these studies typically rely on traditional methods, such as questionnaires [9,10], interviews [11,12], and manual coding [4], to assess argumentation skills. These methods are time-consuming, labor-intensive, and prone to subjective bias, which limits reliability and scalability. The emergence of Generative Artificial Intelligence (GAI), represented by ChatGPT, has revealed new possibilities for the automated analysis of argumentative writing [13,14]. Nevertheless, the application of large language models (LLMs) in education remains in its infancy, requiring further investigation to effectively leverage their extensive knowledge and powerful learning capabilities for analyzing argumentative features in students’ writing.

Furthermore, improving students’ argumentation skills requires a deeper understanding of how argument structure relates to writing quality [15,16]. Existing studies have explored this relationship using content analysis [17], statistical analysis [1,18], and network analysis [19,20]. However, most studies focus on lexical features, syntax, or sentence types and functions while overlooking the impact of inter-sentence interactions on argument quality. In argumentative writing, the complex interplay between sentence components reflects diverse argument strategies, which are key indicators for evaluating argument structure [16,21]. This research gap not only limits a comprehensive exploration of students’ argumentation skills but also hinders the further development of instructional strategies and interventions aimed at optimizing writing through argument structure.

The evidence indicates that the existing methods for argument components and strategy prediction still face challenges in automated argumentative writing analysis, and the limited exploration of LLMs further impedes the advancement of both practice and theory. Additionally, argument components, argument strategies, and writing quality are closely interrelated, but their complex relationships remain largely unexplored. To address these gaps, this study employed chain-of-thought (CoT) prompting and multi-task learning (MTL) methods to systematically evaluate the performance of LLMs in the argument component and strategy prediction tasks. Furthermore, we utilized learning analytics methods to investigate the relationships between these two elements and writing quality. To sum up, this research attempted to address the following research questions:

RQ1. How effectively do LLMs perform in identifying argument components and argument strategies in students’ argumentative writing?
RQ2. What are the relationships among argument components, argument strategies, and writing quality?

2. Related Work

2.1. Argument Components and Strategies in Student Essays

In the study of argumentative writing, scholars widely agree that the organizational structure of an article is critical for conveying ideas and strengthening arguments [22]. Traditional analytical frameworks typically consist of three sections: introduction, body, and conclusion [23]. The body, as the core of the essay, concentrates on the author’s analysis and argumentation of the thesis, as well as the presentation of supporting evidence. In recent years, researchers have increasingly focused on the organization of different argument components (e.g., claims, supporting evidence, and counterarguments) within essays [1,15]. Typically, authors use diverse argument components to strengthen their viewpoints. For instance, the major claim serves as the core element of argumentative writing for establishing the author’s stance, while supporting evidence, such as quotations and examples, enhances the rationality and credibility of the argument [21,24]. Studies have shown that, in high-quality argumentative essays, authors skillfully employ and organize various argument components, making the essay more persuasive and intellectually profound [1,16].

Argument strategies represent another critical dimension of argumentative writing. The core of argument analysis lies in understanding the content of argument chains, analyzing linguistic structures, and identifying the relations between argument units so as to uncover the argument structure of the text [25]. High-quality argumentation not only relies on the rational arrangement of claims but also requires the strategic use of various linguistic techniques to engage readers’ emotions and rationality, prompting reflection and interest [26]. For example, metaphorical argumentation helps to visualize abstract ideas, making arguments more vivid and persuasive, thereby facilitating readers’ comprehension and acceptance [27]; comparative argumentation highlights critical points by exposing similarities or differences between contrasting entities, thus enhancing the logicality and hierarchy [28]. Thus, understanding and mastering argument components and strategies is not only crucial for improving writing quality but also provides learners with a clear writing framework and actionable guidance for refinement.

Although these studies provide valuable insights into argumentative writing, certain research gaps remain. The existing research primarily focuses on sentence-level types and functions, overlooking the interactive relations between sentences and the impact of argument strategies on argument structure and quality [19,29]. Furthermore, no research has investigated the synergy between argument components and argument strategies. In fact, these two aspects are closely interconnected and mutually reinforcing, collectively shaping the argumentative process. Analyzing either in isolation fails to fully capture the critical factors affecting argumentative writing quality. Therefore, it is necessary to further examine the combined influence of argument components and strategies on writing quality, enabling a holistic understanding of the complexity of argumentative writing.

2.2. Automated Argumentation Analysis Techniques

Machine learning and deep learning methods have been increasingly applied to the automated analysis of student argumentative essays. Traditional machine learning approaches primarily rely on shallow linguistic features to classify argument components [30,31]. Although machine learning methods based on feature engineering offer interpretability, the feature construction process is time-consuming, labor-intensive, and heavily reliant on expert knowledge, with performance mainly determined by the quality of the extracted features. In contrast, deep learning methods are able to automatically capture complex linguistic features and deep semantic relationships through neural networks by simulating the way the human brain processes information, and they are regarded as advanced tools for data analysis in various fields. Several studies have demonstrated that the pretrained language model BERT significantly outperforms traditional machine learning classifiers in analyzing student argumentative essays [29,32].

The development of LLMs has brought significant advancements to computational argumentation. Researchers systematically evaluated the potential of LLMs in various argumentative scenarios through two types of tasks: argument mining and argument generation [33]. The experimental results demonstrate that LLMs perform well in most tasks and significantly narrow the gap with fine-tuned pretrained models in few-shot settings. Several studies have assessed LLMs’ performance in stance classification [34] and argument relation detection [35,36], finding that LLMs surpass fully fine-tuned traditional smaller models with only a few in-context examples. In the education domain, preliminary studies have highlighted LLMs’ significant potential in writing assessment and assistance, particularly in improving writing skills, enhancing learning experiences, and optimizing resources [37,38]. For example, several studies [16,24] demonstrated that LLMs significantly outperform smaller pretrained models in argument component identification and relation prediction tasks. However, the existing research primarily focuses on isolated exploration of argument mining subtasks, neglecting task interdependencies. Although there is a close relationship between argument component prediction and strategy prediction, a joint analysis of the two within the LLM framework remains unexplored. Additionally, research on LLMs in student argumentative writing remains scarce, which limits the in-depth understanding of their argument analysis capabilities and hinders further application and development in educational research and practice.

2.3. Large Language Models for Sustainable Education

Sustainable education aims to cultivate learners’ core competencies to address contemporary and future societal and environmental needs [39]. Deeply shaped by historical developments, cultural ideologies, and social contexts, educational systems in different countries exhibit notable variations. For instance, developed countries like the United States emphasize inquiry-based and student-centered teaching strategies, focusing on the development of students’ critical thinking and innovative abilities [40]. In contrast, China’s traditional education tends to be teacher-centered, prioritizing knowledge transmission and memorization, which, to some extent, restricts the development of students’ critical thinking [41].

With the advent of the digital age, technology has become increasingly pivotal in sustainable education. Artificial intelligence offers robust support for narrowing educational disparities and fostering student development through personalized learning platforms, virtual classrooms, and the digital integration of educational resources. For example, Jurišević et al. [42] explored the application of LLMs in education and the efforts in logical reasoning, critical thinking, and creative cognition. Park et al. [43] proposed a debate chatbot framework based on LLMs, achieving intelligent assessment of students’ critical thinking abilities. These studies highlight the potential of AI in advancing sustainable education. However, the application of LLMs in sustainable education remains in its exploratory stages, and further investigation is needed to determine how LLMs and AI technologies can contribute to the high-quality development of sustainable education.

3. Methods

3.1. Research Design

The research design consists of four phases (Figure 1): data collection, data preprocessing and annotation, model building and evaluation, and data analysis.

Phase 1: Collect learners’ argumentative essay data in the final exam scenario, along with the corresponding writing scores and task requirements.
Phase 2: Preprocess the data by converting scanned essay images into textual data, removing incomplete or invalid samples, and segmenting valid data into paragraphs and sentences. On this basis, annotate the argument components and strategies according to the coding scheme.
Phase 3: Using the annotated dataset, develop, train, and evaluate several advanced Chinese open-source LLMs. This phase focuses on comparing the performance of these models under single-task learning (STL), chain-of-thought (CoT), and multi-task learning (MTL) approaches to address the first research question.
Phase 4: Employ learning analytics (ENA [44] and two-mode network [45]) methods to investigate the relationships among argument components, argument strategies, and writing quality. This phase seeks to answer the second research question.

3.2. Research Data

The data was sourced from three high schools in eastern China, involving 228 first-year high school students who participated in an argumentative writing course. This course was designed to guide students in mastering the skills of expressing and arguing viewpoints, with dedicated exercises in argumentative writing. In the final Chinese language examination, argumentative essay writing served as the culminating task, requiring students to develop arguments on specific topics within time constraints. In China, high school education follows a unified national curriculum standard for all subjects, including Chinese argumentative writing teaching. There is no essential difference in teaching programs or training systems across all participating schools.

The original dataset consisted of scanned images of student exam essays, including the full text and scores assigned by teachers according to the Chinese National College Entrance Examination (Gaokao) scoring criteria (see Appendix A for detailed standards). Among the 228 students, 2 failed to complete the writing task, resulting in 226 essays for analysis. The essays ranged from 557 to 1101 words in length, with an average of approximately 829.82 words. Figure 2 provides a detailed breakdown of the score distribution, which is categorized according to the Gaokao scoring criteria. We adopted absolute scoring criteria for grouping: essays scoring within Categories I and II were classified as the high-quality group, while those in Categories III, IV, and V were designated as the low-quality group.

3.3. Coding Scheme

For each argumentative essay, we coded the argument structure from the perspectives of argument components and argument strategies.

3.3.1. Argument Component

The classic Toulmin model of argument [2] revolves around three key elements: a claim, or the assertion that needs to be argued; data that provide supporting evidence for the claim; and a warrant that explains how the data support the claim. On this basis, the assertions are further subdivided into major claim, claim, and restated claim according to their importance and position. To gain a more comprehensive understanding of the sources and attributes of evidence, we further categorize evidence into five types: fact, anecdote, quotation, proverb, and axiom. Elaboration denotes the further presentation, explanation, or analysis of assertions or evidence. Additionally, an Other type is introduced to represent non-argument sentences.

Overall, we defined 4 coarse-grained and 10 fine-grained types of argument components. See Appendix B.1 for more details.

3.3.2. Argument Strategy

Based on previous research [16], we annotate strategies from the perspectives of vertical argument relations and horizontal discourse relations.

The vertical dimension explores the relationships among different types of argument components to reveal the internal logic and reasoning chains. To characterize the interactions among these components, this study defines 10 types of argument relations encompassing three perspectives. Specifically, we categorize three stance-based argument relations: Positive, Negative, and Comparative argumentation. Aligned with educational practice, two evidence-based argument relations are identified: Example and Citation argumentation. To thoroughly gain insight into the argument process, we incorporate discourse analysis theory and establish five discourse-based argument relations. Specifically, inspired by the Rhetorical Structure Theory (RST) framework [46], we propose three additional categories: Background, Detail, and Restatement Relation. Following Walton’s scheme [47], Hypothetical Argumentation is included, which holds notable relevance in Chinese argumentative contexts. Furthermore, recognizing the significance of metaphoric rhetoric, we introduce Metaphorical Argumentation as a distinct type of argument relation.

The horizontal dimension investigates the relationships among components belonging to the same category, aiming to analyze the interactions and cooperation between elements within the same level from a more global perspective. Building on previous research [16], we employ four discourse relations: Coherence, Progression, Contrast, and Concession, to characterize the logical transitions that occur between arguments of the same type.

Overall, we defined 14 fine-grained relations across two dimensions to capture various argument strategies in essays. For detailed information, see Appendix B.2.

For a detailed example of the essay annotation results, see Figure 3.

3.3.3. Coding Process and Result

In Phase 2, we coded argument components and strategies of the argumentative essays. Referring to previous studies [21,24], we annotated argument components at the sentence level. The annotation team consisted of six coders and two domain experts. Each essay was independently annotated by two annotators, with arbitration provided by experts. Due to the complexity of the annotation, the process was divided into two steps: first, annotating the types of argument components in the sentences, and then annotating the strategic relations between components on that basis. The Cohen’s kappa coefficients for the annotation reliability of argument components and strategies were 0.76 and 0.68, respectively. Given the difficulty and complexity of the coding tasks, these values indicate reasonable and relatively high agreement [48,49]. Notably, prior to strategy annotation, two experts annotated the boundaries of sentence-level argument components (Cohen’s kappa 0.95) to address the issue of identifying argument units composed of consecutive identical components. The final dataset consists of 226 Chinese argumentative essays, with a total of 4726 argument components and 4837 argument strategies. Detailed distribution can be found in Figure 4 and Figure 5.
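The agreement statistic reported above can be illustrated with a minimal sketch of Cohen's kappa for two annotators. The label lists are hypothetical toy data, not the study's annotations, and the function name `cohen_kappa` is introduced here only for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, computed from each annotator's label marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy annotations (not from the study's data).
coder_1 = ["Claim", "Fact", "Claim", "Elaboration", "Quotation", "Claim"]
coder_2 = ["Claim", "Fact", "Elaboration", "Elaboration", "Quotation", "Claim"]
kappa = cohen_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

Kappa discounts the agreement that two annotators would reach by chance, which is why it is lower than the raw proportion of matching labels.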

3.4. Automated Classification of Argument Component and Strategy Using LLMs

In Phase 3, we systematically evaluated the performance of LLMs in argument component and argument strategy prediction tasks. It has been shown that LLMs exhibit significant advantages in argument mining tasks compared to smaller pretrained language models [16,36]. However, the existing studies primarily employed basic supervised fine-tuning within single-task learning (STL) frameworks, which only provides preliminary insights into LLMs’ argument parsing capabilities while neglecting the intrinsic connections between the two tasks. In argumentative writing, argument components form the foundation of relations and strategies, and the interaction strategies between components are closely tied to their types. Based on this, we innovatively investigate LLMs’ argumentative reasoning capabilities from three dimensions: STL baselines, CoT prompting, and MTL methods. In addition, we provide zero-shot results of all evaluated models as a reference for performance comparison.

The argument component prediction task aims to detect and classify all potential components in argumentative essays. Following previous research [16], we formulate it as a sentence-level classification task and use BIO tagging to represent structural span information (details of BIO provided in Appendix C.1). In implementation, we prepend each sentence with a special token #ID as a sequence identifier and employ the sentence #ID along with type sequences as generation targets to achieve automatic prediction. The argument strategy prediction task aims to detect and classify all potential relations between argument components (note: multiple relation types may exist between a single pair of components). Considering the generative nature of LLMs and the requirements for subsequent MTL, we define this task as a fixed-pattern sequence generation problem, with sentence #IDs and relation types as output targets. Notably, we developed a systematic prompt engineering framework that integrates role settings and domain knowledge to construct instruction fine-tuning data, aiming to improve the LLMs’ ability to accurately identify components and strategies. We take argument component classification and argument strategy prediction under STL settings as baseline models. For detailed instruction information, refer to Appendix C.2.
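The sentence-level BIO formulation with #ID prefixes might look roughly like the sketch below. The exact scheme is given in Appendix C.1, which is not reproduced here, so the details are assumptions: consecutive sentences of the same type are merged into one unit (a simplification, since the study relied on expert-annotated boundaries), the non-argument type maps to "O", and the source/target string layout is hypothetical.

```python
def bio_tags(sentence_types, outside_label="Other"):
    """Turn per-sentence component types into BIO tags.

    Assumption: consecutive sentences with the same type form one unit,
    and the non-argument type maps to "O".
    """
    tags, previous = [], None
    for current in sentence_types:
        if current == outside_label:
            tags.append("O")
        elif current == previous:
            tags.append(f"I-{current}")  # continues the running unit
        else:
            tags.append(f"B-{current}")  # opens a new unit
        previous = current
    return tags

def build_example(sentences, sentence_types):
    """Build a (source, target) pair with #ID prefixes (hypothetical layout)."""
    tags = bio_tags(sentence_types)
    source = " ".join(f"#{i} {s}" for i, s in enumerate(sentences, start=1))
    target = " ".join(f"#{i} {t}" for i, t in enumerate(tags, start=1))
    return source, target

# Hypothetical toy input.
source, target = build_example(["A.", "B.", "C."], ["Claim", "Claim", "Other"])
print(target)
```

Emitting the sentence identifier next to each tag keeps the generated sequence aligned with the input, so a malformed or truncated output can be detected by checking the IDs.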

For the CoT prompting, we designed a two-stage reasoning guidance mechanism. By adding reasoning path prompts in natural language form, it guides LLMs to gradually deduce argument components and strategies in essays, thereby exploring the capabilities of LLMs in simulating the human process of step-by-step argument analysis (see Appendix C.3 for complete prompt). In the MTL framework, we constructed multi-task training data based on the original two sets of single-task training data combined with special task token markers, facilitating knowledge transfer between tasks. The core of MTL lies in simultaneously learning both task-shared and task-specific representations, enhancing the generalization and argument reasoning of LLMs (see Appendix C.4 for detailed data).
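One way to picture the multi-task data construction is the sketch below, which prefixes each single-task example with a task marker and mixes the two sets. The marker strings `[COMPONENT]` and `[STRATEGY]`, the record layout, and the toy examples are all hypothetical; the actual data format is described in Appendix C.4.

```python
import random

COMPONENT_TOKEN = "[COMPONENT]"  # hypothetical task marker
STRATEGY_TOKEN = "[STRATEGY]"    # hypothetical task marker

def build_mtl_data(component_examples, strategy_examples, seed=0):
    """Merge two single-task (prompt, target) sets into one shuffled MTL set."""
    mixed = [
        {"prompt": f"{COMPONENT_TOKEN} {prompt}", "target": target}
        for prompt, target in component_examples
    ] + [
        {"prompt": f"{STRATEGY_TOKEN} {prompt}", "target": target}
        for prompt, target in strategy_examples
    ]
    # Shuffle so that each training batch can contain either task.
    random.Random(seed).shuffle(mixed)
    return mixed

# Hypothetical toy examples; target formats are illustrative only.
essay = "#1 Reading widens the mind. #2 A survey found that readers score higher."
component_examples = [(essay, "#1 B-Claim #2 B-Fact")]
strategy_examples = [(essay, "#2 -> #1 Example")]
mixed = build_mtl_data(component_examples, strategy_examples)
print(len(mixed))
```

Because both tasks share the same essay text as input, the task marker is the only signal telling the model which output pattern to generate.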

All experiments employed supervised fine-tuning based on LoRA technology [50] to evaluate the performance of multiple leading open-source Chinese LLMs. Specifically, we utilized DeepSeek-R1-Distill-Qwen-7B, Qwen3-8B-Base, and ChatGLM-4-9B-Base, which are known for superior performance on Chinese language tasks.
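For reference, the low-rank update that LoRA [50] applies to a frozen pretrained weight matrix is usually written as follows. The symbols $W_0$, $A$, $B$, $r$, and the scaling factor $\alpha$ follow the standard LoRA formulation rather than notation from this study, and the value of $\alpha$ used in the experiments is not reported here.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ are trained while $W_0$ stays fixed, which is what makes fine-tuning 7B to 9B models feasible on a single GPU.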

3.5. Data Analysis

To address the first research question, we evaluated the performance of LLMs on argument component and strategy prediction tasks. A total of 226 annotated essays were initially divided into training, validation, and test sets through random sampling at a ratio of approximately 8:1:1. Subsequently, manual adjustments were performed based on statistical analysis of each subset to ensure balanced representation of all component and strategy types across subsets while maintaining proportional consistency. Detailed distribution can be found in Figure 6. For model training configurations, we adopted the AdamW optimizer with a learning rate of $5 \times 10^{-5}$ to update the model parameters and set the batch size to 1. We employed a LoRA rank of 8 and a dropout rate of 0.1 across all training sessions. All other hyperparameters were initialized with the default values. All experiments were conducted on a single NVIDIA RTX 3090 GPU (Santa Clara, CA, USA). For evaluation metrics, we used Micro-Precision, Micro-Recall, Micro-$F_{1}$, and Macro-$F_{1}$ for argument component prediction. A correct prediction was defined as a complete match with the gold standard in both boundaries and category labels. For argument strategy prediction, we employed Micro-Precision, Micro-Recall, Chunk-$F_{1}$, and Sentence-$F_{1}$ (all calculated at the micro level). Chunk-$F_{1}$ quantifies the model’s ability to precisely identify argument strategies, requiring exact alignment of both argument component boundaries and relation types. In contrast, Sentence-$F_{1}$ adopts a more lenient sentence-level alignment, allowing for partial boundary discrepancies. All experiments were run three times, with average results reported.
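The exact-match criterion for component prediction can be made concrete with a short sketch of micro- and macro-averaged $F_{1}$ over labelled spans. The span representation `(start_id, end_id, label)`, the function names, and the toy data are assumptions for illustration; macro averaging here runs over the labels that occur in either the gold or the predicted set.

```python
from collections import defaultdict

def f1(tp, fp, fn):
    """F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def micro_macro_f1(gold, pred):
    """gold/pred: iterables of (start_id, end_id, label) spans."""
    gold, pred = set(gold), set(pred)
    counts = defaultdict(lambda: [0, 0, 0])  # label -> [tp, fp, fn]
    for span in pred:
        # A prediction counts only if boundaries and label both match.
        counts[span[2]][0 if span in gold else 1] += 1
    for span in gold - pred:
        counts[span[2]][2] += 1
    micro = f1(*(sum(c[i] for c in counts.values()) for i in range(3)))
    macro = sum(f1(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return micro, macro

# Hypothetical toy spans: the predicted Claim has a wrong right boundary.
gold = [(1, 1, "Major Claim"), (2, 3, "Claim"), (4, 4, "Fact"), (5, 5, "Fact")]
pred = [(1, 1, "Major Claim"), (2, 2, "Claim"), (4, 4, "Fact"), (5, 5, "Fact")]
micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
```

In this toy case the single boundary error costs more under macro averaging, because the Claim class scores zero and every class is weighted equally regardless of its frequency.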

To address the second research question, network-based learning analytics were utilized to visualize the argument structures, enabling a deeper exploration of the development and changes in argumentative elements across the two groups. Specifically, ENA was applied to compare the argument components or strategies in high- and low-quality essays, while a two-mode network was used to analyze the connection networks that involve both components and strategies. ENA focuses on the independent analysis of argument components or strategies, whereas the two-mode network emphasizes the integrated analysis of both, aiming to systematically reveal the characteristics and differences in argumentative writing across performance levels. Further information on ENA and two-mode network is provided in Appendix C.5.
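As a rough illustration of the two-mode idea, the sketch below builds weighted links between component nodes and strategy nodes from annotated relations. The triple format `(source_type, strategy, target_type)`, the counting rule, and the toy relations are assumptions; the procedure actually used is described in Appendix C.5.

```python
from collections import Counter

def two_mode_edges(relations):
    """Count component-strategy links from (source_type, strategy, target_type) triples."""
    edges = Counter()
    for source_type, strategy, target_type in relations:
        # Link each distinct endpoint type to the strategy node once, so a
        # relation between two components of the same type counts once.
        for component in {source_type, target_type}:
            edges[(component, strategy)] += 1
    return edges

def weighted_degree(edges):
    """Sum of link weights per component node."""
    degree = Counter()
    for (component, _strategy), weight in edges.items():
        degree[component] += weight
    return degree

# Hypothetical toy relations (not from the study's data).
relations = [
    ("Fact", "Example", "Claim"),
    ("Quotation", "Citation", "Claim"),
    ("Claim", "Progression", "Claim"),
    ("Claim", "Positive", "Major Claim"),
]
edges = two_mode_edges(relations)
degree = weighted_degree(edges)
print(degree.most_common(1))
```

A component with a high weighted degree across many distinct strategies sits at the centre of such a network, which is the kind of pattern the group comparison looks for.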

4. Results

4.1. Empirical Comparison of Leading LLMs in Identifying Argument Components and Strategies

Table 1 presents the performance of the different LLMs on the argument component prediction task using four methods: zero-shot, STL, CoT, and MTL. Overall, the MTL method significantly outperforms STL and CoT in precision, recall, and Micro-$F_{1}$, highlighting its advantage in identifying argument components. This superiority mainly stems from MTL’s ability to enhance the semantic understanding of argument structures by sharing underlying feature representations, as well as leveraging data augmentation and information complementarity between tasks to improve model performance. Specifically, as a related task, argument strategy prediction aims to identify argument component pairs and their relations, providing strong support for argument component prediction. Notably, fluctuations in the Macro-$F_{1}$ metric may be attributed to the imbalanced distribution of component categories in the dataset. In terms of model performance, ChatGLM (9B) achieved the best results, followed by Qwen (8B), with DeepSeek (7B) performing relatively worse. This reflects the significant differences in performance among the different models. Furthermore, the performance gains from MTL diminish as the model parameters increase, indicating a nonlinear relationship between model capacity and the efficacy of MTL. This may be attributed to larger models already possessing robust capabilities, leaving narrower optimization margins for MTL.

Table 2 reports the performance of various LLMs on the argument strategy prediction task using four methods: zero-shot, STL, CoT, and MTL. Overall, the CoT method significantly outperforms STL and MTL across all the evaluation metrics, demonstrating its advantage in identifying argument strategies. This superiority mainly stems from CoT prompting’s two-stage reasoning mechanism, which mimics human cognitive processes: first guiding LLMs to identify argument components within essays and then analyzing the strategic relations between these components. This structured reasoning pathway markedly enhances the models’ ability to handle complex argumentation. In terms of model performance, ChatGLM (9B) once again achieved the best results, followed by Qwen (8B), with DeepSeek (7B) performing the worst. This further reinforces ChatGLM’s outstanding performance in Chinese argument analysis. Notably, all the models exhibited consistent and significant improvements under the CoT method, reinforcing its generalizability and effectiveness in automated argument strategy analysis.

To further validate the effectiveness of the proposed CoT and MTL methods, comparative experiments were conducted on models of different scales. Figure 7 presents the performance results of the Qwen3 series models (0.6 B, 1.7 B, 4 B, and 8 B) on the two tasks. The experiments demonstrate that both the CoT and MTL methods significantly enhance model performance, with more pronounced improvements observed in smaller-scale models. Additionally, the overall results are consistent with the earlier experiments: the MTL method achieves optimal performance on the argument component detection task, while the CoT method exhibits greater advantages in the argument strategy prediction task. Furthermore, model performance shows a steady improvement as model size increases. This aligns with the Scaling Laws [51], which state that larger models generally exhibit better performance.

4.2. The Relationships Among Argument Components, Strategies, and Writing Quality

The ENA graph illustrates the structural differences of sentence component types between high- and low-quality argumentative essays (see Figure 8a). Compared to low-quality essays, high-quality essays exhibit a more diverse structure of argument components, characterized by a co-occurrence network centered around Claim and Quotation. This indicates that, in high-quality essays, Claim and Quotation play a central role as they frequently co-occur with other argument components, such as Major Claim, Fact, and Elaboration. Specifically, argument component structures like Claim-Quotation, Claim-Fact, Claim-Elaboration, Claim-Major Claim, Quotation-Fact, and Quotation-Major Claim are more prominent in high-quality essays. This suggests that high-quality essays construct Claims from multiple perspectives by leveraging diverse argument components, supported by rich Quotations and Fact evidence, along with detailed Elaboration, thereby significantly strengthening the logicality and persuasiveness of the argumentation. In contrast, low-quality essays exhibit a simpler structure, with Others serving as the central argument component, forming strong connections, such as Others-Major Claim, Others-Fact, and Others-Elaboration. This indicates that low-quality essays contain more irrelevant content and fail to thoroughly substantiate Major Claim from multiple perspectives, resulting in weaker argument quality. Overall, high-quality essays demonstrate more complex and integrated use of diverse components, while low-quality essays show limitations in developing and integrating such components or even deviate from the argument theme.

Additionally, the ENA results reveal structural differences in the use of argument strategies between the high-quality and low-quality groups (see Figure 8b). While both groups display a certain level of structural complexity, differences are observed in the types of strategies interacting between arguments. High-quality essays present a more diverse and interconnected strategic network, with central strategies such as Concession, Progression, and Background. These central strategies are often combined with Positive, Example, and Detail argumentation, which reflect the characteristics of logically progressive argumentation. This suggests a preference for building arguments through logical progression, where claims are supported with positive evidence, further elaborated using background information, and strengthened with detailed explanations, while concessive strategies are employed to enhance critical depth. In contrast, low-quality essays tend to rely more on Coherence and Negative strategies, which are frequently linked to Positive, Example, and Detail strategies. This reflects a tendency to construct arguments based on straightforward parallel logic, such as simply balancing opposing views or listing evidence, without deeper structural development. Overall, high-quality essays demonstrate greater strategic variety and hierarchical complexity, whereas low-quality essays exhibit simpler and more linear argumentative structures, which limits the depth and effectiveness of argumentation.

Furthermore, the two-mode network compares the connections between high- and low-quality essays in terms of both argument components and strategies (see Figure 9). Notably, Claim occupies a central position in both groups of essays, consistent with the foundational principle of argumentative writing that claims are used to substantiate the major claim. However, high-quality essays emphasize Claim, Elaboration, and Quotation as the core components of an argument, which are closely connected with diverse strategies such as Positive, Concession, Progression, Background, Example, and Citation, showcasing a rich and complex argument structure. In contrast, low-quality essays focus on Claim, Major Claim, and Restated Claim as the core components of an argument, primarily linked with strategies such as Example, Negative, and Contrast, presenting a repetitive and simplistic argumentative approach. This disparity underscores the importance of integrating varied argument components and strategies to construct compelling arguments; the lack of such integration in low-quality essays diminishes their persuasiveness, leaving their arguments monotonous and insufficiently developed.

4.3. Case Study of LLM Prediction Results

Taking the essay in Figure 3 as an example, we conducted a comparative analysis of the performance of the Qwen3-4B model under different methods. As shown in Table 3, after introducing the CoT and MTL methods, the model’s ability to identify argument components and predict relations was significantly improved. This outcome aligns with the findings of previous experiments, demonstrating that CoT and MTL methods can effectively enhance the model’s performance in the task of automatic argument structure parsing.

5. Discussion

This study systematically evaluates the effectiveness of leading LLMs in identifying argument components and strategies in essays while investigating the relationships among argument components, argument strategies, and writing quality. This section will discuss the key findings, provide insights for argumentation instruction and the sustainable development of education, and propose directions for future research.

5.1. Automated Classification of Argument Components and Strategies Using LLMs

Argumentative writing plays an important role in cultivating students’ comprehensive abilities and supporting future development. However, in actual teaching practice, faced with the high volume of student essays, it is difficult for teachers to provide timely detailed analysis and personalized guidance. This study developed three methods, STL, CoT, and MTL, to systematically investigate the capabilities of advanced LLMs in automatically parsing student argumentative writing. The experimental results demonstrate that the CoT method, by simulating the two-stage cognitive process of human argumentative reasoning, effectively enhances the models’ ability to handle complex argumentative reasoning tasks, achieving the best performance in argument strategy prediction tasks among all the LLMs. Additionally, the MTL framework facilitates effective knowledge transfer through joint learning of argument components and strategy prediction, achieving optimal performance in most metrics for argument component prediction tasks. These findings align with previous studies [52,53], underscoring the synergistic effects of multi-task learning in natural language processing tasks. Notably, the models’ performance in strategy prediction is relatively suboptimal, which may be attributed to the inherent complexity of the task and the limited scale of the training data. Future research could consider expanding the data scale or developing data augmentation methods to further improve the models’ argumentative analysis capabilities.

5.2. Variations in Argumentation Between Different Writing Qualities

The comparison between high-quality and low-quality essays in argument component structures and argument strategy structures reveals significant differences in complexity and diversity. Firstly, high-quality essays demonstrate a strong ability to integrate argument components, centering around Claim and Quotation while frequently connecting with others, such as Elaboration and Fact, to form intricate argument component structures. This aligns with the previous research [22], which emphasized the importance of diverse components in constructing persuasive arguments. In contrast, low-quality essays rely on simpler structures, centered around Others, with limited integration between diverse components, resulting in fragmented and biased arguments. Secondly, high-quality essays employ more diverse and sophisticated argument strategies, such as Concession, Progression, and Background, which enhance the flexibility and criticality of argumentation. Low-quality essays, however, primarily rely on basic strategies like Coherence and Negative, whose straightforward parallel logic restricts the depth and effectiveness of argumentation. Finally, the connections between argument components and strategies further highlight these differences. High-quality essays exhibit rich and intricate connections, such as Claim connected to Progression and Background, constructing multidimensional and layered reasoning. Conversely, low-quality essays feature repetitive and simplistic connections, primarily relying on Claim, Major Claim, and Restated Claim to reinforce argumentation, lacking deeper logical support. These findings further validate previous studies [1,22], indicating that effective academic argumentation depends on the integration of diverse argument components and strategies.

5.3. Theoretical and Practical Implications

These findings have significant theoretical and practical implications. First, they shift the analytical lens from traditional sentence-level classification to a structural view of argumentative writing. By adopting network-based approaches, the study underscores how the co-occurrence and connection of argument components and strategies contribute to the overall coherence and persuasiveness of a text. This perspective challenges linear models of argumentation and highlights the need to understand argument quality as an emergent property of structural complexity. Second, the findings are generated through data-driven analysis rather than fixed scoring rubrics, allowing the identification of organic patterns that characterize high-quality writing. This distinction is important because it emphasizes discovery rather than conformity, provides a more authentic representation of student reasoning processes, and reveals features that rubrics may neglect. Third, the study contributes methodologically by integrating LLMs and network analysis into argumentation research. This fusion opens up new possibilities for analyzing large-scale student writing data with greater precision and depth, helping to build a foundation for scalable evidence-based research on argumentation in different educational settings.

In practice, the findings provide meaningful guidance for writing instruction and the design of AI-supported educational technologies. First, by identifying key argument structures of high-quality essays, such as frequent use of quotations, elaboration, and advanced strategies like concession and progression, educators can better understand what to emphasize in writing instruction. These insights help to shift the instructional focus from surface features (e.g., grammar or length) to deeper argumentative functions, promoting higher-order thinking. Second, this study demonstrates the practical potential of combining LLMs with network analysis to support argumentative writing. LLMs can automatically identify argument components and strategies in student argumentative essays, while network analysis helps to uncover argument structures. The system can compare a student’s argument structure with typical argument structures from high- and low-quality essays to generate targeted formative feedback [29]. This feedback helps students to better understand and optimize their argument structures, thereby enhancing writing quality. Third, the findings highlight the need to rethink existing assessment practices in argumentative writing. Traditional scoring often focuses on surface-level accuracy or isolated content features, which may overlook deep structural qualities. By introducing structural analysis into assessment, educators and institutions can develop more nuanced evaluation systems that capture the complexity of students’ reasoning patterns. This can complement human grading by offering insights into how students construct arguments, enabling more equitable and formative assessments that align with the goals of higher-order learning and promote sustainable education.

5.4. Limitations and Future Directions

Several limitations should be noted. First, the dataset used in this study is relatively limited, and the distribution of various types is unbalanced, which may restrict the model’s ability to learn information and the representativeness of the evaluation results in specific domains. To this end, future research could improve the model’s learning effectiveness by expanding the dataset, balancing category distribution, or leveraging LLMs to generate more high-quality supplementary data. Second, the analysis focused on the final written products, overlooking the underlying cognitive and regulatory processes involved in argument construction, such as planning, monitoring, and revision. This limits the understanding of how the argument components and strategies emerge and evolve during the writing process. Future research should incorporate process-tracing data, such as keystroke logs, screen recordings, or think-aloud protocols, to capture the temporal dynamics of argumentative writing and reveal how learners regulate and refine their ideas over time. Finally, while this study offers valuable insight into the argument structure of argumentative writing through advanced analytic methods, it remains a post hoc analysis and lacks empirical validation in real instructional settings. Future research should design and implement controlled intervention studies that translate these findings into pedagogical strategies in order to evaluate how targeted instruction informed by automated analysis can support the development of students’ argumentative skills in authentic classroom environments. Overall, this study provides a scalable methodological framework for the automated identification and structural analysis of argumentative writing while establishing a solid foundation for advancing AI-driven assessment and personalized instruction in the field of education, contributing to the sustainable development of education.

6. Conclusions

Given the importance of argumentative writing in education, this paper systematically investigates the argument analysis capabilities of advanced LLMs using three approaches: single-task learning (STL), chain-of-thought (CoT), and multi-task learning (MTL). The goal is to automatically identify argument components and strategies in student essays and to investigate their relationship with writing quality. The research dataset is derived from a high school examination context and consists of 226 argumentative essays, containing a total of 4726 argument components and 4837 argument strategies. The results demonstrate that the CoT approach enhances LLMs’ argument reasoning by simulating the two-stage cognitive process of human argumentation, while the MTL method facilitates effective knowledge transfer through joint learning of component and strategy prediction. Furthermore, learning analytics reveal significant differences between essays of varying proficiency levels. High-quality essays exhibit stronger integration of argument components, more diverse and sophisticated use of argument strategies, and deeper logical structures. These findings provide both theoretical insights for automated argument analysis and practical implications for supporting writing instruction, offering technological support and innovative pathways for advancing the sustainable development of education.

Author Contributions

Conceptualization, Y.R. and N.Z.; methodology, Y.R. and N.Z.; formal analysis, Y.R. and N.Z.; investigation, Y.R.; data curation, Y.R. and Y.C.; writing—original draft preparation, Y.R. and N.Z.; writing—review and editing, X.L., Y.Z. and Y.C.; supervision, M.L.; project administration, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Provincial College Student Science and Technology Innovation Program, grant number 2025R401177.

Institutional Review Board Statement

Ethical review and approval were waived for this study by the institutional committee because it uses only anonymous student essay data and involves no personal privacy concerns or human intervention.

Informed Consent Statement

Informed consent for participation was obtained from all subjects involved in the study.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. More Details of Research Data

The Chinese National College Entrance Examination (Gaokao) follows a unified scoring rubric, and the detailed essay evaluation criteria are presented in Table A1.

Table A1. The Chinese National College Entrance Examination essay scoring criteria.

Level	Scoring Criteria
I (63–70)	Accurately comprehends the material, appropriate perspective, profound thesis, prominent central idea, substantial content, genuine emotion, rigorous structure, originality, and literary merit.
II (52–62)	Basically correct understanding of the material, relatively appropriate perspective, relatively profound thesis, clear central idea, relatively substantial content, authentic emotion, complete structure, and fluent language.
III (39–51)	Fairly comprehends the material, fairly appropriate perspective, general thesis, fairly clear central idea, fairly substantial content, fairly authentic emotion, basically complete structure, basically fluent language, with occasional grammatical errors.
IV (21–38)	Deviates from the material, inappropriate thesis or perspective, unclear central idea, thin content, incomplete structure, inadequate language fluency, and frequent grammatical errors.
V (0–20)	Completely off-topic from the material, incoherent writing, and total word count less than 400 words.

Appendix B. More Details of Coding Scheme

Appendix B.1. Argument Component

Based on Toulmin’s argument model, this study integrates factors such as the importance of viewpoints and the sources and attributes of evidence to establish a classification system for argument components, comprising four coarse-grained and 10 fine-grained categories. The specific categorizations, definitions, and examples are detailed in Table A2.

Appendix B.2. Argument Strategy

By deeply integrating argument relations with discourse relations, we propose 14 fine-grained relation types from both vertical and horizontal dimensions, thereby capturing the intricate interplay between argument components for a thorough understanding of argument strategy and structure. Detailed definitions and examples of argument relations in the vertical dimension are presented in Table A3, and discourse relations in the horizontal dimension are presented in Table A4.

Table A2. A list of argument component types, their descriptions and samples [24].

Coarse	Fine	Definition	Example
Assertion	Major Claim	The theme or thesis of an article, i.e., the most significant point that the author aims to convey and argue.	Life needs a sense of ritual because it can counter mediocrity.
	Claim	Supporting ideas or subsidiary claims articulated around the major claim.	In my opinion, life needs a sense of ritual, but not blindly pursued.
	Restated Claim	A restatement or rephrasing of an already stated Major Claim or Claim, for the purpose of emphasis or clarification.	Life needs a sense of ritual, but can not blindly pursue, the continuous pursuit and progress, lively and vivid, this is life.
Evidence	Fact	Specific cases, generalized facts, and reliable historical events, etc.	Regrettably, in today’s society, many have fallen into the trap of exaggerating their sense of ritual to fulfill short-lived material satisfactions and the envy of others, leading to chaos in their personal lives. In pursuit of luxury, they spare no expense, ultimately trading for nothing but emptiness and stress.
	Anecdote	Experiences from oneself or from friends and family.	And on our own part, we may have let our nerves get in the way of our performance in the exam or put ourselves under a lot of unnecessary stress.
	Quotation	Citing others’ writings, research, ideas or theories.	The ground is all sixpence, there is always someone to look up to see the moon.
	Proverb	Sentences or phrases that are widely circulated among the populace, carrying educational value or reflecting social experience.	Without rules, nothing can be accomplished.
	Axiom	Recognized common sense or scientific axioms or laws.	In addition to this, the theoretical knowledge of science has become synonymous with authority in most cases, a simple example, no one would argue that 1 + 1 does not equal 2.
Elaboration	-	Explanation, analysis, or discussion of the assertion or evidence, providing detailed clarification or establishing the connection between arguments.	Life needs to be down-to-earth, but if you always keep your head down to earn that tiny “sixpence”, and forget to look up to appreciate the bright “moon”, just in the mediocrity of the numbness of the self, to become a zombie, what is the meaning of life?
Others	-	None of the above, i.e., non-argument components within argumentative essays.	May the wind guide our path.

Table A3. A list of argument relations in the vertical dimension, their descriptions and samples [16]. Argument component types are indicated in blue, with the argument before and after the → corresponding to the source argument and target argument, respectively. It is noteworthy that multiple argument relations may exist between argument pairs and occur between argument components of different types.

Aspect	Label	Definition	Example
Stance-Based	Positive	A method that directly validates the correctness of a viewpoint by using elaboration or evidence consistent with the viewpoint to support it, emphasizing direct affirmation of the viewpoint.	Quotation: Nietzsche once said, “Every day that you do not dance is a betrayal of life.” → Claim: Exploring the spiritual world is an individual’s journey of self-awareness—a process of exercising subjective initiative to recognize one’s own uniqueness.
	Negative	A method that indirectly proves the correctness of a viewpoint through elaboration or evidence that are contrary to the viewpoint. It emphasizes the negation of opposing viewpoints, thereby achieving the purpose of the argumentation.	Quotation: As Shakespeare said, “Without surprises, life would have no luster.” → Claim: Under a certain sense of ceremony, people can become more passionate about life, helping them cherish the moment and look forward to the future.
	Comparative	Short for positive-and-negative argumentation, this is an argumentative approach that contrasts and compares two items to highlight their differences, thereby making the conclusion more evident and persuasive.	Fact: Take the recent marathon as an example: many contestants did not finish the race, some even quitting midway. This occurred because one runner started accelerating early on, prompting others not to fall behind, a manifestation of tension. Conversely, those who maintained their composure and were undisturbed ended up securing better positions, illustrating the benefits brought by a sense of relaxation. → Claim: In real life, we need a sense of relaxation more than tension.
Evidence-Based	Example	An argumentation method that proves a thesis through concrete or typical examples.	Fact: The flourishing Tang Dynasty, despite its grandeur, is reduced to fleeting pages in historical records. Without ritualistic significance and the poetic brilliance of Li Bai, Du Fu, and others, how could we today appreciate the splendor of ancient Chang’an or comprehend the complex emotions embedded in phrases like “returning to Chang’an as one’s homeland”? → Claim: Ritualistic significance adds brilliance to mundane life, liberating individuals from mediocrity in that moment and infusing dull emotions with romantic yearning for beauty.
	Citation	An argumentation method that proves a thesis by using quotations or axioms.	Quotation: Nietzsche once said, “Every day that you do not dance is a betrayal of life.” → Claim: Exploring the spiritual world is an individual’s journey of self-awareness—a process of exercising subjective initiative to recognize one’s own uniqueness.
Discourse-Based	Metaphorical	By employing metaphorical rhetoric, familiar things are used as metaphors to argue the correctness of a viewpoint. In drawing parallels between two items with similar characteristics, the artful use of metaphors often serves to better elucidate concepts, making the argument more vivid and interesting.	Elaboration: If understanding objects is likened to baking a cake, then the method of comprehension is the mold. Those who only heed the words of authoritative experts apply others’ molds; thus, no matter how sweet the resulting cake is, it will not be in a shape that suits them. → Claim: A deep-rooted reliance on authoritative experts also reflects a more profound issue—a lack of fundamental methods for understanding things oneself.
	Hypothetical	Analyzing evidence from the opposite side based on hypothesis to infer its authenticity and reliability, thus robustly supporting a thesis.	Fact: The grandeur and brilliance of the Tang Dynasty, though but a fleeting mention in the annals of history, would be lost to us without the ceremonial gravitas and the exquisite verses of poets like Li Bai and Du Fu. How else could we, in the present day, glimpse the golden splendor of ancient Chang’an or grasp the myriad emotions encapsulated in the phrase “Returning to Chang’an, my homeland” ? → Claim: Ceremony adds a luster to the mundane, lifting those numbed by the monotony of daily life out of their mediocrity, infusing their arid emotions with a romantic yearning for the beautiful.
	Restatement	For an argument of the restated claim type, its relation with the target argument (major claim or claim) is defined as a restatement relation.	Restated Claim: Rituals are never unnecessary or superfluous. → Major Claim: In life, rituals are just so indispensable.
	Detail	When an argument (elaboration type) primarily aims to further explain or analyze other content, it establishes a detail relation with the corresponding argument (assertion or evidence type).	Elaboration: Nietzsche’s words actually tell us to know thyself and become thyself, which all but maps out the exploration of the spiritual world of self. → Quotation: Nietzsche once said, “Every day that you don’t dance is a failure of life.”
	Background	When an argument (elaboration type) primarily serves the function of introducing background, it constructs a background relation with the corresponding argument (assertion or evidence type).	Elaboration: It’s just that is such a mode of exploration really beneficial to people’s perceptions? → Claim: This process of transformation essentially reflects the expansion of instrumental rationality and people’s active abandonment of “thinking”.

Table A4. A list of discourse relations in the horizontal dimension, their descriptions and samples [16]. Argument component types are indicated in blue, with the argument before and after the → corresponding solely to the order in which the two arguments appear in the essay. It is noteworthy that the discourse relation between argument pairs is singular and occurs between argument components of the same type.

Label	Definition	Example
Coherence	Describing several aspects of the same event, related events, or contrasting situations that coexist, co-occur, or oppose in meaning. These aspects can be reordered without altering the overall significance.	Fact: The idea of a commonwealth of nations, as proposed by Confucius, is also what we aspire to nowadays. → Fact: Another example is Wang Mang’s seizure of power and his promulgation of a series of new measures, which were denied at the time, but in fact he referred to Western countries for these initiatives.
Progression	The subsequent argument represents an advance in scope or meaning over the preceding one, intended to emphasize a deepening, expansion, or reinforcement of logic, and the order of the arguments is usually non-interchangeable.	Claim: However, the negative impacts caused by the pursuit of rituals are not few. → Claim: Only by getting rid of the solidified idea that a sense of ritual is necessary in life can they focus on the abundance of the spiritual world and climb higher.
Contrast	Comparison and selection are made by examining the similarities or differences between two or more things, situations, or viewpoints, emphasizing the contrast between them.	Fact: We all know that Wei Liangfu improved the Kunqu opera, leaving brilliant cultural treasures for future generations, we all know that Yuan Longping broke through a technical barrier to solve the food problem in many areas, they are not precisely in the ancients and the authority of the forefathers under the influence of their own chapter? → Fact: There are great men, naturally, there are also small people, those so-called good learning in fact, “thick ancient and thin” academic molecules, those who listen to the authority of the scientific molecules do not understand the development of adaptability, which one has made achievements?
Concession	An argument posits a certain situation or viewpoint, followed by a shift where the subsequent argument presents an opposing or contrasting perspective, emphasizing the content of the latter argument.	Claim: Therefore, while inheritance is important, breakthroughs and development are also indispensable. → Claim: However, should those ideas and factors that have been tested be recognized in their entirety? No.

Appendix C. More Details of Methods

Appendix C.1. Concept of BIO

The BIO tagging scheme is a standard representation method for sequence labeling, widely used in natural language processing tasks. BIO stands for: B: Begin, indicating the beginning of a tagged sequence; I: Inside, indicating the middle part of a tagged sequence; O: Outside, representing components outside the tagged sequence. Our study defines argument component prediction as a sentence-level sequence labeling task, where B marks the beginning sentence of an argument unit and I marks the sentences that continue it. Combined with component labels, the tags identify both the types and the boundaries of argument components. A detailed example is shown in Figure A1.

Figure A1. BIO tagging example. Each sentence in the essay is labeled sequentially with an #ID. The red annotations indicate the BIO tags for the corresponding sentences. In the example, sentence #9 forms an argument unit of the Claim type, while sentences #10 and #11 together form an argument unit of the Fact type, both focusing on the theme of China’s development journey.
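As an illustration of how such sentence-level BIO tags can be decoded into argument units, the following minimal Python sketch parses an output string in the "#ID: B/I-type" form used in this appendix and groups consecutive sentences into units. The function names parse_bio_output and group_units are hypothetical and are not taken from the authors' implementation; the sample string mirrors the sentence numbers described for Figure A1.

```python
from typing import List, Tuple


def parse_bio_output(text: str) -> List[Tuple[int, str, str]]:
    """Parse '#9: B-Claim, #10: B-Fact' into (sentence_id, tag, label) triples."""
    items = []
    for part in text.split(","):
        part = part.strip()
        if not part.startswith("#"):
            continue  # skip ellipses or other non-tag fragments
        sid, tagged = part[1:].split(":", 1)
        # Split on the first hyphen only, so labels such as "Major Claim" stay whole.
        tag, label = tagged.strip().split("-", 1)
        items.append((int(sid), tag, label))
    return items


def group_units(items: List[Tuple[int, str, str]]) -> List[Tuple[str, List[int]]]:
    """Merge each B sentence with the I sentences of the same type that follow it."""
    units: List[Tuple[str, List[int]]] = []
    for sid, tag, label in items:
        continues_previous = (
            tag == "I"
            and units
            and units[-1][0] == label
            and units[-1][1][-1] == sid - 1
        )
        if continues_previous:
            units[-1][1].append(sid)
        else:
            # A B tag (or an I tag with nothing to continue) opens a new unit.
            units.append((label, [sid]))
    return units


# Hypothetical model output for the three sentences described in Figure A1.
output = "#9: B-Claim, #10: B-Fact, #11: I-Fact"
units = group_units(parse_bio_output(output))
print(units)
```

Treating an I tag that has no matching predecessor as the start of a new unit is a design choice of this sketch; a stricter decoder could instead flag such cases as malformed model output.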

Appendix C.2. STL Prompt Data

Under the STL method, examples of instruction-tuning data construction for the argument component prediction and argument strategy prediction tasks are shown in Table A5 and Table A6, respectively. The Instruction includes detailed prompt designs for each task, comprising three core components: model role designation, expert knowledge introduction, and output constraints. The Input represents argumentative essay data, with each sentence labeled using the #id format to facilitate subsequent target generation. The Output corresponds to the target output sequence for each task.
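The following Python sketch shows one way a single instruction-tuning record of this shape might be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the function name build_stl_record and the dictionary keys are hypothetical (the keys simply mirror the Instruction, Input, and Output columns), the bracketed layout follows the example in Table A5, and the instruction and sentences below are abbreviated placeholders.

```python
from typing import Dict, List


def build_stl_record(
    instruction: str, title: str, sentences: List[str], labels: List[str]
) -> Dict[str, str]:
    """Assemble one instruction/input/output record for component prediction."""
    if len(sentences) != len(labels):
        raise ValueError("each sentence needs exactly one BIO-component label")
    # Prefix every sentence with its #id so the target can refer back to it.
    content = " ".join(f"#{i} {s}" for i, s in enumerate(sentences, start=1))
    target = ", ".join(f"#{i}: {lab}" for i, lab in enumerate(labels, start=1))
    return {
        "instruction": instruction,
        "input": f"[Essay Title]: {title} [Essay Content]: {content}",
        "output": target,
    }


# Placeholder data, used only to show the record layout.
record = build_stl_record(
    instruction="Determine the argument component type for each sentence.",
    title="Demo Title",
    sentences=["First sentence.", "Second sentence."],
    labels=["B-Major Claim", "B-Elaboration"],
)
print(record["input"])
print(record["output"])
```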

Appendix C.3. CoT Prompt Data

Referring to the process of human understanding and annotating argument strategies, CoT adopts a two-stage reasoning mechanism. First, it guides LLMs to identify argument components in essays, and then it analyzes the argument strategies implied among the components. The construction of training data is presented in Table A7.

Appendix C.4. MTL Prompt Data

The MTL method promotes knowledge transfer and enhances the performance of LLMs across various subtasks by jointly learning component and strategy prediction tasks. The construction of training data is presented in Table A8.

Appendix C.5. Concept of ENA and Two-Mode Network

Epistemic Network Analysis (ENA) is a prominent research method in learning analytics, widely used to model relationships in collaboration, learning, and cognitive activities [54]. ENA captures the co-occurrence structure within discourse data to visualize the connections between different elements, helping researchers to understand complex cognitive and interaction processes [44]. In ENA, nodes represent distinct concepts, behaviors, or topics. In our study, nodes represent specific argument components or argument strategies, while edges connecting the nodes indicate their co-occurrence relationships. The thickness of the edges reflects the frequency and strength of co-occurrence, with thicker edges indicating stronger relationships.
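To make the notion of co-occurrence concrete, the Python sketch below counts how often pairs of codes appear together within a moving window over a sequence of argument component labels. It is a simplified illustration only: full ENA additionally normalizes these counts and projects them into a low-dimensional space, which is omitted here. The function name cooccurrence_counts, the window size of 3, and the sample sequence are all assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple


def cooccurrence_counts(codes: List[str], window: int = 3) -> Counter:
    """Count unordered code pairs that co-occur within a moving window."""
    counts: Counter = Counter()
    for end, current in enumerate(codes):
        start = max(0, end - window + 1)
        # Pair the current code with each distinct code earlier in the window.
        for previous in set(codes[start:end]):
            if previous != current:
                pair: Tuple[str, str] = tuple(sorted((previous, current)))
                counts[pair] += 1
    return counts


# Hypothetical sequence of component labels from one essay.
sequence = ["Claim", "Quotation", "Elaboration", "Claim", "Fact"]
counts = cooccurrence_counts(sequence, window=3)
for pair, weight in counts.most_common():
    print(pair, weight)
```

In a network drawing, each count would correspond to the weight (thickness) of the edge between the two nodes in the pair.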

A two-mode network is a specific type of network structure in which nodes can be divided into two disjoint sets, U and V (i.e., two distinct types of nodes), and edges (connections) can only exist between nodes from different sets [55,56]. There are no direct connections between nodes within the same set. In this study, nodes consist of two types: argument components and argument strategies. The two-mode network represents connections between these two types of nodes, with no connections existing between component nodes or strategy nodes. When constructing the network, argument components are connected indirectly via argument strategies, forming a component–strategy–component structure rather than being directly linked to each other.
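The component–strategy–component structure can be sketched in Python as follows. Each annotated relation is assumed to be a (source component, strategy, target component) triple, and every triple contributes one edge from each of its two components to the strategy node, so that no edge ever joins two components or two strategies. The function name two_mode_edges and the sample triples are hypothetical; the triples reuse label combinations that appear in Tables A3 and A4.

```python
from collections import Counter
from typing import Iterable, Tuple

Relation = Tuple[str, str, str]  # (source component, strategy, target component)


def two_mode_edges(relations: Iterable[Relation]) -> Counter:
    """Count (component, strategy) edges of a two-mode network."""
    edges: Counter = Counter()
    for source, strategy, target in relations:
        # Components are linked only through the strategy node between them.
        edges[(source, strategy)] += 1
        edges[(target, strategy)] += 1
    return edges


# Hypothetical annotated relations from one essay.
relations = [
    ("Quotation", "Positive", "Claim"),
    ("Fact", "Example", "Claim"),
    ("Claim", "Progression", "Claim"),
]
edges = two_mode_edges(relations)
for (component, strategy), weight in sorted(edges.items()):
    print(component, "-", strategy, weight)
```

Counting a same-type relation such as Claim → Claim twice on the same edge is a design choice of this sketch; an alternative would be to count each relation once per distinct component type.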

Table A5. Data example for argument component prediction task of the STL method.

Instruction	Input	Output
You are an experienced high school Chinese language teacher. Please analyze the following argumentative essay and determine the argument component type for each sentence. The argument component types include Major Claim, Claim, Restated Claim, Fact, Anecdote, Quotation, Proverb, Axiom, Elaboration, and Others. Multiple consecutive sentences of the same type may form an argument unit. Combine BIO tagging to indicate the boundaries of the corresponding argument units. Please output only the sentence number “#ID” and the corresponding B/I-component type.	[Essay Title]: Seeking Paths Through Flowers, Straight to the Depths of White Clouds [Essay Content]: #1 I believe that in the pursuit of noble aspirations and grandeur, one must pay attention to the sense of ritual, though overindulgence in this can lead to a distorted self-perception, letting clouds obscure one’s vision and covering the present with a veil of emptiness. #2 Rituals give a sense of ceremony, making ordinary actions more refined and precise, undertaken with greater care and seriousness, presenting an air of dignity and extraordinariness to others. #3 This is not inherently a derogatory term, as rituals can bring happiness to those who don’t normally engage in such practices. However, how much of this happiness stems from external validation, and how much is rooted in genuine self-discipline, requires further exploration. #4 Certainly, rituals can bring one short or lasting joy, but Camus once warned, “If noble actions are overly exaggerated, they may ultimately become an indirect yet powerful ode to sin.” ……	#1: B-Major Claim, #2: B-Elaboration, #3: I-Elaboration, #4: B-Quotation, ……

Table A6. Data example for the argument strategy prediction task of the STL method.

Instruction:
You are an experienced high school Chinese language teacher. Please analyze the following argumentative essay and determine the types of relations between the argument components.
The argument component types include Major Claim, Claim, Restated Claim, Fact, Anecdote, Quotation, Proverb, Axiom, Elaboration, and Others. Multiple consecutive sentences of the same type may form an argument unit. The relations between argument components can be categorized as follows: based on whether the content directly supports the assertion or indirectly strengthens it by addressing opposing views, the relations can be divided into “Positive Argumentation”, “Negative Argumentation”, and “Comparative Argumentation”. These commonly occur between Assertion (Major Claim, Claim, and Restated Claim) and Evidence (Fact, Anecdote, Quotation, Proverb, and Axiom), between Claim and Major Claim, and between Elaborations and Assertions. Based on the type of evidence, the relation can be classified into “Example Argumentation” and “Citation Argumentation”, which appear between Evidence and Assertions. Based on the rhetorical methods used in Elaboration components, the relations can be categorized into “Metaphorical Argumentation” and “Hypothetical Argumentation”, which typically occur between Elaborations and Assertions. When the Elaboration further elaborates on the Assertion or Evidence, it forms a “Detail Relation”. When the Elaboration precedes other types of units to provide background information or serve structural purposes, it forms a “Background Relation”. A Restatement component and its corresponding Major Claim or Claim form a “Restatement Relation”. Additionally, it is necessary to identify the logical relation between adjacent claims, including “Coherence Relation”, “Progression Relation”, “Contrast Relation”, and “Concession Relation”. When there is a clear hierarchical logical relation between the units of the same type of argument component, it is also necessary to indicate it. Note: Multiple consecutive sentences of the same type may form an argument unit, and the relations between argument units may involve multiple types.
The input argumentative essay is divided into sentences and numbered. Only output the sentence numbers corresponding to the argument units and the relation types between the argument unit pairs; do not output any extra details.

Output:
#2, #3 → #1: [“Detail Relation”] [SEP] #4 → #1: [“Positive Argumentation”, “Citation Argumentation”] [SEP]
……

Note: the Input entries are identical in content to those in Table A5.
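The relation output in Table A6 follows a regular pattern (source units, an arrow, a target unit, a bracketed list of relation types, and a `[SEP]` delimiter). A minimal Python sketch of a parser for this pattern is given below; the function name `parse_relations`, the dictionary layout, and the tolerance for an ASCII `->` arrow are assumptions for illustration.

```python
import re


def parse_relations(output):
    """Parse '#src → #tgt: [labels] [SEP]' records into dictionaries."""
    records = []
    for chunk in output.split("[SEP]"):
        match = re.search(r"(.+?)\s*(?:→|->)\s*(.+?):\s*\[(.*)\]", chunk)
        if match is None:
            continue  # skip empty or malformed chunks such as a trailing ellipsis
        source, target, labels = match.groups()
        records.append({
            "source": [int(n) for n in re.findall(r"#(\d+)", source)],
            "target": [int(n) for n in re.findall(r"#(\d+)", target)],
            # Strip straight or curly quotation marks around each label.
            "relations": [lab.strip(" “”\"'") for lab in labels.split(",") if lab.strip()],
        })
    return records


example = (
    "#2, #3 → #1: [“Detail Relation”] [SEP] "
    "#4 → #1: [“Positive Argumentation”, “Citation Argumentation”] [SEP] ……"
)
records = parse_relations(example)
```

A single unit pair may carry several relation types, as in the second record, which is why `relations` is kept as a list.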

Table A7. Data example of the CoT method.

Instruction:
You are an experienced high school Chinese language teacher. Please analyze the following argumentative essay, identify its argument components, and determine the types of relations between the argument components.
The argument components consist of 4 coarse-grained and 10 fine-grained types (i.e., Assertion: major claim, claim and restated claim; Evidence: fact, anecdote, quotation, proverb, and axiom; Elaboration; and Others). The relations between argument components can be categorized as follows: based on whether the content directly supports the assertion or indirectly strengthens it by addressing opposing views, the relations can be divided into “Positive Argumentation”, “Negative Argumentation”, and “Comparative Argumentation”. These commonly occur between Assertion (Major Claim, Claim, and Restated Claim) and Evidence (Fact, Anecdote, Quotation, Proverb, and Axiom), between Claim and Major Claim, and between Elaborations and Assertions. Based on the type of evidence, the relation can be classified into “Example Argumentation” and “Citation Argumentation”, which appear between Evidence and Assertions. Based on the rhetorical methods used in Elaboration components, the relations can be categorized into “Metaphorical Argumentation” and “Hypothetical Argumentation”, which typically occur between Elaborations and Assertions. When the Elaboration further elaborates on the Assertion or Evidence, it forms a “Detail Relation”. When the Elaboration precedes other types of units to provide background information or serve structural purposes, it forms a “Background Relation”. A Restatement component and its corresponding Major Claim or Claim form a “Restatement Relation”. Additionally, it is necessary to identify the logical relation between adjacent claims, including “Coherence Relation”, “Progression Relation”, “Contrast Relation”, and “Concession Relation”. When there is a clear hierarchical logical relation between the units of the same type of argument component, it is also necessary to indicate it. Note: Multiple consecutive sentences of the same type may form an argument unit, and the relations between argument units may involve multiple types.
Please follow the two steps below to complete the argumentative essay analysis: Step 1: Argument component detection and classification. Use BIO tagging to indicate the boundaries of the corresponding argument units. Please output only the sentence number “#ID” and the corresponding B/I-component type. Step 2: Relation detection and classification. Identify and classify the relations between the argument components. The input argumentative essay is divided into sentences and numbered. Only output the sentence numbers corresponding to the argument units and the relation types between the argument unit pairs; do not output any extra details.

Output:
Step 1: Argument component detection and classification Result:
#1: B-Major Claim, #2: B-Elaboration, #3: I-Elaboration, #4: B-Quotation, ……
Step 2: Relation detection and classification Result:
#2, #3 → #1: [“Detail Relation”] [SEP] #4 → #1: [“Positive Argumentation”, “Citation Argumentation”] [SEP] ……

Note: the Input entries are identical in content to those in Table A5.
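Since the CoT response in Table A7 concatenates both steps, a downstream evaluation would need to separate them before scoring each task. The Python sketch below shows one way to do this under the assumption that the response keeps the "Step 1: … Result:" and "Step 2: … Result:" headers exactly as in the table; the function name `split_cot_output` is hypothetical.

```python
import re


def split_cot_output(text):
    """Split a two-step response into its Step 1 and Step 2 result strings."""
    match = re.search(
        r"Step 1:.*?Result:(.*?)Step 2:.*?Result:(.*)", text, re.DOTALL
    )
    if match is None:
        raise ValueError("response does not follow the two-step format")
    return match.group(1).strip(), match.group(2).strip()


response = (
    "Step 1: Argument component detection and classification Result:\n"
    "#1: B-Major Claim, #2: B-Elaboration, #3: I-Elaboration, #4: B-Quotation\n"
    "Step 2: Relation detection and classification Result:\n"
    "#2, #3 → #1: [“Detail Relation”] [SEP]"
)
component_part, relation_part = split_cot_output(response)
```

Raising an error on a malformed response, rather than returning empty strings, is a design choice made here so that format violations are not silently scored as empty predictions.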

Table A8. Data example of the MTL method.

Instruction (Task 1):
You are an experienced high school Chinese language teacher. Please analyze the following argumentative essay and determine the argument component type for each sentence.
Task 1: Argument Component Prediction. The argument component types include Major Claim, Claim, Restated Claim, Fact, Anecdote, Quotation, Proverb, Axiom, Elaboration, and Others. Multiple consecutive sentences of the same type may form an argument unit.
Use BIO tagging to indicate the boundaries of the corresponding argument units. Please output only the sentence number “#ID” and the corresponding B/I-component type.

Output (Task 1):
Task 1: Argument Component Prediction Results.
#1: B-Major Claim,
#2: B-Elaboration,
#3: I-Elaboration,
#4: B-Quotation,
……

Instruction (Task 2):
You are an experienced high school Chinese language teacher. Please analyze the following argumentative essay and determine the types of relations between the argument components.
Task 2: Argument Strategy Prediction. The argument component types include Major Claim, Claim, Restated Claim, Fact, Anecdote, Quotation, Proverb, Axiom, Elaboration, and Others. Multiple consecutive sentences of the same type may form an argument unit. The relations between argument components can be categorized as follows: based on whether the content directly supports the assertion or indirectly strengthens it by addressing opposing views, the relations can be divided into “Positive Argumentation”, “Negative Argumentation”, and “Comparative Argumentation”. These commonly occur between Assertion (Major Claim, Claim, and Restated Claim) and Evidence (Fact, Anecdote, Quotation, Proverb, and Axiom), between Claim and Major Claim, and between Elaborations and Assertions. Based on the type of evidence, the relation can be classified into “Example Argumentation” and “Citation Argumentation”, which appear between Evidence and Assertions. Based on the rhetorical methods used in Elaboration components, the relations can be categorized into “Metaphorical Argumentation” and “Hypothetical Argumentation”, which typically occur between Elaborations and Assertions. When the Elaboration further elaborates on the Assertion or Evidence, it forms a “Detail Relation”. When the Elaboration precedes other types of units to provide background information or serve structural purposes, it forms a “Background Relation”. A Restatement component and its corresponding Major Claim or Claim form a “Restatement Relation”. Additionally, it is necessary to identify the logical relation between adjacent claims, including “Coherence Relation”, “Progression Relation”, “Contrast Relation”, and “Concession Relation”. When there is a clear hierarchical logical relation between the units of the same type of argument component, it is also necessary to indicate it. Note: Multiple consecutive sentences of the same type may form an argument unit, and the relations between argument units may involve multiple types.
The input argumentative essay is divided into sentences and numbered. Only output the sentence numbers corresponding to the argument units and the relation types between the argument unit pairs; do not output any extra details.

Output (Task 2):
Task 2: Argument Strategy Prediction Results.
#2, #3 → #1: [“Detail Relation”] [SEP]
#4 → #1: [“Positive Argumentation”, “Citation Argumentation”] [SEP]
……

Note: the Input entries are identical in content to those in Table A5.
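Table A8 pairs the same essay with two task-specific instructions. As a rough Python sketch of how such paired records could be assembled, the snippet below emits one record per task with a shared input. The function name `build_mtl_records`, the `task` field, and the abbreviated placeholder strings are assumptions; how the two tasks are mixed or batched during fine-tuning is not specified in this appendix.

```python
def build_mtl_records(essay, component_instruction, component_output,
                      strategy_instruction, strategy_output):
    """Return one record per task; both records share the same essay input."""
    return [
        {"task": "component", "instruction": component_instruction,
         "input": essay, "output": component_output},
        {"task": "strategy", "instruction": strategy_instruction,
         "input": essay, "output": strategy_output},
    ]


# Abbreviated placeholder strings; the full texts are those shown in Table A8.
records = build_mtl_records(
    essay="[Essay Title]: ... [Essay Content]: #1 ... #2 ...",
    component_instruction="Task 1: Argument Component Prediction. ...",
    component_output="#1: B-Major Claim, #2: B-Elaboration",
    strategy_instruction="Task 2: Argument Strategy Prediction. ...",
    strategy_output="#2, #3 → #1: [“Detail Relation”] [SEP]",
)
```

Keeping the Instruction, Input, and Output fields mirrors the column layout of Table A5, so the same record shape could serve the STL and MTL settings.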

References

1. Zheng, X.L.; Huang, J.; Xia, X.H.; Hwang, G.J.; Tu, Y.F.; Huang, Y.P.; Wang, F. Effects of online whiteboard-based collaborative argumentation scaffolds on group-level cognitive regulations, written argument skills and regulation patterns. Comput. Educ. 2023, 207, 104920.
2. Turós, M.; Kenyeres, A.Z.; Balla, G.; Gazdag, E.; Szabó, E.; Szűts, Z. A Toulmin model analysis of student argumentation on artificial intelligence. Educ. Sci. 2025, 15, 1226.
3. Thomas, D.P. Structuring written arguments in primary and secondary school: A systemic functional linguistics perspective. Linguist. Educ. 2022, 72, 101120.
4. Liu, M.; Zhang, L.J.; Biebricher, C. Investigating students’ cognitive processes in generative AI-assisted digital multimodal composing and traditional writing. Comput. Educ. 2024, 211, 104977.
5. Wu, T.T.; Silitonga, L.M.; Murti, A.T. Enhancing English writing and higher-order thinking skills through computational thinking. Comput. Educ. 2024, 213, 105012.
6. Anderson, R.C.; Chaparro, E.A.; Smolkowski, K.; Cameron, R. Visual thinking and argumentative writing: A social-cognitive pairing for student writing development. Assess. Writ. 2023, 55, 100694.
7. Ulfa, S.M.; Purwati, O. Argumentative Essay Patterns Produced by University Students. J. Engl. Educ. Teach. 2023, 7, 595–612.
8. Morris, C.; Deehan, J.; MacDonald, A. Written argumentation research in English and science: A scoping review. Cogent Educ. 2024, 11, 2356983.
9. Lin, Y.R.; Fan, B.; Xie, K. The influence of a web-based learning environment on low achievers’ science argumentation. Comput. Educ. 2020, 151, 103860.
10. Latifi, S.; Noroozi, O.; Hatami, J.; Biemans, H.J. How does online peer feedback improve argumentative essay writing and learning? Innov. Educ. Teach. Int. 2021, 58, 195–206.
11. Zhang, R.; Zou, D.; Cheng, G. Chatbot-based training on logical fallacy in EFL argumentative writing. Innov. Lang. Learn. Teach. 2023, 17, 932–945.
12. Li, X.; Jiang, S.; Hu, Y.; Feng, X.; Chen, W.; Ouyang, F. Investigating the impact of structured knowledge feedback on collaborative academic writing. Educ. Inf. Technol. 2024, 29, 19005–19033.
13. Schaller, N.J.; Horbach, A.; Höft, L.I.; Ding, Y.; Bahr, J.L.; Meyer, J.; Jansen, T. DARIUS: A Comprehensive Learner Corpus for Argument Mining in German-Language Essays. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 4356–4367.
14. Xiao, C.; Ma, W.; Song, Q.; Xu, S.X.; Zhang, K.; Wang, Y.; Fu, Q. Human-AI collaborative essay scoring: A dual-process framework with LLMs. In Proceedings of the 15th International Learning Analytics and Knowledge Conference, Dublin, Ireland, 3–7 March 2025; pp. 293–305.
15. Song, W.; Song, Z.; Liu, L.; Fu, R. Hierarchical multi-task learning for organization evaluation of argumentative student essays. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Yokohama, Japan, 7–15 January 2021; pp. 3875–3881.
16. Ren, Y.; Zhou, X.; Zhang, N.; Zhao, S.; Lan, M.; Bai, X. Towards Comprehensive Argument Analysis in Education: Dataset, Tasks, and Method. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Che, W., Nabende, J., Shutova, E., Pilehvar, M.T., Eds.; Association for Computational Linguistics: Vienna, Austria, 2025; pp. 14215–14231.
17. Özalp, D. Preservice Teachers Learn to Engage in Argument from Evidence through the Science Writing Heuristic. Int. J. Sci. Math. Educ. 2025, 23, 949–986.
18. Majidi, A.E.; Graaff, R.D.; Janssen, D. Debate pedagogy as a conducive environment for L2 argumentative essay writing. Lang. Teach. Res. 2023, 13621688231156998.
19. Iqbal, S.; Rakovic, M.; Chen, G.; Li, T.; Bajaj, J.; Mello, R.F.; Fan, Y.; Aljohani, N.R.; Gasevic, D. Towards Improving Rhetorical Categories Classification and Unveiling Sequential Patterns in Students’ Writing. In Proceedings of the 14th Learning Analytics and Knowledge Conference, Kyoto, Japan, 18–22 March 2024; pp. 656–666.
20. Yang, G.; Zheng, X.Q.; Li, Q.; Han, M.; Tu, Y.F. An empirical study on how cognitive diagnostic feedback affects primary school pupils’ learning of Chinese writing. Interact. Learn. Environ. 2024, 32, 2758–2775.
21. Guo, J.; Cheng, L.; Zhang, W.; Kok, S.; Li, X.; Bing, L. AQE: Argument Quadruplet Extraction via a Quad-Tagging Augmented Generative Approach. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 932–946.
22. Chuang, P.L.; Yan, X. An investigation of the relationship between argument structure and essay quality in assessed writing. J. Second Lang. Writ. 2022, 56, 100892.
23. Auliya, P.K.; Amrullah, Q.L. Analyzing the flow of ideas in university students’ cause and effect essay. Innov. Res. J. 2024, 5, 1–9.
24. Ren, Y.; Wu, H.; Long, Z.; Zhao, S.; Zhou, X.; Yin, Z.; Zhuang, X.; Bai, X.; Lan, M. CEAMC: Corpus and Empirical Study of Argument Analysis in Education via LLMs. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 6949–6966.
25. Lawrence, J.; Reed, C. Argument mining: A survey. Comput. Linguist. 2019, 45, 765–818.
26. Farsani, M.A.; Stapleton, P.; Jamali, H.R. Charting L2 argumentative writing: A systematic review. J. Second Lang. Writ. 2025, 68, 101208.
27. Ervas, F.; Mosca, O. An experimental study on the evaluation of metaphorical ad hominem arguments. Informal Log. 2024, 44, 249–277.
28. Stahl, M.; Michel, N.; Kilsbach, S.; Schmidtke, J.; Rezat, S.; Wachsmuth, H. A School Student Essay Corpus for Analyzing Interactions of Argumentative Structure and Quality. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 2661–2674.
29. Iqbal, S.; Rakovic, M.; Chen, G.; Li, T.; Ferreira Mello, R.; Fan, Y.; Fiorentino, G.; Radi Aljohani, N.; Gasevic, D. Towards automated analysis of rhetorical categories in students essay writings using Bloom’s taxonomy. In Proceedings of the LAK23: 13th International Learning Analytics and Knowledge Conference, Arlington, TX, USA, 13–17 March 2023; pp. 418–429.
30. Ferreira Mello, R.; Fiorentino, G.; Miranda, P.; Oliveira, H.; Raković, M.; Gašević, D. Towards automatic content analysis of rhetorical structure in Brazilian college entrance essays. In Proceedings of the International Conference on Artificial Intelligence in Education, Online, 6–10 June 2021; Springer: Cham, Switzerland, 2021; pp. 162–167.
31. Ferreira Mello, R.; Fiorentino, G.; Oliveira, H.; Miranda, P.; Rakovic, M.; Gasevic, D. Towards automated content analysis of rhetorical structure of written essays using sequential content-independent features in Portuguese. In Proceedings of the LAK22: 12th International Learning Analytics and Knowledge Conference, Online, 21–25 March 2022; pp. 404–414.
32. Oliveira, H.; Ferreira Mello, R.; Barreiros Rosa, B.A.; Rakovic, M.; Miranda, P.; Cordeiro, T.; Isotani, S.; Bittencourt, I.; Gasevic, D. Towards explainable prediction of essay cohesion in Portuguese and English. In Proceedings of the LAK23: 13th International Learning Analytics and Knowledge Conference, Arlington, TX, USA, 13–17 March 2023; pp. 509–519.
33. Chen, G.; Cheng, L.; Tuan, L.A.; Bing, L. Exploring the Potential of Large Language Models in Computational Argumentation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Bangkok, Thailand, 2024; pp. 2309–2330.
34. Shi, L.; Giunchiglia, F.; Luo, R.; Shi, D.; Song, R.; Diao, X.; Xu, H. An empirical study of LLMs via in-context learning for stance classification. Inf. Process. Manag. 2026, 63, 104322.
35. Gorur, D.; Rago, A.; Toni, F. Can Large Language Models perform Relation-based Argument Mining? In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 8518–8534.
36. Cabessa, J.; Hernault, H.; Mushtaq, U. Argument mining with fine-tuned large language models. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 6624–6635.
37. Meyer, J.; Jansen, T.; Schiller, R.; Liebenow, L.W.; Steinbach, M.; Horbach, A.; Fleckenstein, J. Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions. Comput. Educ. Artif. Intell. 2024, 6, 100199.
38. Mansour, W.A.; Albatarni, S.; Eltanbouly, S.; Elsayed, T. Can large language models automatically score proficiency of written essays? In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 2777–2786.
39. Lin, C.C.; Huang, A.Y.; Lu, O.H. Artificial intelligence in intelligent tutoring systems toward sustainable education: A systematic review. Smart Learn. Environ. 2023, 10, 41.
40. Liu, X. The Difference Between Chinese and American Secondary Education Curriculum System. In Proceedings of the 2025 International Conference on Mental Growth and Human Resilience (MGHR 2025); Atlantis Press: Fukui, Japan, 2025; pp. 871–878.
41. Lu, Y. Comparative analysis of teaching methods: A cross-cultural study of Chinese and American educational systems. Trans. Soc. Sci. Educ. Humanit. Res. 2024, 4, 59–64.
42. Jurišević, N.; Nikolić, N.; Nemś, A.; Gordić, D.; Rakić, N.; Končalović, D.; Kocsis, D. Bridging LLMs, Education, and Sustainability: Guiding Students in Local Community Initiatives. Sustainability 2025, 17, 10148.
43. Park, B.; Seo, K. Assessing critical thinking through a multi-agent LLM-based debate chatbot. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, Yokohama, Japan, 26 April–1 May 2025; pp. 1–13.
44. Reid, J.W.; Parrish, J.; Syed, S.B.; Couch, B. Finding the connections: A scoping review of epistemic network analysis in science education. J. Sci. Educ. Technol. 2025, 34, 937–955.
45. Singh, S.S.; Muhuri, S.; Mishra, S.; Srivastava, D.; Shakya, H.K.; Kumar, N. Social Network Analysis: A Survey on Process, Tools, and Application. ACM Comput. Surv. 2024, 56, 1–39.
46. Mann, W.C.; Thompson, S.A. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdiscip. J. Study Discourse 1988, 8, 243–281.
47. Walton, D. Using argumentation schemes to find motives and intentions of a rational agent. Argum. Comput. 2020, 10, 233–275.
48. Kennard, N.N.; O’Gorman, T.; Das, R.; Sharma, A.; Bagchi, C.; Clinton, M.; Yelugam, P.K.; Zamani, H.; McCallum, A. DISAPERE: A dataset for discourse structure in peer review discussions. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 1234–1249.
49. Cheng, L.; Bing, L.; He, R.; Yu, Q.; Zhang, Y.; Si, L. IAM: A comprehensive and large-scale dataset for integrated argument mining tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 2277–2287.
50. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations, Virtual Conference, 25–29 April 2022.
51. Bahri, Y.; Dyer, E.; Kaplan, J.; Lee, J.; Sharma, U. Explaining neural scaling laws. Proc. Natl. Acad. Sci. USA 2024, 121, e2311878121.
52. Ding, Y.; Kashefi, O.; Somasundaran, S.; Horbach, A. When argumentation meets cohesion: Enhancing automatic feedback in student writing. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 17513–17524.
53. Chen, S.; Zhang, Y.; Yang, Q. Multi-task learning in natural language processing: An overview. ACM Comput. Surv. 2024, 56, 1–32.
54. Elmoazen, R.; Saqr, M.; Tedre, M.; Hirsto, L. A systematic literature review of empirical research on epistemic network analysis in education. IEEE Access 2022, 10, 17330–17348.
55. Wu, Y.; Lan, W.; Fan, X.; Fang, K. Bipartite network influence analysis of a two-mode network. J. Econom. 2024, 239, 105562.
56. Zhang, Y.; Liu, Y.; Yu, R.; Zuo, J.; Dong, N. Managing the high capital cost of prefabricated construction through stakeholder collaboration: A two-mode network analysis. Eng. Constr. Archit. Manag. 2025, 32, 556–577.

Figure 1. Research design. STL denotes single-task learning, CoT denotes the chain-of-thought prompting technique, and MTL denotes multi-task learning.

Figure 2. Distribution of essay scores in the research data [16].

Figure 3. Annotation example (excerpt). Argument component types are marked in red, the blue arrows on the right denote vertical argument relations, and the green arrow on the left represents horizontal logical relations. The text above each arrow corresponds to its specific relation type. Paragraphs are separated by boxes, and line breaks indicate distinct argument units.

Figure 4. Distribution of argument components [24]. Dashed lines separate different coarse-grained argument component types.

Figure 5. Distribution of argument strategies [16]. The top shows horizontal discourse relations; the bottom three groups show vertical argument relations from different aspects.

Figure 6. Statistics of experimental data partitioning [16]. AC denotes argument component. A single argument component may span multiple consecutive sentences, and a pair of argument components can exhibit multiple relation types.

Figure 7. Performance of the Qwen3 series models (0.6B, 1.7B, 4B, and 8B) on the argument component prediction and argument strategy prediction tasks.

Figure 8. ENA results of (a) argument components and (b) argument strategies in high-quality essays (red) and low-quality essays (blue).

Figure 9. Two-mode networks connecting argument components (yellow) and strategies (green) in high-quality (red) and low-quality (blue) essays.

Table 1. Results of the argument component prediction task.

Model	Method	Micro-Precision	Micro-Recall	Micro-$F_{1}$	Macro-$F_{1}$
DeepSeek	Zero-shot	$0.0451 \pm 0.0132$	$0.0580 \pm 0.0166$	$0.0507 \pm 0.0147$	$0.0378 \pm 0.0151$
	STL	$0.5349 \pm 0.0084$	$0.5130 \pm 0.0095$	$0.5237 \pm 0.0053$	$0.4217 \pm 0.0179$
	CoT	$0.5303 \pm 0.0030$	$0.5246 \pm 0.0071$	$0.5274 \pm 0.0045$	$\mathbf{0.4565 \pm 0.0317}$
	MTL	$\mathbf{0.5665 \pm 0.0048}$	$\mathbf{0.5594 \pm 0.0095}$	$\mathbf{0.5629 \pm 0.0072}$	$0.4202 \pm 0.0223$
Qwen	Zero-shot	$0.1890 \pm 0.0052$	$0.2039 \pm 0.0072$	$0.1962 \pm 0.0061$	$0.1275 \pm 0.0060$
	STL	$0.5646 \pm 0.0071$	$0.5739 \pm 0.0085$	$0.5692 \pm 0.0076$	$0.4567 \pm 0.0353$
	CoT	$0.5715 \pm 0.0107$	$0.5749 \pm 0.0158$	$0.5732 \pm 0.0127$	$0.4373 \pm 0.0095$
	MTL	$\mathbf{0.5857 \pm 0.0100}$	$\mathbf{0.5991 \pm 0.0178}$	$\mathbf{0.5923 \pm 0.0137}$	$\mathbf{0.4898 \pm 0.0050}$
ChatGLM	Zero-shot	$0.2016 \pm 0.0104$	$0.2367 \pm 0.0145$	$0.2177 \pm 0.0121$	$0.1294 \pm 0.0194$
	STL	$0.5843 \pm 0.0137$	$0.5961 \pm 0.0130$	$0.5902 \pm 0.0132$	$0.5078 \pm 0.0286$
	CoT	$0.5872 \pm 0.0107$	$0.5923 \pm 0.0117$	$0.5897 \pm 0.0106$	$\underline{\mathbf{0.5227 \pm 0.0392}}$
	MTL	$\underline{\mathbf{0.5935 \pm 0.0161}}$	$\underline{\mathbf{0.6029 \pm 0.0247}}$	$\underline{\mathbf{0.5981 \pm 0.0203}}$	$0.5141 \pm 0.0328$

Note: Bold values indicate the best performance for each model across different methods. Underlined values indicate the overall best results across all models and methods.
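As a quick sanity check on how the columns of Table 1 relate, the Python sketch below recomputes a micro-$F_{1}$ value from the reported micro-precision and micro-recall, and shows macro-$F_{1}$ as an unweighted mean of per-class scores. The function names are illustrative, and the per-class values passed to `macro_f1` in any use would have to come from the evaluation itself since they are not reported here. Because the table reports averaged values with a ± spread, the $F_{1}$ of the averaged precision and recall is only expected to match the reported $F_{1}$ approximately.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted average of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Mean micro-precision and micro-recall for DeepSeek with STL, taken from Table 1.
deepseek_stl = f1(0.5349, 0.5130)
```

Here `deepseek_stl` rounds to 0.5237, in line with the reported micro-$F_{1}$ for that row. Macro-$F_{1}$ weights rare component types as heavily as frequent ones, which is one reason it can rank methods differently from micro-$F_{1}$ in the table.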

Table 2. Results of the argument strategy prediction task.

Model	Method	Micro-Precision	Micro-Recall	Chunk-$F_{1}$	Sentence-$F_{1}$
DeepSeek	Zero-shot	$0.0195 \pm 0.0014$	$0.0164 \pm 0.0000$	$0.0178 \pm 0.0006$	$0.0248 \pm 0.0031$
	STL	$0.1923 \pm 0.0053$	$0.1951 \pm 0.0017$	$0.1936 \pm 0.0028$	$0.2373 \pm 0.0136$
	CoT	$\mathbf{0.2384 \pm 0.0266}$	$\mathbf{0.2293 \pm 0.0257}$	$\mathbf{0.2338 \pm 0.0261}$	$\mathbf{0.2855 \pm 0.0166}$
	MTL	$0.2113 \pm 0.0112$	$0.2012 \pm 0.0093$	$0.2061 \pm 0.0102$	$0.2559 \pm 0.0061$
Qwen	Zero-shot	$0.0390 \pm 0.0096$	$0.0349 \pm 0.0077$	$0.0368 \pm 0.0085$	$0.0496 \pm 0.0061$
	STL	$0.2417 \pm 0.0097$	$0.2389 \pm 0.0068$	$0.2402 \pm 0.0079$	$0.2848 \pm 0.0157$
	CoT	$\mathbf{0.3010 \pm 0.0190}$	$\mathbf{0.2957 \pm 0.0226}$	$\mathbf{0.2983 \pm 0.0208}$	$\mathbf{0.3309 \pm 0.0189}$
	MTL	$0.2926 \pm 0.0075$	$0.2943 \pm 0.0130$	$0.2934 \pm 0.0103$	$0.3281 \pm 0.0070$
ChatGLM	Zero-shot	$0.0389 \pm 0.0016$	$0.0472 \pm 0.0060$	$0.0426 \pm 0.0034$	$0.0507 \pm 0.0053$
	STL	$0.2721 \pm 0.0088$	$0.2621 \pm 0.0048$	$0.2670 \pm 0.0067$	$0.2949 \pm 0.0106$
	CoT	$\underline{\mathbf{0.3166 \pm 0.0140}}$	$\underline{\mathbf{0.3101 \pm 0.0133}}$	$\underline{\mathbf{0.3133 \pm 0.0136}}$	$\underline{\mathbf{0.3445 \pm 0.0288}}$
	MTL	$0.2683 \pm 0.0109$	$0.2793 \pm 0.0143$	$0.2734 \pm 0.0089$	$0.3128 \pm 0.0129$

Note: Bold values indicate the best performance for each model across different methods. Underlined values indicate the overall best results across all models and methods.

Table 3. Prediction results of the Qwen3-4B model under different methods, using the essay shown in Figure 3 as an example.

Gold	STL	CoT	MTL
Argument Component Prediction
#1 B-Elaboration	#1 B-Quotation	✓	✓
#2 I-Elaboration	#2 I-Quotation	✓	✓
#3 B-Claim	✓	✓	✓
#10 B-Claim	#10 B-Major Claim	✓	✓
#13 B-Fact	✓	✓	✓
#14 B-Elaboration	✓	✓	✓
#15 I-Elaboration	✓	✓	✓
#16 B-Claim	✓	✓	✓
Argument Strategy Prediction
#1, #2 -> #3: ['Background']	✓	✓	✓
#14, #15 -> #13: ['Detail']	#15 -> #14: ['Detail']	#14, #15 -> #16: ['Background']	✓
#3 -> #10: ['Coherence']	None	✓	#3 -> #10: ['Progression']
#10 -> #16: ['Progression']	None	#10 -> #16: ['Coherence']	✓

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

