1. Introduction
Recent advances in computing platforms and artificial intelligence have enabled the deployment of intelligent systems that emphasize real-time interaction, adaptability, and user-centered design. Human–computer interaction (HCI) has emerged as a crucial element in developing user-centric writing assessment systems [1]. With the increasing integration of intelligent systems into education, HCI-enabled smart learning environments are now expected not only to reduce cognitive load but also to foster metacognitive awareness and support formative assessment practices that enhance learning effectiveness [2]. Unlike traditional static evaluations, HCI-enabled systems provide dynamic, personalized feedback, enhancing the learning process and student engagement [3]. In the domain of academic writing, especially argumentative essays, such systems help students improve their arguments through interactive feedback [4]. This iterative refinement is particularly valuable in educational settings where developing logical reasoning and persuasive communication is a core learning objective.
Toulmin’s model [5] provides a structured and interpretable framework that is well suited for computational modeling in HCI-enabled writing assessment. By breaking down arguments into six distinct components, namely claim, data, counter claim, counter data, rebuttal, and rebuttal data, the model enables intelligent systems to assess not only the presence of argumentation but also its quality, structure, and depth. This granularity aligns well with the requirements of interactive writing platforms, which aim to provide detailed feedback and personalized guidance to users in real time. Furthermore, the universality of Toulmin’s framework enables its adaptation to diverse linguistic contexts and educational objectives, making it a practical foundation for intelligent feedback mechanisms in HCI-based writing environments. Its alignment with established pedagogical standards, such as those in critical thinking curricula and EFL writing instruction, further strengthens its relevance for smart learning applications. The system design adheres to Sweller’s Cognitive Load Theory [6] and Anderson’s Adaptive Learning Framework [7]. HCI-enabled systems leverage interactive interfaces and adaptive algorithms to support dynamic user engagement and iterative refinement, thereby improving both efficiency and usability.
The six elements of the Toulmin model represent the most comprehensive classification of argumentative elements, providing clear standards for evaluating the effectiveness, depth, and rationality of arguments, and serving as a key tool for measuring the logical rigor and persuasiveness of argumentative essays. The Toulmin model provides a well-defined analytical framework that is highly suitable for computational implementation. Its explicit structure supports explainability and aligns with the requirements of interactive intelligent systems, making it a promising foundation for AI-driven argument analysis.
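For computational implementation, the six-element scheme can be expressed as a small annotation schema. The sketch below is illustrative only; the class and field names are ours, not the actual data model of any system discussed here.

```python
from dataclasses import dataclass
from enum import Enum

class ToulminElement(Enum):
    """The six argument elements of the Toulmin variant used in this paper."""
    CLAIM = "claim"
    DATA = "data"
    COUNTER_CLAIM = "counter_claim"
    COUNTER_DATA = "counter_data"
    REBUTTAL = "rebuttal"
    REBUTTAL_DATA = "rebuttal_data"

@dataclass
class ArgumentSpan:
    """A labeled span of essay text (token offsets are illustrative)."""
    element: ToulminElement
    start: int   # token index where the span begins
    end: int     # token index one past the last token
    text: str

# A hypothetical annotation produced for one clause of an essay.
span = ArgumentSpan(ToulminElement.CLAIM, 0, 6,
                    "Social media harms teen attention spans")
```

Representing annotations as typed spans rather than free-form labels is what makes downstream evaluation of effectiveness, depth, and rationality machine-checkable.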
However, the integration of HCI-based intelligent systems with Toulmin’s model remains underexplored, particularly when it comes to automating the identification of argument elements using advanced machine learning techniques. This study bridges that gap by developing an automated system that employs HCI principles alongside advanced text analysis models, dynamically adapts to the argument structure, and enhances the identification of Toulmin’s elements, providing technical support for more efficient essay evaluation in intelligent educational systems [8]. Such integration has the potential to support smart learning, where AI can serve not as an evaluator but as a collaborative partner in the writing development process.
Identifying the six elements of the Toulmin model is nonetheless difficult. In traditional analysis, researchers usually work through an article sentence by sentence and manually assess its argumentation elements, a process that is time-consuming, labor-intensive, and subject to the bias and limitations of subjective judgment. Automatic identification of the six Toulmin elements has therefore become an important research direction in natural language processing. This approach not only enhances the efficiency of understanding and evaluating argumentative essays but also provides technical support for intelligent education systems, aiding teachers in automatic scoring and feedback. In classroom contexts, this automation can free instructors from repetitive annotation tasks, allowing them to focus on higher-order instructional interventions.
Traditional rule-based or shallow machine learning methods cannot fully capture the complex semantic relationships and contextual information in sentences, resulting in inaccurate identification of elements. At present, deep learning technology has been used to automatically mine and expand the key markers of the six elements of the Toulmin model. However, existing deep learning approaches for Toulmin element identification face two key limitations. First, they often rely on sentence-level analysis, which struggles with complex or implicit argument structures. Second, they underutilize lexical markers that are crucial for fine-grained component recognition. These shortcomings limit both accuracy and efficiency, particularly in real-time educational settings.
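The contrast between sentence-level and word-level analysis can be made concrete with BIO-style tags, one common encoding for sequence labeling (the sentence and spans below are invented for illustration):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) span annotations into per-token BIO tags.

    Word-level tags can mark a claim that occupies only part of a sentence,
    something a single sentence-level label cannot express.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # B- marks the span's first token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # I- marks its continuation
    return tags

# One sentence containing TWO argument elements: a claim and its data.
tokens = ["Smoking", "is", "harmful", "because", "it", "damages", "the", "lungs"]
spans = [(0, 3, "CLAIM"), (3, 8, "DATA")]
bio = spans_to_bio(tokens, spans)
```

A sentence-level classifier must choose a single label for this sentence and lose one of the two elements; the token-level encoding preserves both spans and their boundary.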
Recent studies have explored the integration of NLP and HCI techniques in writing assessment systems. For example, the Write & Improve platform developed by Cambridge University leverages AI to provide automated grammar and coherence suggestions, helping learners iteratively revise their writing (Bannò et al. [9]). Similarly, Bai et al. [10] highlight the importance of a human–AI collaborative feedback system for supporting EFL writers, suggesting that this hybrid approach is a potent tool for improving writing performance by addressing both cognitive and affective learning domains. While these systems have improved user engagement and surface-level accuracy, they largely focus on low-level language features such as syntax, word choice, or fluency. They fail to address the deeper rhetorical and logical dimensions of writing, particularly argumentation quality. Moreover, most HCI writing tools do not adopt a structured argumentation framework like Toulmin’s, resulting in feedback that lacks interpretability. Therefore, there is a pressing need for HCI-based writing systems that incorporate a fine-grained argumentative framework, allowing the system to identify argument components. This direction is especially critical in smart learning ecosystems, where the goal is not just error correction but the cultivation of higher-order thinking skills through scaffolded, theory-informed AI assistance.
Current HCI-enabled writing tools (e.g., Write & Improve [9]) focus on surface-level feedback (grammar, fluency) but lack structured argument analysis capabilities, which limits their pedagogical value for developing critical thinking.
To address these challenges, this study proposes KFF-Transformer, an intelligent argument analysis framework that integrates deep learning-based key marker mining with a human-in-the-loop HCI system. By combining word-level argument identification with interactive user feedback, the proposed system bridges the gap between advanced argument mining techniques and practical intelligent applications. The main contributions of this work are threefold:
- (1) A key marker–driven deep learning architecture for fine-grained identification of Toulmin argument elements;
- (2) An interactive human–AI collaborative workflow that supports real-time correction and adaptive refinement;
- (3) An efficient system design validated on both CPU and GPU platforms, demonstrating feasibility for real-world deployment.
These features collectively position KFF-Transformer as a smart learning tool that supports both learners’ argumentative competence and educators’ instructional capacity. Such a design aligns with Grudin’s principles of shared control in collaborative systems [11] and reduces cognitive load [2].
It should be noted that our framework utilizes BERT [12]—a pre-trained Transformer-based encoder—as the underlying contextual representation backbone. However, the core sequence labeling module responsible for Toulmin element identification is built upon a BiLSTM network enhanced with an attention mechanism and does not employ a standard Transformer block (e.g., multi-head self-attention followed by position-wise feed-forward layers). The name “KFF-Transformer” thus reflects both the foundational role of BERT and the framework’s functional purpose as a transformer from unstructured discourse to structured argument annotations.
Our work is grounded in Toulmin’s model of argumentation [5], a foundational framework in educational and computational discourse analysis. Originally proposed by Stephen Toulmin in The Uses of Argument (1958), it decomposes arguments into six core components: claim, data, counter claim, counter data, rebuttal, and rebuttal data. This model has been widely adopted in writing instruction, critical thinking pedagogy, and automated argument mining due to its interpretability and alignment with human reasoning processes [13,14]. The structured output of our KFF-Transformer aligns with principles of pedagogically usable analytics [15], supporting the potential for formative feedback in argumentation instruction.
2. Related Work
The related work on the identification of argument elements in English argumentative essays is introduced from three aspects: the Toulmin model, the framework of argumentation identification technology, and the identification of key markers. First, the Toulmin model clearly defines the claim, data, and their logical relations in an argumentative essay. Second, the framework of argumentation identification technology can be used to automatically identify and analyze argument elements. Finally, the identification of key markers can effectively improve the efficiency and accuracy of identifying the six Toulmin elements. These three dimensions collectively underpin the development of intelligent writing support systems that align with the goals of smart learning—namely, real-time, interpretable, and pedagogically meaningful feedback.
2.1. Toulmin Model
Toulmin’s argument model, proposed by British philosopher Stephen Toulmin [5], is widely used in argument analysis and argumentative writing. This model consists of six core elements: claim, data, counter claim, counter data, rebuttal, and rebuttal data. It provides writers with a systematic model to logically develop their arguments and allows readers to better understand the logical structure of the argumentation.
Liu [16] pointed out that the Toulmin model demonstrates unique advantages in essay writing. It not only helps authors clearly organize and elaborate on complex arguments, but also shows students an effective method of analyzing and evaluating arguments, thus enhancing their critical thinking ability. With clear definitions of claims and data, the model guides authors on how to effectively use evidence to support their positions. At the same time, rebuttals and modal qualifiers encourage authors to consider potential opposing viewpoints and enhance the persuasiveness and rigor of their arguments by setting reasonable boundaries. This structured approach makes the Toulmin model particularly suitable for integration into educational AI tools, where transparency and scaffolding are essential for formative learning.
Although the Toulmin model is useful in practice, researchers may encounter difficulties in dealing with complex and multi-layered argument structures, especially in multi-party debates, where a simple six-element framework may not cover all the details of the argument. Additionally, although the Toulmin model has been widely used in English writing, its applicability in non-English contexts has not been fully tested; for texts with different language structures and cultural backgrounds, the model may be less effective. Another common problem is that applying the Toulmin model relies on the author’s judgment of each element, so its effectiveness largely depends on the author’s writing experience and logical thinking ability. It may therefore be challenging for beginners to identify and distinguish these elements. This highlights the need for AI-augmented systems that can guide novice learners through the Toulmin framework via interactive, adaptive feedback—aligning with the principles of smart learning environments.
2.2. Framework of Argument Identification Technology
The flow chart of argument identification is shown in Figure 1. It covers four subfields: identification of argumentative sentences (which take a clear position and include argument elements), classification of argument elements, analysis of argument structure, and identification of argument logic. These sub-tasks together form a comprehensive understanding and analysis of argumentative essays [17].
- (1) Argumentative sentence identification technology: The core of this technology lies in distinguishing argumentative sentences in the text, which is essentially a binary classification problem. Laha et al. [18] were the first to introduce deep learning into the field of argument mining to achieve argument boundary detection and argumentative component identification.
- (2) Argument element classification technology: This technology focuses on assigning sentences to different categories, such as claims and data. Argument element classification has evolved from traditional supervised learning to deep learning models such as RNNs and BiLSTMs. For example, Kusmantini et al. [19] used support vector machines to classify argumentative texts. Li et al. [20] proposed a joint learning RNN model based on the attention mechanism [21] to address the argument boundary detection issue.
- (3) Argument structure analysis technology: This aims to identify the relationships between argument elements in the text, such as support and opposition. Although the task often relies on domain-specific data, the generalization ability and accuracy of the models are constantly improving. For example, Stab et al. [22] used a binary SVM classifier to label relations between arguments as support or opposition in their research on argument structure identification.
- (4) Argument logic identification technology: This primarily focuses on parsing the relationships between annotated argument elements in the text, a complex task requiring the model not only to understand the content but also to have deep insight into the logical relationships behind it. Toledo-Ronen et al. [23] explored the feasibility of conducting argumentative analysis tasks in non-English environments by using the multilingual BERT model combined with transfer learning strategies. The results show that such methods are well suited for classifying the stance of arguments, but less so for assessing argument quality.
These technologies solve different tasks in argument mining but have certain limitations. Argumentative sentence identification is overly reliant on manually designed features, which may struggle to capture deep textual features. Argument element classification techniques such as RNN models place high demands on data quality and computing resources. In argument structure analysis, the SVM method of Stab et al. [22] performs well but lacks generalization ability, especially on cross-domain data. The BERT model can parse complex argument logic, but its adaptability to non-English data and complex argument structures remains a challenge. Moreover, most existing systems operate as closed-loop pipelines with limited capacity for user interaction—limiting their utility in educational settings where learner agency and teacher oversight are essential.
Current research tends to focus on single subtasks, such as claim and data identification or argument structure analysis, while ignoring the integrity of argument recognition and the interdependence between subtasks. In summary, existing technologies have their own strengths in identifying argument elements in English argumentative essays, and understanding the characteristics of these methods is crucial for selecting the appropriate argument identification technologies.
With the development of natural language processing technology, the Toulmin model has begun to be incorporated into automated analysis tools. For example, Alkhawaldeh et al. [24] proposed deep learning models such as a Lexical Chain with Multi-Head Attention and a multi-column convolutional neural network to automatically generate Toulmin arguments, demonstrating that combining them with reinforcement learning agents can improve the accuracy of generating and reasoning about Toulmin arguments. Mirzababaei et al. [25] developed a dialogue agent system based on the Toulmin model in the context of educational technology that can identify structural errors in arguments. The classifier developed in that study can detect argument elements such as claims and data, and provide feedback in dialogue to help users improve argument quality. This human-in-the-loop design exemplifies how AI can function as a collaborative tutor rather than a static evaluator—a key principle in smart learning. Yang et al. [26] proposed a method that combines weighted features and the BiLSTM-Attention model for argument mining in EFL writing. This method generates dynamic word vectors and obtains sentence-level and article-level features through the BiLSTM and Attention mechanisms, thus annotating the article content according to the Toulmin model. Fromm et al. [27] used the Toulmin model to analyze argument structure in reviews, employing a BERT-based argument mining model to automatically detect claims, data, and other elements in review texts, thus improving the efficiency of the review process.
In summary, the application of the Toulmin model in argumentative writing shows its wide value. However, the limitations of this model in dealing with complex arguments and its adaptability in cross-language environments still need to be solved through further research and exploration. Crucially, future work must prioritize not just technical accuracy but also pedagogical usability—ensuring that automated systems support, rather than replace, human judgment in learning contexts. The combination of the Toulmin model and deep learning will bring more possibilities for automated argument analysis, which will not only drive deeper academic research, but also bring new development opportunities for educational practice.
Despite these advances, a critical limitation persists in many Toulmin-based mining systems: their reliance on sentence-level classification. Models like the FWFBA framework proposed by Yang et al. [26] assign a single Toulmin label per sentence, which inherently fails to capture intra-sentence argument spans or cross-sentence dependencies—common patterns in student writing where a claim may span two clauses or data appears across multiple sentences. This coarse granularity leads to structural oversimplification, resulting in an element identification accuracy of only 69.6%, as reported by the authors. More importantly, such approaches require scanning entire essays to infer logical connections, increasing computational overhead and error propagation in complex or implicit arguments. For formative feedback in educational settings, where precision at the clause or phrase level is essential, this limitation significantly undermines pedagogical utility.
Furthermore, while lexical markers (e.g., “because” for Data, “however” for Rebuttal) are well-established rhetorical cues in argumentation theory [5,13], most deep learning models—including Yang et al.’s [26] BiLSTM-Attention pipeline—treat them implicitly through contextual embeddings without explicit exploitation. Their method does not incorporate a dedicated module to detect, validate, or weight these markers based on discourse function, missing opportunities for interpretability and error correction. Compounding this issue, the FWFBA model requires approximately 23.3 s per batch for inference, a latency that is impractical for real-time classroom feedback. In smart learning environments, where immediacy and interactivity are core design principles [2,11], such inefficiency limits teacher adoption and learner engagement. These shortcomings highlight the need for architectures that explicitly integrate linguistic knowledge with efficient sequence labeling, enabling both high accuracy and low-latency deployment.
2.3. Deep Learning Approaches for Sequence Labeling
With the development of deep learning, various neural network architectures have been applied to sequence labeling tasks in argument mining. Convolutional neural networks (CNNs) are effective at capturing local lexical patterns, while recurrent neural networks (RNNs) and their bidirectional variants (BiLSTMs) model sequential dependencies by processing text in forward and backward directions, making them well-suited for identifying argument elements such as claims and premises. BiLSTM-based models, in particular, have demonstrated strong performance in educational argument analysis due to their ability to encode contextual information from both past and future tokens.
More recently, Transformer-based architectures have emerged as a powerful alternative, leveraging self-attention mechanisms to model long-range dependencies between tokens regardless of their positional distance. Unlike recurrent models, Transformers enable parallel computation over input sequences and excel at capturing global contextual relationships—properties that are highly beneficial for analyzing complex, multi-component argumentative structures. The success of attention mechanisms has also extended beyond natural language processing; for example, transformer models have been adapted for the automated acquisition of static street view images to analyze building characteristics [28], and attention-based approaches have shown strong cross-building transferability in fault detection and diagnosis for air handling units in auditoriums and hospitals [29].
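The scaled dot-product attention at the heart of these architectures can be sketched with the standard library alone; the vectors below are toy values for illustration, not learned embeddings.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over a token sequence.

    Each weight reflects how strongly the query attends to each key,
    independent of positional distance between the tokens.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]               # softmax over the scores

# A query most similar to the third key receives the largest weight,
# no matter how far away that key sits in the sequence.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
w = attention_weights(q, keys)
```

This distance-independence is what lets attention link, say, a rebuttal to a claim stated several sentences earlier, which recurrent models must carry through every intermediate step.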
Despite these advances, existing studies in educational argument mining rarely integrate explicit linguistic cues—such as domain-specific key argumentative markers—with structured knowledge about argumentation schemes. As a result, models often fail to leverage the semantic signals that human annotators rely on when identifying Toulmin elements. Therefore, integrating domain knowledge with key argumentative markers is essential to improve the performance and interpretability of argument element recognition in educational contexts.
2.4. Identification of Key Markers
Identifying argument elements in English essays relies heavily on detecting rhetorically significant phrases known as key markers, not merely high-frequency keywords. Unlike general keyword extraction (e.g., TF-IDF or TextRank), argumentative key markers signal specific Toulmin functions: “since” often introduces data, “nevertheless” signals a rebuttal, and “my view is that” may frame a claim; similarly, “for example” typically marks the beginning of an illustration, while “I think” can indicate the author’s personal opinion. Effective identification of these markers is crucial for machines to comprehend argument structure and provide interpretable, pedagogically meaningful feedback. In educational applications, such markers also serve as teachable cues that can be highlighted to learners, helping them internalize rhetorical conventions through AI-mediated scaffolding. Key marker identification therefore differs from traditional keyword extraction: whereas keywords are usually high-frequency important words in a text, key markers expose the elements of its argument structure, which is of great significance for improving the accuracy and depth of argument analysis.
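To make the distinction concrete, a minimal rule-based detector might map such cue phrases to the Toulmin function they typically signal. The lexicon below is hypothetical, assembled from the examples above; the actual system mines markers with deep learning rather than a fixed list.

```python
# Illustrative cue lexicon (hypothetical); real systems mine these
# markers from annotated corpora rather than hard-coding them.
MARKER_LEXICON = {
    "since": "data",
    "because": "data",
    "nevertheless": "rebuttal",
    "however": "rebuttal",
    "my view is that": "claim",
    "i think": "claim",
}

def find_markers(sentence):
    """Return (marker, toulmin_function) pairs found in a sentence.

    Naive substring matching for brevity; longer markers are tried first
    so "my view is that" is not shadowed by a shorter overlapping entry.
    """
    text = sentence.lower()
    hits = []
    for marker in sorted(MARKER_LEXICON, key=len, reverse=True):
        if marker in text:
            hits.append((marker, MARKER_LEXICON[marker]))
    return hits

hits = find_markers("Nevertheless, I think exams are useful since they motivate study.")
```

Even this crude lookup shows why markers carry more structural signal than raw frequency: a single sentence can expose a claim, a rebuttal cue, and a data cue at once.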
Early keyword extraction technologies originated from the word-frequency-based method proposed by Luhn [30] in 1957. These techniques have since evolved along two lines: unsupervised and supervised. Unsupervised methods such as TextRank, proposed by Mihalcea et al. [31], rely on statistical features and graph models, determining the importance of words through word co-occurrence and the PageRank algorithm. Supervised methods exploit text semantics through classification or sequence labeling. Sarkar et al. [32] used naive Bayes for keyword extraction and found that naive Bayes classifiers with suitable discretization algorithms achieve good extraction performance; Guleria et al. [33] combined SVM with a feature extraction method to effectively improve keyword extraction, significantly outperforming traditional methods such as SingleRank, ExpandRank, and baseline TF-IDF in terms of accuracy. Li et al. [34] proposed a keyword extraction method combining a phrase-level attention mechanism and conditional random fields, which effectively captures phrase-level features and combines them with word-level features to improve extraction performance.
Unsupervised methods do not require manually annotated training data. Through word co-occurrence and graph models, they can better capture the structural relationship within the text. However, the inability to fully utilize the semantic information of the text and dynamically adjust parameters limits semantic understanding. Supervised methods can more accurately understand and extract keywords from texts by learning from labeled data. However, the need for extensive labeled data leads to high costs, long model training times, and substantial consumption of computing resources. These constraints are particularly problematic in educational settings, where rapid deployment and low-resource adaptability are often required.
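To make the unsupervised line concrete, a toy TextRank over a word co-occurrence graph might look like the sketch below (deliberately simplified: uniform edge weights, a fixed iteration count, and no convergence test, unlike the published algorithm).

```python
def textrank(words, window=2, damping=0.85, iters=30):
    """Toy TextRank: score words by PageRank over a co-occurrence graph.

    Words co-occurring within `window` positions are linked; repeated
    iteration lets importance flow along those links (no labels needed).
    """
    vocab = sorted(set(words))
    neighbors = {w: set() for w in vocab}
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j and words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    score = {w: 1.0 for w in vocab}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[n] / len(neighbors[n]) for n in neighbors[w] if neighbors[n]
            )
            for w in vocab
        }
    return score

# "evidence" occurs twice and thus sits in a dense neighborhood of the graph.
words = "strong evidence supports the claim because evidence matters".split()
scores = textrank(words)
```

Note what the example also demonstrates: the highest-scoring word is simply the best-connected one, with no regard for its argumentative function, which is exactly the gap key marker identification targets.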
In recent years, the progress of deep learning technology has significantly improved key marker identification and text feature extraction. Cheng et al. [35] combined part-of-speech tagging with the BiLSTM-CRF model, significantly improving the accuracy of keyword extraction; this approach preserves the correlation between words while mining temporal and semantic information from the data. The introduction of Transformer models has further improved key marker identification. Sun et al. [36], building on the BERT model, captured both local phraseness and global informativeness when extracting key phrases. BERT is highly flexible and can be applied in both unsupervised and supervised settings, depending on the needs of pre-training and fine-tuning. Xu et al. [37] elaborated on the advantages of the Transformer in multimodal data processing, emphasizing its ability to capture contextual information and semantic relationships. Yan et al. [38] likewise showed that the Transformer model has significant advantages in feature extraction and in processing complex text: through pre-training on a large-scale corpus, it captures richer contextual information and semantic relationships, and, combined with the self-attention mechanism, such models show greater flexibility and accuracy on complex texts. However, for real-time educational applications, lightweight adaptations of these models—such as those enabling CPU-based inference—are essential to ensure accessibility and responsiveness in diverse classroom environments.
In summary, while the Toulmin model has shown wide value in automated argument analysis, two persistent gaps limit its educational impact: (1) the dominance of sentence-level modeling that ignores fine-grained argument spans, and (2) the underutilization of explicit lexical markers that could enhance both accuracy and explainability. Moreover, existing systems like those in [24,25,26,27] operate as closed-loop pipelines with high latency and limited user interaction—failing to support the human-in-the-loop, real-time feedback required in smart learning ecosystems.
3. Identification of Argument Elements Based on Key Markers
3.1. HCI System Design for Collaborative Argument Analysis
The core innovation of this study is a dual-workflow human–AI collaboration system that integrates advanced NLP with interactive user control.
This research aims to leverage HCI principles in developing an intuitive and efficient automated system for identifying Toulmin’s argument elements. The system design focuses on improving the user experience by providing real-time feedback to users during the essay evaluation process. Such immediacy is particularly valuable in educational contexts, where timely formative feedback can significantly enhance students’ argumentative writing development.
We integrated a deep learning-based architecture that allows users to adjust parameters or review automatic annotations interactively. The human side shows a simple user journey: uploading text, reviewing results, and making adjustments. The machine side reveals the technical processes: text preprocessing, multi-level feature extraction using BERT-CNN, BiLSTM, and attention mechanisms, followed by the KFF-Transformer, which identifies Toulmin’s argument elements. The system incorporates user feedback for continuous improvement, combining advanced NLP techniques with human oversight. The HCI flow chart shown in Figure 2 depicts a text analysis system with parallel human and machine workflows.
As shown in Figure 2, this system enables iterative refinement of argument analysis through three key phases:
- (1) User-Driven Workflow: Integrating Human Expertise
In the first phase, users upload English argumentative essays via an intuitive web interface, and the system generates initial visual annotations of Toulmin elements (e.g., claims, data, counterarguments) using color-coded highlighting. Users can then review these results by hovering over tags to inspect confidence scores and contextual explanations, and dynamically adjust classifications through dropdown menus or boundary edits, so that human oversight corrects potential AI misclassifications before finalization. This interactive review process prioritizes user control, reducing cognitive load and allowing educators or learners to refine argument validity based on their expertise, while the visual feedback mechanism fosters transparency and engagement. By externalizing the AI’s reasoning through explainable markers and confidence indicators, the system supports metacognitive awareness—helping learners understand not just what is wrong, but why.
- (2) AI-Driven Processing: Automated Argument Decomposition
The second phase shifts to automated computational processing, where the machine executes a multi-stage pipeline starting with text preprocessing (sentence segmentation, tokenization, POS tagging, and dependency parsing), followed by multi-level feature extraction using a CNN for local n-gram patterns and BiLSTM-Attention for global discourse dependencies. This culminates in the KFF-Transformer model fusing key marker embeddings, part-of-speech encodings, and positional information through a multi-head attention mechanism to classify Toulmin elements, with adaptive computation concentrating resources on marker-dense regions to achieve real-time performance. This optimized workflow reduces identification time by 18.9% compared with traditional methods, enabling swift responses during writing sessions while maintaining high accuracy through attention-based weighting of contextual features. The efficiency gain is critical for classroom integration, where latency must be minimized to sustain learner engagement during drafting or peer review activities.
- (3) Collaborative Feedback Loop: Co-Adaptive Refinement
The third phase establishes a collaborative feedback loop that bridges the human and AI workflows: user adjustments trigger incremental model fine-tuning via transfer learning, capturing discipline-specific writing patterns (e.g., legal vs. scientific arguments) in personalized marker dictionaries, while all corrections are logged as (input, correction) pairs for pedagogical analytics, such as identifying common misconception trends or supporting longitudinal skill assessment. This bidirectional adaptation embodies Horvitz’s mixed-initiative interaction principles [39], allowing the system to evolve with user input and thereby refine argument analysis iteratively, supporting continuous improvement in writing quality through co-adaptive refinement. Over time, this loop transforms the system from a static analyzer into a personalized writing coach, aligning with the vision of smart learning environments that adapt to individual learner trajectories.
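As an illustrative sketch only (the class, method names, and promotion threshold below are assumptions, not the deployed system’s API), the logging of (input, correction) pairs and the growth of a personalized marker dictionary in this feedback loop might look as follows:

```python
from collections import defaultdict

# Hypothetical sketch of the collaborative feedback loop: every human
# correction is logged as an (input, correction) pair, and a cue word that
# recurs across corrections is promoted into a personalized marker dictionary
# that later fine-tuning can draw on. All names and thresholds are illustrative.
class FeedbackLoop:
    def __init__(self, promote_after=2):
        self.corrections = []               # logged (input span, corrected label) pairs
        self.cue_counts = defaultdict(int)  # cue word -> times confirmed by users
        self.personal_markers = {}          # cue word -> Toulmin element label
        self.promote_after = promote_after  # promote a cue after N confirmations

    def record_correction(self, span, corrected_label, cue=None):
        # Log the pair for pedagogical analytics and incremental fine-tuning.
        self.corrections.append((span, corrected_label))
        if cue is not None:
            self.cue_counts[cue] += 1
            # A cue confirmed often enough becomes a personalized marker.
            if self.cue_counts[cue] >= self.promote_after:
                self.personal_markers[cue] = corrected_label

loop = FeedbackLoop()
loop.record_correction("since fees rose sharply", "Data", cue="since")
loop.record_correction("since the 2019 reform", "Data", cue="since")
# after two confirmations, "since" is promoted as a personalized Data marker
```

In a real deployment, the promoted markers would feed the transfer-learning step described above rather than a simple dictionary.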
Figure 3 presents the interface of our Toulmin AI Recognition System, featuring four integrated modules designed for collaborative argument analysis. The Element Selection panel (top-left) provides color-coded identification of argument components with red for claims, blue for data, and green for rebuttals. An adjustable Confidence Threshold slider (top-right) allows users to fine-tune annotation sensitivity between 0.5 and 0.9, with a default setting of 0.7. The interactive workspace supports real-time boundary adjustment through draggable selection handles, while the Personalized Markers module (bottom-right) enables custom annotation patterns through an expandable sidebar interface. This unified design supports mixed-initiative interaction by maintaining AI automation while preserving precise human control through intuitive visual encoding and adaptive controls. Notably, the interface is designed for both novice learners (who benefit from guided defaults) and expert instructors (who require granular control), reflecting inclusive design principles in educational technology.
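The behavior of the Confidence Threshold slider can be sketched as a simple filter over annotations; the function and field names below are illustrative assumptions, not the interface’s actual code:

```python
# Illustrative sketch of the Confidence Threshold slider: annotations whose
# model confidence falls below the chosen value (0.5-0.9, default 0.7 in the
# interface) are hidden from the workspace. Names are hypothetical.
DEFAULT_THRESHOLD = 0.7

def visible_annotations(annotations, threshold=DEFAULT_THRESHOLD):
    """Keep only annotations whose confidence meets the slider value."""
    return [a for a in annotations if a["confidence"] >= threshold]

anns = [
    {"span": "uniforms promote equality", "label": "Claim",    "confidence": 0.91},
    {"span": "a 2019 survey shows",       "label": "Data",     "confidence": 0.66},
    {"span": "critics argue otherwise",   "label": "Rebuttal", "confidence": 0.74},
]
shown = visible_annotations(anns)  # at the 0.7 default, the 0.66 Data span is hidden
```

Lowering the slider to 0.5 would surface all three spans; raising it to 0.9 would leave only the high-confidence claim, which is how novices can trade recall for precision without touching the model.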
3.2. Automatic Key Marker Mining Algorithm
This study designs a new deep learning framework, the “Automatic Key Marker Mining Algorithm”, which aims to automatically mine additional potential key markers based on known key markers. For example, given the sentence “Students should wear uniforms because it promotes equality”, the algorithm identifies “because” as a key marker signaling the Data component. Similarly, “however” or “on the contrary” would be mined as markers for Rebuttal. The key marker mining flow chart is shown in Figure 4, with the specific implementation steps of the algorithm described as follows.
First, the algorithm uses a convolutional neural network (CNN) to analyze short-distance information. Since text is one-dimensional data, the CNN extracts feature vectors effectively by applying one-dimensional convolution kernels. Subsequently, to process long-distance information, we integrate Bidirectional Long Short-Term Memory (BiLSTM) with an attention mechanism. The bidirectional structure of the BiLSTM allows the algorithm to consider both the preceding and following context, capturing long-distance dependencies in the text and providing an optimal prediction at each time step. The attention mechanism assigns a separate weight to each part of the input sequence, making the model more flexible when processing input data and allowing it to focus selectively on useful information, such as key markers, to improve performance. This selective focus mimics how skilled readers prioritize rhetorical cues, a capability essential for modeling human-like argument comprehension in educational AI.
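The attention weighting described above can be sketched in toy form. The dot-product scoring, fixed query, and two-dimensional hidden states below are simplifications for illustration, not the learned attention of the actual BiLSTM-Attention model:

```python
import math

# Toy sketch of attention over a sequence of hidden vectors: each position
# receives a scalar score, scores are softmax-normalized into weights, and
# the context vector is the weighted sum of the hidden states. The scoring
# function (dot product with a fixed query) is a deliberate simplification.
def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(hidden, query):
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, hidden))
               for i in range(dim)]
    return weights, context

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three time steps, dim 2
weights, context = attend(hidden, query=[1.0, 0.0])
# positions whose hidden state aligns with the query receive larger weights
```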
To further enhance the accuracy and reliability of marker prediction, we introduce a conditional random field (CRF) layer. The CRF applies dependency constraints learned automatically from the training data, such as the interdependence between adjacent tags and the constraint that an “I” (inside-marker) tag cannot appear without a preceding “B” (begin-marker) tag, compensating for potentially invalid or disordered label sequences in sequence labeling tasks and ensuring that the predicted sequence satisfies the tagging scheme. By adding a CRF layer after the output layer of the BiLSTM-Attention model, the algorithm can weigh label decisions over the entire sequence rather than token by token, thereby achieving the globally optimal labeling sequence. This constraint-based decoding is especially important in educational applications, where grammatical and rhetorical plausibility must be preserved to avoid misleading learners with syntactically anomalous suggestions.
3.3. Key Marker-Based Argument Element Identification Model
Building on the previous section, this paper proposes a deep learning model named KFF-Transformer, grounded in key argumentative markers. Here, “KFF” denotes Key Feature Fusion—a novel module that dynamically integrates sentence-level and discourse-level features through an attention-weighted combination of key markers, part-of-speech tags, and positional cues. The term “Transformer” does not imply a full Transformer encoder, but rather highlights the use of a Transformer-style attention mechanism within the KFF module to enable interpretable, context-aware fusion of multi-source features. Although the sequence modeling backbone employs a BiLSTM (rather than a standard Transformer block), we refer to the overall framework as KFF-Transformer to emphasize this critical attention-driven design, which directly supports fine-grained, pedagogically meaningful analysis of Toulmin elements.
This model integrates annotated and expanded key markers into sentences to achieve word-level recognition, enabling direct identification of the six elements of the Toulmin model in sentences containing specific marker phrases or even a single marker word. This marker-first strategy not only boosts precision but also enhances model interpretability, allowing teachers to trace AI decisions back to explicit linguistic cues, a crucial requirement for trustworthy educational AI.
The marker-based argumentation element identification model for English argumentative essays is shown in Figure 5. In the model, “Sen” represents the sentence in the argumentative essay and “Pos” represents the part-of-speech tagging; both are encoded using BERT [12], a pre-trained Transformer-based language model, to obtain contextualized semantic vectors. “Keyword” represents the key marker, “position” represents the position of each sentence in the text, “x1” is the sentence-level text feature encoding obtained through the linear layer, and “x2” is the chapter-level text feature encoding.
It should be emphasized that while our framework leverages BERT—a genuine Transformer-based encoder—for foundational contextual representation, the core sequence labeling module of KFF-Transformer is built upon a BiLSTM network integrated with an attention mechanism and does not implement a standard Transformer block (e.g., multi-head self-attention with feed-forward sublayers). The name “KFF-Transformer” thus reflects both the use of BERT and the system’s functional role as a transformer from unstructured text to structured argument annotations.
The workflow of the KFF-Transformer model is as follows. When the model encounters a phrase or word containing a key marker (regardless of case, plurality, or tense), the corresponding category among the six elements is identified immediately. For instance, in the sentence “Although some argue uniforms stifle creativity, this ignores their role in reducing distractions”, the model labels “stifle creativity” as Counter Claim and “reducing distractions” as Rebuttal, demonstrating fine-grained, intra-sentence identification. If no marker is present, the BERT model is used to encode the text and generate semantic vectors for the sentences. These semantic vectors are then sent to the first level of the model together with the expanded key markers obtained in the previous section, where a Bidirectional Long Short-Term Memory (BiLSTM) network combined with an attention mechanism constructs sentence-level text feature representations. Simultaneously, we construct a text-level representation by combining the first-layer feature expressions with the positional information of the text. Finally, the six elements of the Toulmin model are identified by fusing the inter-sentence and chapter-level features.
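The marker-first lookup step can be sketched minimally as follows; the dictionary contents and the crude suffix-stripping normalizer (a stand-in for the system’s case, plurality, and tense handling) are illustrative assumptions:

```python
# Hypothetical sketch of the marker-first lookup: tokens are normalized
# (lower-cased, punctuation stripped, with naive suffix stripping standing in
# for plurality/tense handling) and matched against a key marker dictionary
# mapping cues to Toulmin elements. The real dictionary is far richer.
MARKER_DICT = {
    "because": "Data",
    "however": "Rebuttal",
    "although": "Counter Claim",
    "therefore": "Claim",
}

def normalize(token):
    token = token.lower().strip(".,;:!?")
    for suffix in ("ing", "ed", "es", "s"):   # crude morphological stand-in
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def match_markers(sentence):
    hits = []
    for token in sentence.split():
        base = token.lower().strip(".,;:!?")
        label = MARKER_DICT.get(base) or MARKER_DICT.get(normalize(token))
        if label:
            hits.append((token, label))
    return hits
```

Sentences with no dictionary hits would fall through to the BERT-based encoding path described above.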
Introducing the features of key markers into the argument mining task enhances the model’s ability to analyze and understand arguments. These features provide valuable clues and contextual information, enabling the model to evaluate, analyze, and generate persuasive arguments more accurately. The specific calculation formula is shown in Equation (1).
Here, “Sen” represents the sentence in the argumentative essay, and “Keyword” is the key marker encoding produced by the model in the previous section. After “Sen” and “Keyword” are concatenated and passed into the BiLSTM, the fused feature representation is obtained.
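Equation (1) itself is not reproduced in this excerpt; based on the description above, one plausible form, with hypothetical symbol names, is:

```latex
% Hedged reconstruction, not the paper's exact Equation (1):
% e_{Sen} and e_{Keyword} denote the encodings of the sentence and its key
% markers; [\,\cdot\,;\,\cdot\,] denotes vector concatenation.
x_1 = \mathrm{BiLSTM}\big(\,[\,e_{\mathrm{Sen}}\;;\;e_{\mathrm{Keyword}}\,]\,\big)
```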
The KFF-Transformer model fully considers the fact that different Toulmin element identification tasks have varying requirements for textual features, and the contribution of text features at different levels to argument mining also varies. Therefore, this section adopts the feature-weighted fusion method to combine sentence-level and paragraph-level features, aiming to improve identification accuracy and stability. The feature weighting formula is shown in Equation (2).
“W” is the weight matrix, which is randomly initialized and updated during model training. The paragraph-level representation is obtained by encoding the concatenation of the sentence features, POS features, and positional information with a BiLSTM. “Linear” denotes the linear layer, which reduces the dimensionality of the sentence feature representation. The POS vector is the representation of part-of-speech tags aligned with the sentence. This dynamic weighting mechanism allows the model to adaptively emphasize either local marker evidence or global discourse context depending on the argumentative genre, a flexibility that mirrors expert human judgment and supports cross-domain applicability in diverse educational settings.
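Equation (2) is likewise not reproduced in this excerpt; one plausible instantiation of the feature-weighted fusion described above, with hypothetical symbols, is a learned gate over the two feature levels:

```latex
% Hedged sketch of a feature-weighted fusion consistent with the text:
% x_1 is the (linearly projected) sentence-level feature, x_2 the
% paragraph/chapter-level feature, W the trainable weight matrix,
% \sigma a sigmoid gate, and \odot elementwise multiplication.
\alpha = \sigma\big(W\,[\,x_1\;;\;x_2\,]\big), \qquad
x = \alpha \odot x_1 + (1 - \alpha) \odot x_2
```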
5. Conclusions
This study presents KFF-Transformer, a human–AI collaborative framework for fine-grained argument analysis that integrates structured argumentation theory with advanced deep learning models and interactive system design. By embedding Toulmin’s model into an intelligent computing system, the proposed framework moves beyond traditional static scoring approaches and demonstrates how real-time, explainable, and collaborative AI systems can be realized in practice.
Experimental results show that KFF-Transformer achieves notable improvements in both efficiency and accuracy, including a 3.7 percentage-point increase in accuracy, a 2.8-point improvement in F1-score, and a processing time reduction of 18.9% on CPU—with near-real-time inference (3.3 s) achieved on GPU platforms. Critically, the human-in-the-loop interaction mechanism not only improves annotation reliability but also empowers users to actively shape system behavior through dynamic corrections, thereby fostering a co-adaptive writing support experience.
From a system perspective, the core contribution of this work lies in the synergistic integration of fine-grained, key marker-driven argument identification, a mixed-initiative interaction paradigm, and an efficient, hardware-aware implementation. Specifically, the framework grounds its predictions in linguistically meaningful key markers to enhance interpretability; it preserves human agency by enabling users to review, correct, and refine automated suggestions in real time; and it is engineered for responsiveness on widely available computing devices—from laptops equipped with consumer-grade GPUs to cloud-based infrastructures—ensuring practical deployability without sacrificing performance. Together, these features position KFF-Transformer as a practical exemplar of trustworthy, human-centered AI in educational technology.
From a computing and artificial intelligence perspective, this work contributes a scalable, interpretable, and responsive argument mining framework that successfully bridges deep learning architectures with interactive design principles. While recent large transformer-based models offer strong performance, they often lack transparency and fine-grained interpretability. In contrast, our KFF-Transformer grounds its predictions in explicit, token-level key markers—providing not only high accuracy but also clear, pedagogically meaningful explanations for each identified argument element. This transparency makes the system particularly suitable for educational settings, where understanding why a claim or rebuttal was detected is as crucial as the detection itself.
Nevertheless, error analysis on misclassified instances reveals several limitations. The model occasionally fails when argument elements lack explicit key markers (e.g., implicit claims like “Uniforms are unfair”), struggles with syntactically nested rebuttals (e.g., “Although X, which challenges Y, Z remains valid”), and exhibits reduced robustness on domain-specific jargon outside the training distribution. These cases highlight the current reliance on marker-driven signals—a limitation partially mitigated in practice by the human-in-the-loop interface, which allows users to correct ambiguous outputs during interactive review.
Future work will focus on extending the framework to multilingual and cross-domain argumentative texts, integrating large language models (LLMs) for hybrid symbolic-neural reasoning, and optimizing the pipeline for large-scale deployment in cloud-native intelligent tutoring systems. Additionally, we plan to investigate long-term user engagement and learning gains through classroom-based longitudinal studies.