1. Introduction
Recent advances in computing platforms and artificial intelligence have enabled the deployment of intelligent systems that emphasize real-time interaction, adaptability, and user-centered design. Human–computer interaction (HCI) has emerged as a crucial element in developing user-centric writing assessment systems [1]. With the increasing integration of intelligent systems into education, HCI-enabled smart learning environments are now expected not only to reduce cognitive load but also to foster metacognitive awareness and support formative assessment practices that enhance learning effectiveness [2]. Unlike traditional static evaluations, HCI-enabled systems provide dynamic, personalized feedback, enhancing the learning process and student engagement [3]. In the domain of academic writing, especially argumentative essays, such systems help students improve their arguments through interactive feedback [4]. This iterative refinement is particularly valuable in educational settings where developing logical reasoning and persuasive communication is a core learning objective.
Toulmin’s model [5] provides a structured and interpretable framework that is well suited for computational modeling in HCI-enabled writing assessment. By breaking down arguments into six distinct components, namely claim, data, counter claim, counter data, rebuttal, and rebuttal data, the model enables intelligent systems to assess not only the presence of argumentation but also its quality, structure, and depth. This granularity aligns well with the requirements of interactive writing platforms, which aim to provide detailed feedback and personalized guidance to users in real time. Furthermore, the universality of Toulmin’s framework enables its adaptation to diverse linguistic contexts and educational objectives, making it a practical foundation for intelligent feedback mechanisms in HCI-based writing environments. Its alignment with established pedagogical standards, such as those in critical thinking curricula and EFL writing instruction, further strengthens its relevance for smart learning applications. The system design adheres to Sweller’s Cognitive Load Theory [6] and Anderson’s Adaptive Learning Framework [7]. HCI-enabled systems leverage interactive interfaces and adaptive algorithms to support dynamic user engagement and iterative refinement, thereby improving both efficiency and usability.
The six elements of the Toulmin model represent the most comprehensive classification of argumentative elements, providing clear standards for evaluating the effectiveness, depth, and rationality of arguments, and serving as a key tool for measuring the logical rigor and persuasiveness of argumentative essays. The Toulmin model provides a well-defined analytical framework that is highly suitable for computational implementation. Its explicit structure supports explainability and aligns with the requirements of interactive intelligent systems, making it a promising foundation for AI-driven argument analysis.
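For computational implementation, the six-element scheme can be expressed as a small annotation schema. The sketch below is illustrative only; the class and field names are ours, not the actual data model of any system discussed here.

```python
from dataclasses import dataclass
from enum import Enum

class ToulminElement(Enum):
    """The six argument elements of the Toulmin variant used in this paper."""
    CLAIM = "claim"
    DATA = "data"
    COUNTER_CLAIM = "counter_claim"
    COUNTER_DATA = "counter_data"
    REBUTTAL = "rebuttal"
    REBUTTAL_DATA = "rebuttal_data"

@dataclass
class ArgumentSpan:
    """A labeled span of essay text (token offsets are illustrative)."""
    element: ToulminElement
    start: int   # token index where the span begins
    end: int     # token index one past the last token
    text: str

# A hypothetical annotation produced for one clause of an essay.
span = ArgumentSpan(ToulminElement.CLAIM, 0, 6,
                    "Social media harms teen attention spans")
```

Representing annotations as typed spans rather than free-form labels is what makes downstream evaluation of effectiveness, depth, and rationality machine-checkable.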
However, the integration of HCI-based intelligent systems with Toulmin’s model remains underexplored, particularly when it comes to automating the identification of argument elements using advanced machine learning techniques. This study bridges that gap by developing an automated system that employs HCI principles alongside advanced text analysis models, dynamically adapts to the argument structure, and enhances the identification of Toulmin’s elements, providing technical support for more efficient essay evaluation in intelligent educational systems [8]. Such integration has the potential to support smart learning, where AI can serve not as an evaluator but as a collaborative partner in the writing development process.
Identifying the six elements of the Toulmin model is nonetheless difficult. In traditional analysis, researchers usually work through an article sentence by sentence and manually assess its argumentation elements, a process that is time-consuming, labor-intensive, and subject to the bias and limitations of subjective judgment. Automatic identification of the six Toulmin elements has therefore become an important research direction in natural language processing. This approach not only enhances the efficiency of understanding and evaluating argumentative essays but also provides technical support for intelligent education systems, aiding teachers in automatic scoring and feedback. In classroom contexts, this automation can free instructors from repetitive annotation tasks, allowing them to focus on higher-order instructional interventions.
Traditional rule-based or shallow machine learning methods cannot fully capture the complex semantic relationships and contextual information in sentences, resulting in inaccurate identification of elements. At present, deep learning technology has been used to automatically mine and expand the key markers of the six elements of the Toulmin model. However, existing deep learning approaches for Toulmin element identification face two key limitations. First, they often rely on sentence-level analysis, which struggles with complex or implicit argument structures. Second, they underutilize lexical markers that are crucial for fine-grained component recognition. These shortcomings limit both accuracy and efficiency, particularly in real-time educational settings.
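The contrast between sentence-level and word-level analysis can be made concrete with BIO-style tags, one common encoding for sequence labeling (the sentence and spans below are invented for illustration):

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) span annotations into per-token BIO tags.

    Word-level tags can mark a claim that occupies only part of a sentence,
    something a single sentence-level label cannot express.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # B- marks the span's first token
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # I- marks its continuation
    return tags

# One sentence containing TWO argument elements: a claim and its data.
tokens = ["Smoking", "is", "harmful", "because", "it", "damages", "the", "lungs"]
spans = [(0, 3, "CLAIM"), (3, 8, "DATA")]
bio = spans_to_bio(tokens, spans)
```

A sentence-level classifier must choose a single label for this sentence and lose one of the two elements; the token-level encoding preserves both spans and their boundary.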
Recent studies have explored the integration of NLP and HCI techniques in writing assessment systems. For example, the Write & Improve platform developed by Cambridge University leverages AI to provide automated grammar and coherence suggestions, helping learners iteratively revise their writing (Bannò et al. [9]). Similarly, Bai et al. [10] highlight the importance of a human–AI collaborative feedback system for supporting EFL writers, suggesting that this hybrid approach is a potent tool for improving writing performance by addressing both cognitive and affective learning domains. While these systems have improved user engagement and surface-level accuracy, they largely focus on low-level language features such as syntax, word choice, or fluency. They fail to address the deeper rhetorical and logical dimensions of writing, particularly argumentation quality. Moreover, most HCI writing tools do not adopt a structured argumentation framework like Toulmin’s, resulting in feedback that lacks interpretability. Therefore, there is a pressing need for HCI-based writing systems that incorporate a fine-grained argumentative framework, allowing the system to identify argument components. This direction is especially critical in smart learning ecosystems, where the goal is not just error correction but the cultivation of higher-order thinking skills through scaffolded, theory-informed AI assistance.
Current HCI-enabled writing tools (e.g., Write & Improve [9]) focus on surface-level feedback (grammar, fluency) but lack structured argument analysis capabilities, which limits their pedagogical value for developing critical thinking.
To address these challenges, this study proposes KFF-Transformer, an intelligent argument analysis framework that integrates deep learning-based key marker mining with a human-in-the-loop HCI system. By combining word-level argument identification with interactive user feedback, the proposed system bridges the gap between advanced argument mining techniques and practical intelligent applications. The main contributions of this work are threefold:
- (1) A key marker–driven deep learning architecture for fine-grained identification of Toulmin argument elements;
- (2) An interactive human–AI collaborative workflow that supports real-time correction and adaptive refinement;
- (3) An efficient system design validated on both CPU and GPU platforms, demonstrating feasibility for real-world deployment.
These features collectively position KFF-Transformer as a smart learning tool that supports both learners’ argumentative competence and educators’ instructional capacity. Such a design aligns with Grudin’s principles of shared control in collaborative systems [11] and reduces cognitive load [2].
It should be noted that our framework utilizes BERT [12]—a pre-trained Transformer-based encoder—as the underlying contextual representation backbone. However, the core sequence labeling module responsible for Toulmin element identification is built upon a BiLSTM network enhanced with an attention mechanism and does not employ a standard Transformer block (e.g., multi-head self-attention followed by position-wise feed-forward layers). The name “KFF-Transformer” thus reflects both the foundational role of BERT and the framework’s functional purpose as a transformer from unstructured discourse to structured argument annotations.
Our work is grounded in Toulmin’s model of argumentation [5], a foundational framework in educational and computational discourse analysis. Originally proposed by Stephen Toulmin in The Uses of Argument (1958), it decomposes arguments into six core components: claim, data, counter claim, counter data, rebuttal, and rebuttal data. This model has been widely adopted in writing instruction, critical thinking pedagogy, and automated argument mining due to its interpretability and alignment with human reasoning processes [13,14]. The structured output of our KFF-Transformer aligns with principles of pedagogically usable analytics [15], supporting the potential for formative feedback in argumentation instruction.
2. Related Work
The related work on the identification of argument elements in English argumentative essays is introduced from three aspects: the Toulmin model, the framework of argumentation identification technology, and the identification of key markers. First, the Toulmin model clearly defines the claim, data, and their logical relations in an argumentative essay. Second, the framework of argumentation identification technology can be used to automatically identify and analyze argument elements. Finally, the identification of key markers can effectively improve the efficiency and accuracy of identifying the six Toulmin elements. These three dimensions collectively underpin the development of intelligent writing support systems that align with the goals of smart learning—namely, real-time, interpretable, and pedagogically meaningful feedback.
2.1. Toulmin Model
Toulmin’s argument model, proposed by British philosopher Stephen Toulmin [5], is widely used in argument analysis and argumentative writing. This model consists of six core elements: claim, data, counter claim, counter data, rebuttal, and rebuttal data. It provides writers with a systematic model to logically develop their arguments and allows readers to better understand the logical structure of the argumentation.
Liu [16] pointed out that the Toulmin model demonstrates unique advantages in essay writing. It not only helps authors clearly organize and elaborate on complex arguments, but also shows students an effective method of analyzing and evaluating arguments, thus enhancing their critical thinking ability. With clear definitions of claims and data, the model guides authors on how to effectively use evidence to support their positions. At the same time, rebuttals and modal qualifiers encourage authors to consider potential opposing viewpoints and enhance the persuasiveness and rigor of their arguments by setting reasonable boundaries. This structured approach makes the Toulmin model particularly suitable for integration into educational AI tools, where transparency and scaffolding are essential for formative learning.
Although the Toulmin model is useful in practice, researchers may encounter difficulties in dealing with complex and multi-layered argument structures, especially in multi-party debates, where a simple six-element framework may not cover all the details of the argument. Additionally, although the Toulmin model has been widely used in English writing, its applicability in non-English contexts has not been fully tested; for texts with different language structures and cultural backgrounds, the model may be less effective. Another common problem is that applying the Toulmin model relies on the author’s judgment of each element, so its effectiveness largely depends on the author’s writing experience and logical thinking ability. It may therefore be challenging for beginners to identify and distinguish these elements. This highlights the need for AI-augmented systems that can guide novice learners through the Toulmin framework via interactive, adaptive feedback—aligning with the principles of smart learning environments.
2.2. Framework of Argument Identification Technology
The flow chart of argument identification is shown in Figure 1. It covers four subfields: identification of argumentative sentences (which take a clear position and include argument elements), classification of argument elements, analysis of argument structure, and identification of argument logic. These sub-tasks together form a comprehensive understanding and analysis of argumentative essays [17].
- (1) Argumentative sentence identification technology: The core of this technology lies in distinguishing argumentative sentences in the text, which is essentially a binary classification problem. Laha et al. [18] were the first to introduce deep learning into the field of argument mining to achieve argument boundary detection and argumentative component identification.
- (2) Argument element classification technology: This technology focuses on assigning sentences to different categories, such as claims and data. Argument element classification has evolved from traditional supervised learning to deep learning models such as RNNs and BiLSTMs. For example, Kusmantini et al. [19] used support vector machines to classify argumentative texts. Li et al. [20] proposed a joint learning RNN model based on the attention mechanism [21] to address the argument boundary detection issue.
- (3) Argument structure analysis technology: This aims to identify the relationships between argument elements in the text, such as support and opposition. Although the task often relies on domain-specific data, the generalization ability and accuracy of the models are constantly improving. For example, Stab et al. [22] used a binary SVM classifier to label relations between arguments as support or opposition in their research on argument structure identification.
- (4) Argument logic identification technology: This primarily focuses on parsing the relationships between annotated argument elements in the text, a complex task requiring the model not only to understand the content but also to have deep insight into the logical relationships behind it. Toledo-Ronen et al. [23] explored the feasibility of conducting argumentative analysis tasks in non-English environments by using the multilingual BERT model combined with transfer learning strategies. The results show that such methods are well suited for classifying the stance of arguments, but less so for assessing argument quality.
These technologies solve different tasks in argument mining but have certain limitations. Argumentative sentence identification is overly reliant on manually designed features, which may struggle to capture deep textual features. Argument element classification techniques such as RNN models place high demands on data quality and computing resources. In argument structure analysis, the SVM method of Stab et al. [22] performs well but lacks generalization ability, especially on cross-domain data. The BERT model can parse complex argument logic, but its adaptability to non-English data and complex argument structures remains a challenge. Moreover, most existing systems operate as closed-loop pipelines with limited capacity for user interaction—limiting their utility in educational settings where learner agency and teacher oversight are essential.
Current research tends to focus on single subtasks, such as claim and data identification or argument structure analysis, while ignoring the integrity of argument recognition and the interdependence between subtasks. In summary, existing technologies have their own strengths in identifying argument elements in English argumentative essays, and understanding the characteristics of these methods is crucial for selecting the appropriate argument identification technologies.
With the development of natural language processing technology, the Toulmin model has begun to be incorporated into automated analysis tools. For example, Alkhawaldeh et al. [24] proposed deep learning models such as a Lexical Chain with Multi-Head Attention and a multi-column convolutional neural network to automatically generate Toulmin arguments, demonstrating that combining them with reinforcement learning agents can improve the accuracy of generating and reasoning about Toulmin arguments. Mirzababaei et al. [25] developed a dialogue agent system based on the Toulmin model in the context of educational technology that can identify structural errors in arguments. The classifier developed in that study can detect argument elements such as claims and data, and provide feedback in dialogue to help users improve argument quality. This human-in-the-loop design exemplifies how AI can function as a collaborative tutor rather than a static evaluator—a key principle in smart learning. Yang et al. [26] proposed a method that combines weighted features and the BiLSTM-Attention model for argument mining in EFL writing. This method generates dynamic word vectors and obtains sentence-level and article-level features through the BiLSTM and Attention mechanisms, thus annotating the article content according to the Toulmin model. Fromm et al. [27] used the Toulmin model to analyze argument structure in reviews, employing a BERT-based argument mining model to automatically detect claims, data, and other elements in review texts, thus improving the efficiency of the review process.
In summary, the application of the Toulmin model in argumentative writing shows its wide value. However, the limitations of this model in dealing with complex arguments and its adaptability in cross-language environments still need to be solved through further research and exploration. Crucially, future work must prioritize not just technical accuracy but also pedagogical usability—ensuring that automated systems support, rather than replace, human judgment in learning contexts. The combination of the Toulmin model and deep learning will bring more possibilities for automated argument analysis, which will not only drive deeper academic research, but also bring new development opportunities for educational practice.
Despite these advances, a critical limitation persists in many Toulmin-based mining systems: their reliance on sentence-level classification. Models like the FWFBA framework proposed by Yang et al. [26] assign a single Toulmin label per sentence, which inherently fails to capture intra-sentence argument spans or cross-sentence dependencies—common patterns in student writing where a claim may span two clauses or data appears across multiple sentences. This coarse granularity leads to structural oversimplification, resulting in an element identification accuracy of only 69.6%, as reported by the authors. More importantly, such approaches require scanning entire essays to infer logical connections, increasing computational overhead and error propagation in complex or implicit arguments. For formative feedback in educational settings, where precision at the clause or phrase level is essential, this limitation significantly undermines pedagogical utility.
Furthermore, while lexical markers (e.g., “because” for Data, “however” for Rebuttal) are well-established rhetorical cues in argumentation theory [5,13], most deep learning models—including Yang et al.’s [26] BiLSTM-Attention pipeline—treat them implicitly through contextual embeddings without explicit exploitation. Their method does not incorporate a dedicated module to detect, validate, or weight these markers based on discourse function, missing opportunities for interpretability and error correction. Compounding this issue, the FWFBA model requires approximately 23.3 s per batch for inference, a latency that is impractical for real-time classroom feedback. In smart learning environments, where immediacy and interactivity are core design principles [2,11], such inefficiency limits teacher adoption and learner engagement. These shortcomings highlight the need for architectures that explicitly integrate linguistic knowledge with efficient sequence labeling, enabling both high accuracy and low-latency deployment.
2.3. Deep Learning Approaches for Sequence Labeling
With the development of deep learning, various neural network architectures have been applied to sequence labeling tasks in argument mining. Convolutional neural networks (CNNs) are effective at capturing local lexical patterns, while recurrent neural networks (RNNs) and their bidirectional variants (BiLSTMs) model sequential dependencies by processing text in forward and backward directions, making them well-suited for identifying argument elements such as claims and premises. BiLSTM-based models, in particular, have demonstrated strong performance in educational argument analysis due to their ability to encode contextual information from both past and future tokens.
More recently, Transformer-based architectures have emerged as a powerful alternative, leveraging self-attention mechanisms to model long-range dependencies between tokens regardless of their positional distance. Unlike recurrent models, Transformers enable parallel computation over input sequences and excel at capturing global contextual relationships—properties that are highly beneficial for analyzing complex, multi-component argumentative structures. The success of attention mechanisms has also extended beyond natural language processing; for example, transformer models have been adapted for the automated acquisition of static street view images to analyze building characteristics [28], and attention-based approaches have shown strong cross-building transferability in fault detection and diagnosis for air handling units in auditoriums and hospitals [29].
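The scaled dot-product attention at the heart of these architectures can be sketched with the standard library alone; the vectors below are toy values for illustration, not learned embeddings.

```python
import math

def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query over a token sequence.

    Each weight reflects how strongly the query attends to each key,
    independent of positional distance between the tokens.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                                # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]               # softmax over the scores

# A query most similar to the third key receives the largest weight,
# no matter how far away that key sits in the sequence.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
w = attention_weights(q, keys)
```

This distance-independence is what lets attention link, say, a rebuttal to a claim stated several sentences earlier, which recurrent models must carry through every intermediate step.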
Despite these advances, existing studies in educational argument mining rarely integrate explicit linguistic cues—such as domain-specific key argumentative markers—with structured knowledge about argumentation schemes. As a result, models often fail to leverage the semantic signals that human annotators rely on when identifying Toulmin elements. Therefore, integrating domain knowledge with key argumentative markers is essential to improve the performance and interpretability of argument element recognition in educational contexts.
2.4. Identification of Key Markers
Identifying argument elements in English essays relies heavily on detecting rhetorically significant phrases known as key markers, not merely high-frequency keywords. Unlike general keyword extraction (e.g., TF-IDF or TextRank), argumentative key markers signal specific Toulmin functions: “since” often introduces data, “nevertheless” signals a rebuttal, and “my view is that” may frame a claim; similarly, “for example” typically marks the beginning of an illustration, while “I think” can indicate the author’s personal opinion. Effective identification of these markers is crucial for machines to comprehend argument structure and provide interpretable, pedagogically meaningful feedback. In educational applications, such markers also serve as teachable cues that can be highlighted to learners, helping them internalize rhetorical conventions through AI-mediated scaffolding. Key marker identification therefore differs from traditional keyword extraction: whereas keywords are usually high-frequency important words in a text, key markers expose the elements of its argument structure, which is of great significance for improving the accuracy and depth of argument analysis.
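To make the distinction concrete, a minimal rule-based detector might map such cue phrases to the Toulmin function they typically signal. The lexicon below is hypothetical, assembled from the examples above; the actual system mines markers with deep learning rather than a fixed list.

```python
# Illustrative cue lexicon (hypothetical); real systems mine these
# markers from annotated corpora rather than hard-coding them.
MARKER_LEXICON = {
    "since": "data",
    "because": "data",
    "nevertheless": "rebuttal",
    "however": "rebuttal",
    "my view is that": "claim",
    "i think": "claim",
}

def find_markers(sentence):
    """Return (marker, toulmin_function) pairs found in a sentence.

    Naive substring matching for brevity; longer markers are tried first
    so "my view is that" is not shadowed by a shorter overlapping entry.
    """
    text = sentence.lower()
    hits = []
    for marker in sorted(MARKER_LEXICON, key=len, reverse=True):
        if marker in text:
            hits.append((marker, MARKER_LEXICON[marker]))
    return hits

hits = find_markers("Nevertheless, I think exams are useful since they motivate study.")
```

Even this crude lookup shows why markers carry more structural signal than raw frequency: a single sentence can expose a claim, a rebuttal cue, and a data cue at once.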
Early keyword extraction technologies originated from the word-frequency-based method proposed by Luhn [30] in 1957. These techniques have since evolved along two lines: unsupervised and supervised. Unsupervised methods such as TextRank, proposed by Mihalcea et al. [31], rely on statistical features and graph models, determining the importance of words through word co-occurrence and the PageRank algorithm. Supervised methods exploit text semantics through classification or sequence labeling. Sarkar et al. [32] used naive Bayes for keyword extraction and found that naive Bayes classifiers with suitable discretization algorithms achieve good extraction performance; Guleria et al. [33] combined SVM with a feature extraction method to effectively improve keyword extraction, significantly outperforming traditional methods such as SingleRank, ExpandRank, and baseline TF-IDF in terms of accuracy. Li et al. [34] proposed a keyword extraction method combining a phrase-level attention mechanism and conditional random fields, which effectively captures phrase-level features and combines them with word-level features to improve extraction performance.
Unsupervised methods do not require manually annotated training data. Through word co-occurrence and graph models, they can better capture the structural relationship within the text. However, the inability to fully utilize the semantic information of the text and dynamically adjust parameters limits semantic understanding. Supervised methods can more accurately understand and extract keywords from texts by learning from labeled data. However, the need for extensive labeled data leads to high costs, long model training times, and substantial consumption of computing resources. These constraints are particularly problematic in educational settings, where rapid deployment and low-resource adaptability are often required.
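To make the unsupervised line concrete, a toy TextRank over a word co-occurrence graph might look like the sketch below (deliberately simplified: uniform edge weights, a fixed iteration count, and no convergence test, unlike the published algorithm).

```python
def textrank(words, window=2, damping=0.85, iters=30):
    """Toy TextRank: score words by PageRank over a co-occurrence graph.

    Words co-occurring within `window` positions are linked; repeated
    iteration lets importance flow along those links (no labels needed).
    """
    vocab = sorted(set(words))
    neighbors = {w: set() for w in vocab}
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j and words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    score = {w: 1.0 for w in vocab}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[n] / len(neighbors[n]) for n in neighbors[w] if neighbors[n]
            )
            for w in vocab
        }
    return score

# "evidence" occurs twice and thus sits in a dense neighborhood of the graph.
words = "strong evidence supports the claim because evidence matters".split()
scores = textrank(words)
```

Note what the example also demonstrates: the highest-scoring word is simply the best-connected one, with no regard for its argumentative function, which is exactly the gap key marker identification targets.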
In recent years, the progress of deep learning technology has significantly improved key marker identification and text feature extraction. Cheng et al. [35] combined part-of-speech tagging with the BiLSTM-CRF model, significantly improving the accuracy of keyword extraction; this approach preserves the correlation between words while mining temporal and semantic information from the data. The introduction of Transformer models has further improved key marker identification. Sun et al. [36], building on the BERT model, captured both local phraseness and global informativeness when extracting key phrases. BERT is highly flexible and can be applied in both unsupervised and supervised settings, depending on the needs of pre-training and fine-tuning. Xu et al. [37] elaborated on the advantages of the Transformer in multimodal data processing, emphasizing its ability to capture contextual information and semantic relationships. Yan et al. [38] likewise showed that the Transformer model has significant advantages in feature extraction and in processing complex text: through pre-training on a large-scale corpus, it captures richer contextual information and semantic relationships, and, combined with the self-attention mechanism, such models show greater flexibility and accuracy on complex texts. However, for real-time educational applications, lightweight adaptations of these models—such as those enabling CPU-based inference—are essential to ensure accessibility and responsiveness in diverse classroom environments.
In summary, while the Toulmin model has shown wide value in automated argument analysis, two persistent gaps limit its educational impact: (1) the dominance of sentence-level modeling that ignores fine-grained argument spans, and (2) the underutilization of explicit lexical markers that could enhance both accuracy and explainability. Moreover, existing systems like those in [24,25,26,27] operate as closed-loop pipelines with high latency and limited user interaction—failing to support the human-in-the-loop, real-time feedback required in smart learning ecosystems.
3. Identification of Argument Elements Based on Key Markers
3.1. HCI System Design for Collaborative Argument Analysis
The core innovation of this study is a dual-workflow human–AI collaboration system that integrates advanced NLP with interactive user control.
This research aims to leverage HCI principles in developing an intuitive and efficient automated system for identifying Toulmin’s argument elements. The system design focuses on improving the user experience by providing real-time feedback to users during the essay evaluation process. Such immediacy is particularly valuable in educational contexts, where timely formative feedback can significantly enhance students’ argumentative writing development.
We integrated a deep learning-based architecture that allows users to adjust parameters or review automatic annotations interactively. The human side shows a simple user journey: uploading text, reviewing results, and making adjustments. The machine side reveals the technical processes: text preprocessing, multi-level feature extraction using BERT-CNN, BiLSTM, and attention mechanisms, followed by the KFF-Transformer, which identifies Toulmin’s argument elements. The system incorporates user feedback for continuous improvement, combining advanced NLP techniques with human oversight. The HCI flow chart shown in Figure 2 depicts a text analysis system with parallel human and machine workflows.
As shown in Figure 2, this system enables iterative refinement of argument analysis through three key phases:
- (1) User-Driven Workflow: Integrating Human Expertise
In the first phase, users upload English argumentative essays via an intuitive web interface, and the system generates initial visual annotations of Toulmin elements (e.g., claims, data, counterarguments) using color-coded highlighting. Users can then review these results by hovering over tags to inspect confidence scores and contextual explanations, and dynamically adjust classifications through dropdown menus or boundary edits, so that human oversight corrects potential AI misclassifications before finalization. This interactive review process prioritizes user control, reducing cognitive load and allowing educators or learners to refine argument validity based on their expertise, while the visual feedback mechanism fosters transparency and engagement. By externalizing the AI’s reasoning through explainable markers and confidence indicators, the system supports metacognitive awareness—helping learners understand not just what is wrong, but why.
- (2) AI-Driven Processing: Automated Argument Decomposition
The second phase shifts to automated computational processing, where the machine executes a multi-stage pipeline starting with text preprocessing (sentence segmentation, tokenization, POS tagging, and dependency parsing), followed by multi-level feature extraction using a CNN for local n-gram patterns and BiLSTM-Attention for global discourse dependencies. This culminates in the KFF-Transformer model fusing key marker embeddings, part-of-speech encodings, and positional information through a multi-head attention mechanism to classify Toulmin elements, with adaptive computation concentrating resources on marker-dense regions to achieve real-time performance. This optimized workflow reduces identification time by 18.9% compared with traditional methods, enabling swift responses during writing sessions while maintaining high accuracy through attention-based weighting of contextual features. The efficiency gain is critical for classroom integration, where latency must be minimized to sustain learner engagement during drafting or peer review activities.
- (3) Collaborative Feedback Loop: Co-Adaptive Refinement
The third phase establishes a collaborative feedback loop that bridges the human and AI workflows: user adjustments trigger incremental model fine-tuning via transfer learning, capturing discipline-specific writing patterns (e.g., legal vs. scientific arguments) in personalized marker dictionaries, while all corrections are logged as (input, correction) pairs for pedagogical analytics, such as identifying common misconception trends or supporting longitudinal skill assessment. This bidirectional adaptation embodies Horvitz’s mixed-initiative interaction principles [39], allowing the system to evolve with user input and thereby refine argument analysis iteratively, supporting continuous improvement in writing quality through co-adaptive refinement. Over time, this loop transforms the system from a static analyzer into a personalized writing coach, aligning with the vision of smart learning environments that adapt to individual learner trajectories.
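As an illustrative sketch only (the class, method names, and promotion threshold below are assumptions, not the deployed system’s API), the logging of (input, correction) pairs and the growth of a personalized marker dictionary in this feedback loop might look as follows:

```python
from collections import defaultdict

# Hypothetical sketch of the collaborative feedback loop: every human
# correction is logged as an (input, correction) pair, and a cue word that
# recurs across corrections is promoted into a personalized marker dictionary
# that later fine-tuning can draw on. All names and thresholds are illustrative.
class FeedbackLoop:
    def __init__(self, promote_after=2):
        self.corrections = []               # logged (input span, corrected label) pairs
        self.cue_counts = defaultdict(int)  # cue word -> times confirmed by users
        self.personal_markers = {}          # cue word -> Toulmin element label
        self.promote_after = promote_after  # promote a cue after N confirmations

    def record_correction(self, span, corrected_label, cue=None):
        # Log the pair for pedagogical analytics and incremental fine-tuning.
        self.corrections.append((span, corrected_label))
        if cue is not None:
            self.cue_counts[cue] += 1
            # A cue confirmed often enough becomes a personalized marker.
            if self.cue_counts[cue] >= self.promote_after:
                self.personal_markers[cue] = corrected_label

loop = FeedbackLoop()
loop.record_correction("since fees rose sharply", "Data", cue="since")
loop.record_correction("since the 2019 reform", "Data", cue="since")
# after two confirmations, "since" is promoted as a personalized Data marker
```

In a real deployment, the promoted markers would feed the transfer-learning step described above rather than a simple dictionary.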
Figure 3 presents the interface of our Toulmin AI Recognition System, featuring four integrated modules designed for collaborative argument analysis. The Element Selection panel (top-left) provides color-coded identification of argument components with red for claims, blue for data, and green for rebuttals. An adjustable Confidence Threshold slider (top-right) allows users to fine-tune annotation sensitivity between 0.5 and 0.9, with a default setting of 0.7. The interactive workspace supports real-time boundary adjustment through draggable selection handles, while the Personalized Markers module (bottom-right) enables custom annotation patterns through an expandable sidebar interface. This unified design supports mixed-initiative interaction by maintaining AI automation while preserving precise human control through intuitive visual encoding and adaptive controls. Notably, the interface is designed for both novice learners (who benefit from guided defaults) and expert instructors (who require granular control), reflecting inclusive design principles in educational technology.
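The behavior of the Confidence Threshold slider can be sketched as a simple filter over annotations; the function and field names below are illustrative assumptions, not the interface’s actual code:

```python
# Illustrative sketch of the Confidence Threshold slider: annotations whose
# model confidence falls below the chosen value (0.5-0.9, default 0.7 in the
# interface) are hidden from the workspace. Names are hypothetical.
DEFAULT_THRESHOLD = 0.7

def visible_annotations(annotations, threshold=DEFAULT_THRESHOLD):
    """Keep only annotations whose confidence meets the slider value."""
    return [a for a in annotations if a["confidence"] >= threshold]

anns = [
    {"span": "uniforms promote equality", "label": "Claim",    "confidence": 0.91},
    {"span": "a 2019 survey shows",       "label": "Data",     "confidence": 0.66},
    {"span": "critics argue otherwise",   "label": "Rebuttal", "confidence": 0.74},
]
shown = visible_annotations(anns)  # at the 0.7 default, the 0.66 Data span is hidden
```

Lowering the slider to 0.5 would surface all three spans; raising it to 0.9 would leave only the high-confidence claim, which is how novices can trade recall for precision without touching the model.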
3.2. Automatic Key Marker Mining Algorithm
This study designs a new deep learning framework, the “Automatic Key Marker Mining Algorithm”, which aims to automatically mine additional potential key markers based on known key markers. For example, given the sentence “Students should wear uniforms because it promotes equality”, the algorithm identifies “because” as a key marker signaling the Data component. Similarly, “however” or “on the contrary” would be mined as markers for Rebuttal. The key marker mining flow chart is shown in Figure 4, with the specific implementation steps of the algorithm described as follows.
First, the algorithm uses a convolutional neural network (CNN) to analyze short-distance information. Since text is one-dimensional data, the CNN extracts feature vectors effectively by applying one-dimensional convolution kernels. Subsequently, to process long-distance information, we integrate Bidirectional Long Short-Term Memory (BiLSTM) with an attention mechanism. The bidirectional structure of the BiLSTM allows the algorithm to consider both the preceding and following context, capturing long-distance dependencies in the text and providing an optimal prediction at each time step. The attention mechanism assigns a separate weight to each part of the input sequence, making the model more flexible when processing input data and allowing it to focus selectively on useful information, such as key markers, to improve performance. This selective focus mimics how skilled readers prioritize rhetorical cues, a capability essential for modeling human-like argument comprehension in educational AI.
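The attention weighting described above can be sketched in toy form. The dot-product scoring, fixed query, and two-dimensional hidden states below are simplifications for illustration, not the learned attention of the actual BiLSTM-Attention model:

```python
import math

# Toy sketch of attention over a sequence of hidden vectors: each position
# receives a scalar score, scores are softmax-normalized into weights, and
# the context vector is the weighted sum of the hidden states. The scoring
# function (dot product with a fixed query) is a deliberate simplification.
def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(hidden, query):
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    context = [sum(w * vec[i] for w, vec in zip(weights, hidden))
               for i in range(dim)]
    return weights, context

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three time steps, dim 2
weights, context = attend(hidden, query=[1.0, 0.0])
# positions whose hidden state aligns with the query receive larger weights
```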
To further enhance the accuracy and reliability of marker prediction, we introduce a conditional random field (CRF) layer. The CRF applies dependency constraints learned automatically from the training data, such as the interdependence between adjacent tags and the constraint that an “I” (inside-marker) tag cannot appear without a preceding “B” (begin-marker) tag, compensating for potentially invalid or disordered label sequences in sequence labeling tasks and ensuring that the predicted sequence satisfies the tagging scheme. By adding a CRF layer after the output layer of the BiLSTM-Attention model, the algorithm can weigh label decisions over the entire sequence rather than token by token, thereby achieving the globally optimal labeling sequence. This constraint-based decoding is especially important in educational applications, where grammatical and rhetorical plausibility must be preserved to avoid misleading learners with syntactically anomalous suggestions.
3.3. Key Marker-Based Argument Element Identification Model
Building on the previous section, this paper proposes a deep learning model named KFF-Transformer, grounded in key argumentative markers. Here, “KFF” denotes Key Feature Fusion—a novel module that dynamically integrates sentence-level and discourse-level features through an attention-weighted combination of key markers, part-of-speech tags, and positional cues. The term “Transformer” does not imply a full Transformer encoder, but rather highlights the use of a Transformer-style attention mechanism within the KFF module to enable interpretable, context-aware fusion of multi-source features. Although the sequence modeling backbone employs a BiLSTM (rather than a standard Transformer block), we refer to the overall framework as KFF-Transformer to emphasize this critical attention-driven design, which directly supports fine-grained, pedagogically meaningful analysis of Toulmin elements.
This model integrates annotated and expanded key markers into sentences to achieve word-level recognition, enabling direct identification of the six elements of the Toulmin model in sentences containing specific marker phrases or even a single marker word. This marker-first strategy not only boosts precision but also enhances model interpretability, allowing teachers to trace AI decisions back to explicit linguistic cues, a crucial requirement for trustworthy educational AI.
The marker-based argumentation element identification model for English argumentative essays is shown in Figure 5. In the model, “Sen” represents the sentence in the argumentative essay and “Pos” represents the part-of-speech tagging; both are encoded using BERT [12], a pre-trained Transformer-based language model, to obtain contextualized semantic vectors. “Keyword” represents the key marker, “position” represents the position of each sentence in the text, “x1” is the sentence-level text feature encoding obtained through the linear layer, and “x2” is the chapter-level text feature encoding.
It should be emphasized that while our framework leverages BERT—a genuine Transformer-based encoder—for foundational contextual representation, the core sequence labeling module of KFF-Transformer is built upon a BiLSTM network integrated with an attention mechanism and does not implement a standard Transformer block (e.g., multi-head self-attention with feed-forward sublayers). The name “KFF-Transformer” thus reflects both the use of BERT and the system’s functional role as a transformer from unstructured text to structured argument annotations.
The workflow of the KFF-Transformer model is as follows. When the model encounters a phrase or word containing a key marker (regardless of case, plurality, or tense), the corresponding category among the six elements is identified immediately. For instance, in the sentence “Although some argue uniforms stifle creativity, this ignores their role in reducing distractions”, the model labels “stifle creativity” as Counter Claim and “reducing distractions” as Rebuttal, demonstrating fine-grained, intra-sentence identification. If no marker is present, the BERT model is used to encode the text and generate semantic vectors for the sentences. These semantic vectors are then sent to the first level of the model together with the expanded key markers obtained in the previous section, where a Bidirectional Long Short-Term Memory (BiLSTM) network combined with an attention mechanism constructs sentence-level text feature representations. Simultaneously, we construct a text-level representation by combining the first-layer feature expressions with the positional information of the text. Finally, the six elements of the Toulmin model are identified by fusing the inter-sentence and chapter-level features.
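The marker-first lookup step can be sketched minimally as follows; the dictionary contents and the crude suffix-stripping normalizer (a stand-in for the system’s case, plurality, and tense handling) are illustrative assumptions:

```python
# Hypothetical sketch of the marker-first lookup: tokens are normalized
# (lower-cased, punctuation stripped, with naive suffix stripping standing in
# for plurality/tense handling) and matched against a key marker dictionary
# mapping cues to Toulmin elements. The real dictionary is far richer.
MARKER_DICT = {
    "because": "Data",
    "however": "Rebuttal",
    "although": "Counter Claim",
    "therefore": "Claim",
}

def normalize(token):
    token = token.lower().strip(".,;:!?")
    for suffix in ("ing", "ed", "es", "s"):   # crude morphological stand-in
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def match_markers(sentence):
    hits = []
    for token in sentence.split():
        base = token.lower().strip(".,;:!?")
        label = MARKER_DICT.get(base) or MARKER_DICT.get(normalize(token))
        if label:
            hits.append((token, label))
    return hits
```

Sentences with no dictionary hits would fall through to the BERT-based encoding path described above.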
Introducing the features of key markers into the argument mining task enhances the model’s ability to analyze and understand arguments. These features provide valuable clues and contextual information, enabling the model to evaluate, analyze, and generate persuasive arguments more accurately. The specific calculation formula is shown in Equation (1).
Here, “Sen” represents the sentence in the argumentative essay, and “Keyword” is the key marker encoding produced by the model in the previous section. After “Sen” and “Keyword” are concatenated and passed into the BiLSTM, the fused feature representation is obtained.
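Equation (1) itself is not reproduced in this excerpt; based on the description above, one plausible form, with hypothetical symbol names, is:

```latex
% Hedged reconstruction, not the paper's exact Equation (1):
% e_{Sen} and e_{Keyword} denote the encodings of the sentence and its key
% markers; [\,\cdot\,;\,\cdot\,] denotes vector concatenation.
x_1 = \mathrm{BiLSTM}\big(\,[\,e_{\mathrm{Sen}}\;;\;e_{\mathrm{Keyword}}\,]\,\big)
```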
The KFF-Transformer model fully considers the fact that different Toulmin element identification tasks have varying requirements for textual features, and the contribution of text features at different levels to argument mining also varies. Therefore, this section adopts the feature-weighted fusion method to combine sentence-level and paragraph-level features, aiming to improve identification accuracy and stability. The feature weighting formula is shown in Equation (2).
“W” is the weight matrix, which is randomly initialized and updated during model training. The paragraph-level representation is obtained by encoding the concatenation of the sentence features, POS features, and positional information with a BiLSTM. “Linear” denotes the linear layer, which reduces the dimensionality of the sentence feature representation. The POS vector is the representation of part-of-speech tags aligned with the sentence. This dynamic weighting mechanism allows the model to adaptively emphasize either local marker evidence or global discourse context depending on the argumentative genre, a flexibility that mirrors expert human judgment and supports cross-domain applicability in diverse educational settings.
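Equation (2) is likewise not reproduced in this excerpt; one plausible instantiation of the feature-weighted fusion described above, with hypothetical symbols, is a learned gate over the two feature levels:

```latex
% Hedged sketch of a feature-weighted fusion consistent with the text:
% x_1 is the (linearly projected) sentence-level feature, x_2 the
% paragraph/chapter-level feature, W the trainable weight matrix,
% \sigma a sigmoid gate, and \odot elementwise multiplication.
\alpha = \sigma\big(W\,[\,x_1\;;\;x_2\,]\big), \qquad
x = \alpha \odot x_1 + (1 - \alpha) \odot x_2
```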
5. Conclusions
This study presents KFF-Transformer, a human–AI collaborative framework for fine-grained argument analysis that integrates structured argumentation theory with advanced deep learning models and interactive system design. By embedding Toulmin’s model into an intelligent computing system, the proposed framework moves beyond traditional static scoring approaches and demonstrates how real-time, explainable, and collaborative AI systems can be realized in practice.
Experimental results show that KFF-Transformer achieves notable improvements in both efficiency and accuracy, including a 3.7 percentage-point increase in accuracy, a 2.8-point improvement in F1-score, and a processing time reduction of 18.9% on CPU—with near-real-time inference (3.3 s) achieved on GPU platforms. Critically, the human-in-the-loop interaction mechanism not only improves annotation reliability but also empowers users to actively shape system behavior through dynamic corrections, thereby fostering a co-adaptive writing support experience.
From a system perspective, the core contribution of this work lies in the synergistic integration of fine-grained, key marker-driven argument identification, a mixed-initiative interaction paradigm, and an efficient, hardware-aware implementation. Specifically, the framework grounds its predictions in linguistically meaningful key markers to enhance interpretability; it preserves human agency by enabling users to review, correct, and refine automated suggestions in real time; and it is engineered for responsiveness on widely available computing devices—from laptops equipped with consumer-grade GPUs to cloud-based infrastructures—ensuring practical deployability without sacrificing performance. Together, these features position KFF-Transformer as a practical exemplar of trustworthy, human-centered AI in educational technology.
From a computing and artificial intelligence perspective, this work contributes a scalable, interpretable, and responsive argument mining framework that successfully bridges deep learning architectures with interactive design principles. While recent large transformer-based models offer strong performance, they often lack transparency and fine-grained interpretability. In contrast, our KFF-Transformer grounds its predictions in explicit, token-level key markers—providing not only high accuracy but also clear, pedagogically meaningful explanations for each identified argument element. This transparency makes the system particularly suitable for educational settings, where understanding why a claim or rebuttal was detected is as crucial as the detection itself.
Nevertheless, error analysis on misclassified instances reveals several limitations. The model occasionally fails when argument elements lack explicit key markers (e.g., implicit claims like “Uniforms are unfair”), struggles with syntactically nested rebuttals (e.g., “Although X, which challenges Y, Z remains valid”), and exhibits reduced robustness on domain-specific jargon outside the training distribution. These cases highlight the current reliance on marker-driven signals—a limitation partially mitigated in practice by the human-in-the-loop interface, which allows users to correct ambiguous outputs during interactive review.
Future work will focus on extending the framework to multilingual and cross-domain argumentative texts, integrating large language models (LLMs) for hybrid symbolic-neural reasoning, and optimizing the pipeline for large-scale deployment in cloud-native intelligent tutoring systems. Additionally, we plan to investigate long-term user engagement and learning gains through classroom-based longitudinal studies.