Existing methods for measuring deliberative quality, particularly the DQI and computational argumentation approaches, are reviewed. This section highlights the key limitations of the DQI and introduces the Abstract Argumentation Framework (AAF) as a complementary method to enhance discourse analysis.
2.1. Deliberation and Deliberative Quality
Deliberation refers to carefully considering and discussing different perspectives and ideas before reaching a conclusion; it centers on individuals engaging in well-informed discussions to reach rational conclusions [
9]. It involves thoughtful analysis, active listening, and open-mindedness [
10,
11]. It can be applied to various contexts, from local discussions to international politics. Critics have raised concerns about social inequalities and the potential for certain groups to dominate deliberative settings. Other factors, such as domination, issue polarization, and agenda setting, can affect the quality of deliberation [
12]. It is thus essential to measure the quality of these processes. Deliberative discussions involve reaching a consensus on a course of action for an underlying topic. Various mechanisms within the social sciences have been developed and applied to assess the quality of such deliberations [
1,
2,
4,
6,
13]. Analyzing and assessing the quality of a deliberation process is crucial because it allows an assessment of whether decisions are well-informed, fair, and inclusive. Any deliberation studies have been evaluated by the empirical method DQI [
1,
14], for example to conduct analysis on parliamentary debates [
10]. Here, the emphasis is on rational, inclusive discourse in which participants justify claims through validity-driven argumentation free from coercion. The primary focus of the DQI is on equal participation, justification, the content of justification, counterarguments, respect, and constructive politics. Still, it does not capture the inclusion and equality of participants, nor the role of satire, in deliberation. For example, one speaker may have more speaking time than others in a deliberation [15] or say something humorous that may not meet the deliberative criteria.
Recent critiques highlight the DQI’s inability to capture structural inequalities in deliberation. Mockler [
16] critiques the DQI by introducing the concept of deliberative uptake, which assesses whether a participant’s ideas are acknowledged and engaged with during deliberation. Using Ireland’s Convention on the Constitution, a forum whose members included both citizens and politicians, the research reveals how structural inequalities influence deliberation dynamics, with politicians receiving greater uptake than citizens. This power asymmetry underscores the limitations of traditional metrics like the DQI, exposing its blindness to inclusion and systemic biases. Bächtiger [
17] notes that the DQI’s static aggregation fails to account for dynamic argumentation processes, such as retractions or evolving consensus, limiting its relevance in asynchronous or large-scale deliberations. Further critiques focus on scalability and automation challenges. Fournier-Tombs and MacKenzie demonstrated that machine learning adaptations such as DQI 2.0 remain constrained by their dependency on labeled datasets and human expertise, struggling with contextual nuances such as the identification of ‘constructive politics’ in social media debates [
3]. Despite leveraging LLMs, AQuA inherits the DQI’s conceptual blind spots, such as neglecting conflict resolution [
18]. In 2014, the Deliberative Transformative Moments (DTM) framework was introduced [
4] as an extension of the DQI designed for small-group deliberations; it revealed that personal narratives, such as those shared by Colombian ex-combatants, can elevate or diminish deliberation quality. By focusing on predefined indicators like justification and respect, the DQI neglects the emotional and narrative dimensions of discourse, which are particularly salient in post-conflict settings [
4].
The rise of online deliberation platforms has further strained existing metrics. A 2015 systematic review of digital deliberation [
19] found that most studies prioritized platform design over process analysis, leaving a vacuum in tools to assess asynchronous, large-scale discussions. For example, the Index of the Quality of Understanding (IQU) [
20], developed to analyze local political debates in Zurich, quantified participation rates but struggled to differentiate between substantive contributions and superficial engagement. Klinger and Russmann noted that while the IQU captured the quantity of deliberation, for example, using word counts or topic frequency, it inadequately measured the quality of reasoning or inclusivity [
20]. These limitations underscore a broader challenge: though rigorous in controlled settings, existing indices lack the flexibility to adapt to the heterogeneous nature of real-world deliberations. The Deliberative Reason Index (DRI) [
21] is designed to evaluate the quality of deliberation at the level of group reasoning rather than of individual inputs. This method uses surveys before and after discussions to assess participants’ views and preferences on the debated topics; it computes agreement scores that are combined to produce an overall score for the group. Cognitive complexity (CC) [
13] is another metric for assessing deliberation quality. Although it may not be a flawless indicator of deliberative quality, CC is valuable for exploring how an argument is constructed and which cognitive processes are involved in its spoken or written expression. The authors provide an equation to measure the CC of a single speech act based on nine indicators [
13], which were computed using Linguistic Inquiry and Word Count (LIWC), a computerized text analysis method [
22]. Other studies also use this measure on Twitter, Reddit, and citizen assembly deliberative datasets [
23,
24,
25].
Efforts to automate deliberative analysis have partially addressed these issues. The DelibAnalysis framework [
2], which applies machine learning to classify speech acts based on DQI criteria, demonstrated that computational tools could reduce reliance on manual annotation. By training models on UK parliamentary debates, researchers achieved moderate accuracy in detecting indicators such as ‘respect’ and ‘content of justification’ [
3]. However, DelibAnalysis and its successor, DQI 2.0 [
3], remain constrained by their dependency on high-quality labeled datasets and political analysts’ expertise. For instance, labeling ‘constructive politics’ in Twitter debates requires contextual knowledge of platform-specific norms, which automated systems struggle to generalize [
25]. Recent advancements like AQuA [
18], which integrates Large Language Models (LLMs) with deliberative indices, show promise in scaling analysis but remain constrained by annotation subjectivity and by reliance on expert annotations (often scarce) or non-expert inputs. Bächtiger et al. (2022) analyzed decades of deliberative research, emphasizing the need to reframe deliberation as a performative, distributed process [
17]. Their critique highlights flaws in the original DQI, such as overlooking participation equality and real-time interactivity, as well as its rigid aggregation methods and systematic biases; they argue against applying uniform DQI metrics across contexts such as parliamentary versus online debates and advocate interdisciplinary approaches to enhance measurement.
2.2. Computational and Abstract Argumentation
Alongside advances in deliberative theory, the Abstract Argumentation Framework [
8] has emerged as a strong paradigm for modeling defeasible reasoning [
26,
27,
28,
This reasoning is non-monotonic, following a type of logic in which conclusions can be revised in light of new information [
30,
31,
32,
33]. In other words, defeasible reasoning is a form of non-monotonic reasoning that allows conclusions to be drawn when only partial information is available, an ability humans achieve by using default knowledge [
29]. However, if new data/evidence become available that contradict the preconditions on which the default knowledge is based, any conclusions drawn using that knowledge can be retracted [
26]. There has been a rapid increase in research on computational argumentation and its systems [
34,
35,
36], which help compute the dialectical status of arguments in a dialogical structure containing conflicts and implement non-monotonic reasoning in practice. On the one hand, in monotonic reasoning, an argument’s conclusions cannot be retracted in light of other arguments with contradicting findings; the term ‘monotonic’ reflects the fact that the set of conclusions can only grow as further arguments are considered. On the other hand, in non-monotonic reasoning, conclusions brought forward by specific arguments can be rejected in light of further information supported by different arguments; thus, the set of conclusions can shrink [
37,
38].
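To make the monotonic/non-monotonic contrast concrete, the following toy sketch (our illustration, using the classic ‘birds fly by default’ example rather than anything from the cited works) shows a conclusion drawn from default knowledge being retracted once contradicting evidence is added:

```python
# Toy illustration of non-monotonic reasoning: a conclusion drawn from
# default knowledge is retracted once contradicting evidence is added.
def conclusions(facts: set) -> set:
    """Apply one default rule: birds normally fly, unless known otherwise."""
    derived = set()
    if "bird(tweety)" in facts and "not_flies(tweety)" not in facts:
        derived.add("flies(tweety)")  # defeasible (default) conclusion
    return derived

facts = {"bird(tweety)"}
print(conclusions(facts))       # {'flies(tweety)'}

facts.add("not_flies(tweety)")  # new, contradicting information
print(conclusions(facts))       # set(): the earlier conclusion is retracted
```

Adding a fact shrinks the set of conclusions, which is precisely what cannot happen under monotonic reasoning.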
In practice, the reasoning followed in a deliberative process is defeasible because human deliberation is inherently provisional: arguments are advanced tentatively and are subject to revision as new information emerges. For example, a policymaker might initially argue that ‘year-round schooling improves educational outcomes’ (Argument A) but retract this position if confronted with evidence that ‘summer programs mitigate learning loss’ (Argument B). This iterative process mirrors the dialectical nature of deliberative democracy, where consensus is forged through the clash and synthesis of opposing viewpoints. In deliberative dialogues, human reasoners use default knowledge to create their arguments; since these reasoners interact, the conclusions supported by one argument can contradict those supported by others, and retractions are therefore possible. Formally, ‘An argument is a tentative inference that links one or more premises to a conclusion’ [
39]. Various types of arguments can be formalized, including forecast and mitigating arguments [
40]. ‘Forecast arguments are tentative defeasible inferences: they can be seen as justified claims and definitions similar to the argument’. ‘Mitigating arguments question the validity of a forecast argument, whether that argument’s premises or the conclusion are valid, and whether any uncertainties appear’. They help to identify conflicts between arguments. In 1995, Dung proposed the Abstract Argumentation Framework, showing how the acceptability of arguments can be determined and conflicts resolved independently of the application context and of the arguments’ internal structure [
8]. Formally, we have the following: an argumentation framework is a pair AF = (AR, attacks), where AR is a set of arguments and attacks is a binary relation on AR, that is, attacks ⊆ AR × AR. An example of a framework is AF = ({A, B, C, D, E}, {(B, A), (C, B), (D, B), (E, C)}) (argumentation graph of Figure 1).
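For concreteness, such a framework can be represented directly as a pair of Python collections; this is only a minimal sketch (the identifiers are ours), encoding the Figure 1 framework used throughout this section:

```python
# Minimal encoding of an abstract argumentation framework AF = (AR, attacks):
# arguments are labels, and the attack relation is a set of ordered pairs.
AR = {"A", "B", "C", "D", "E"}
attacks = {("B", "A"), ("C", "B"), ("D", "B"), ("E", "C")}
AF = (AR, attacks)  # e.g., ("B", "A") means argument B attacks argument A
```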
Let (AR, attacks) be an argumentation framework, and let S ⊆ AR. The set S is said to be conflict-free if there are no arguments A, B in S such that A attacks B. A set of arguments S is admissible if and only if it is conflict-free and can defend itself. In the literature of argumentation theory, three types of conflict are usually formalized: rebutting, undermining, and undercutting [
41,
42]. A rebutting attack occurs when an argument negates the conclusion of another argument [
26,
43]. In an undermining attack, an argument is attacked on one of its premises by another argument whose conclusion negates that premise [
26,
43]. An undercutting attack occurs when an argument that relies on a defeasible inference rule is attacked by an argument claiming that, in the particular case at hand, that rule cannot be applied [
44]. Once an argumentation framework has been constructed, its topological structure is often examined to extract conflict information and to partition the arguments into conflict-free sets of acceptable arguments [
45,
46]. Acceptability semantics can be defined in terms of extensions [
8], following different principles.
Various acceptability and extension-based semantics have been proposed for abstract argumentation [
48,
49,
50,
51]. Given an argumentation framework AF = (A, R) and a set S ⊆ A, we have the following:
S is conflict-free if (a, b) ∉ R for all a, b ∈ S.
a ∈ A is acceptable w.r.t. S (or equivalently S defends a) if for each b ∈ A with (b, a) ∈ R, there is c ∈ S with (c, b) ∈ R.
The characteristic function of (A, R) is defined by F(S) = {a ∈ A : a is acceptable w.r.t. S}.
S is admissible if S is conflict-free and S ⊆ F(S).
S is a complete extension of (A, R) if it is conflict-free and a fixed point of F.
S is the grounded extension of (A, R) if it is the minimal (w.r.t. ⊆) fixed point of F.
S is a preferred extension of (A, R) if it is a maximal (w.r.t. ⊆) complete extension.
S is a stable extension of (A, R) if it is conflict-free, and for each a ∉ S, there is b ∈ S with (b, a) ∈ R.
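As a hedged sketch of how these definitions can be operationalized (the function names are ours, not taken from [8]), the code below implements the conflict-free and acceptability tests, the characteristic function F, and the grounded extension obtained by iterating F from the empty set; applied to the Figure 1 framework, it yields {A, D, E}:

```python
# Sketch: extension-based notions for a finite AF = (A, R).
# The grounded extension is computed as the least fixed point of the
# characteristic function F, reached by iterating F from the empty set.

def conflict_free(S, R):
    return all((a, b) not in R for a in S for b in S)

def acceptable(a, S, A, R):
    # a is acceptable w.r.t. S if S attacks every attacker of a
    return all(any((c, b) in R for c in S) for b in A if (b, a) in R)

def F(S, A, R):
    # characteristic function: all arguments acceptable w.r.t. S
    return {a for a in A if acceptable(a, S, A, R)}

def grounded(A, R):
    S = set()
    while F(S, A, R) != S:
        S = F(S, A, R)
    return S

# The framework of Figure 1:
A = {"A", "B", "C", "D", "E"}
R = {("B", "A"), ("C", "B"), ("D", "B"), ("E", "C")}
print(conflict_free({"A", "D", "E"}, R))  # True
print(sorted(grounded(A, R)))             # ['A', 'D', 'E']
```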
Below is an example of a discussion on whether kids should attend school during the summer, which clarifies the formal concepts presented so far. Consider the argumentation framework AF = (AR, attacks) with AR = {A, B, C, D, E} and attacks = {(B, A), (C, B), (D, B), (E, C)} (Figure 1):
Argument A: Kids should be in school year-round, even in the summer.
Argument B: Kids should not have a year-round school and should spend summers off.
Argument C: Kids should not spend summer off. They should participate in educational activities like summer camp.
Argument D: Kids should spend part of the summer off and part of the summer in academic programs.
Argument E: Kids should have summers entirely off, but schools should provide voluntary summer programs for those who need or want them.
In Figure 1, Argument A has a counterargument presented by Argument B, which on its own makes Argument A unacceptable. However, Argument B is attacked by Arguments C and D, Argument E attacks Argument C, and Arguments D and E have no arguments against them. Since D is unattacked, it is accepted and defeats B; with B defeated, it is no longer a reason against A, and A is reinstated. Likewise, E is accepted and C is rejected. Hence, Arguments A, D, and E should be accepted, while B and C are rejected. This example emphasizes the concept of reinstatement in abstract argumentation [
52] and anticipates the issues that exist behind it [
53]. These issues led to the design of other semantics not based on the concept of extension [
54,
55]. Many types of argumentation semantics exist. One family is ranking-based semantics, which rank each argument based on the attacks it receives [
56]. Ranking-based semantics evaluate arguments based on their strength and acceptability [
54]. In 2001, Besnard and Hunter proposed a categorizer function that assigns a value to each argument based on the values of its attackers [
57]. Let AF = (A, R) be an AF and, for a ∈ A, let Att(a) = {b ∈ A : (b, a) ∈ R}. The categorizer function Cat : A → (0, 1] is defined as follows: Cat(a) = 1 if Att(a) = ∅; otherwise, Cat(a) = 1/(1 + Σ_{b ∈ Att(a)} Cat(b)). The categorizer semantics associates to any AF = (A, R) a ranking ⪰ on A such that a ⪰ b if and only if Cat(a) ≥ Cat(b). This approach assigns strength values to arguments and evaluates them on a graded scale, providing a more detailed assessment than traditional extension-based semantics and opening new possibilities for understanding and evaluating arguments. The more attacks an argument receives, the lower its ranking; the fewer attacks, the higher its ranking [
54].
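To illustrate (this is our sketch, not the authors’ implementation), the categorizer values can be computed by simple fixed-point iteration; on the Figure 1 framework, the unattacked arguments D and E obtain the maximal value 1, while the twice-attacked B obtains the lowest value:

```python
# Sketch: categorizer values computed by fixed-point iteration.
# Cat(a) = 1 / (1 + sum of Cat(b) over the attackers b of a);
# unattacked arguments therefore get the maximal value 1.

def categorizer(A, R, tol=1e-9):
    attackers = {a: {b for (b, c) in R if c == a} for a in A}
    cat = {a: 1.0 for a in A}
    while True:
        new = {a: 1.0 / (1.0 + sum(cat[b] for b in attackers[a])) for a in A}
        if max(abs(new[a] - cat[a]) for a in A) < tol:
            return new
        cat = new

A = {"A", "B", "C", "D", "E"}
R = {("B", "A"), ("C", "B"), ("D", "B"), ("E", "C")}
for arg, value in sorted(categorizer(A, R).items(), key=lambda kv: -kv[1]):
    print(arg, round(value, 3))  # D, E: 1.0; A: 0.714; C: 0.5; B: 0.4
```

The resulting ranking (D and E above A, then C, then B) matches the intuition stated above: the more heavily attacked an argument is, the lower it ranks.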
2.3. Gaps in the Literature
The DQI relies heavily on labor-intensive manual annotation, requiring expert coders to label speech acts for indicators such as justification or respect [
1]. While semi-automated frameworks like DQI 2.0 [
3] and AQuA [
18] leverage machine learning, they inherit the DQI’s dependency on high-quality labeled datasets and political analysts’ contextual expertise. This dependency limits scalability, particularly in large-scale or asynchronous deliberations such as social media debates, where manual coding becomes impractical. Such constraints highlight the urgent need for computational methods that reduce human bias and automate the structural analysis of argumentation. Traditional DQI metrics adopt static aggregation, averaging scores across speech acts without capturing evolving argument dynamics [
17]. This overlooks non-monotonic reasoning processes, such as argument retractions, which are central to real-world deliberation. For instance, the DQI fails to model how new evidence might negate prior claims, a gap underscored by critiques of its rigidity in asynchronous debates [
16]. Computational argumentation frameworks, for example, Dung’s AAF [
8], address this by dynamically resolving conflicts through models of defeasible reasoning, enabling updates to argument acceptability. Such frameworks are often modular, using abstract or structured arguments and defining a notion of attack to model their interactions and form a knowledge base [
58]. Such a knowledge base can be elicited, and its conflicts and inconsistencies resolved, producing a dialectical status for each argument through defeasible principles that mirror provisional human reasoning; eventually, these statuses can be aggregated if a unique inference must be produced.
The DQI’s focus on individual speech acts obscures systemic power imbalances in deliberative settings. As demonstrated in Ireland’s Constitutional Convention, politicians’ arguments received disproportionate uptake compared to citizens’ [
16], yet DQI metrics lack the tools to quantify such disparities. Similarly, indices like the DTM [
4] reveal that the DQI neglects emotional or narrative dimensions that are critical in post-conflict contexts. The AAF’s topic-agnostic conflict detection and resolution offers a formal mechanism to evaluate argument interactions independently of content, mitigating biases tied to participant status or cultural context. The DQI’s reliance on predefined indicators, such as ‘common good’ justifications, assumes universal applicability, yet its Eurocentric design struggles in culturally heterogeneous deliberations [
17]. For example, ‘respect’ in online debates may align with platform-specific norms invisible to manual coders [
25]. The AAF bypasses this by abstracting arguments into nodes and attacks, decoupling evaluation from semantic content. This formal approach enables cross-context analysis while preserving Habermasian ideals of rationality [
10]. Existing deliberative metrics operate in disciplinary silos: the DQI prioritizes procedural norms from political theory, while computational tools like AQuA focus on technical scalability [
18]. Few studies bridge these paradigms, leaving a vacuum in frameworks that balance qualitative depth with computational rigor. By integrating the AAF’s ranking-based semantics [
54] and extension-based semantics [
8] with the DQI’s established indicators, this research proposes a hybrid model automating conflict resolution while retaining contextual nuance, thus advancing deliberative theory into computationally controllable domains. A critical yet underexplored limitation of the DQI is its inability to provide real-time feedback during live deliberations. Traditional DQI application is retrospective, analyzing discourse post hoc rather than offering actionable insights to participants or moderators during the process [
17]. This gap is particularly acute in digital platforms like Reddit or real-time citizen assemblies, where dynamic adjustments prompting users to justify claims or address counterarguments could enhance deliberative quality iteratively. Recent studies note that static metrics fail to guide in situ improvements, risking stagnation in polarized debates [
25]. Integrating the AAF’s computational semantics, which are particularly suited for defeasible reasoning [
8], could enable updates to argument acceptability as new evidence emerges. For instance, rebuttals or undercutting attacks could dynamically alter argument rankings, which could be visualized as conflict hot spots for moderators. This non-monotonic logic, central to defeasible reasoning, allows conclusions to be retracted or revised during deliberation and could be applied to real-world dialectics [
26].