Article

Mind the Link: Discourse Link-Aware Hallucination Detection in Summarization

1 Department of Intelligence and Convergence, Hanyang University, Seoul 04763, Republic of Korea
2 Meta, Ads Privacy ML Infra, New York, NY 10003, USA
3 Department of Computer Science, Hanyang University, Seoul 04763, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10506; https://doi.org/10.3390/app151910506
Submission received: 9 September 2025 / Revised: 22 September 2025 / Accepted: 25 September 2025 / Published: 28 September 2025

Abstract

Recent studies on detecting hallucinations in summaries follow a method of decomposing summaries into atomic content units (ACUs) and then determining, based on natural language inference, whether each unit logically matches the document text. However, this approach fails to consider discourse link relations such as temporal order, causality, and purpose, and therefore cannot detect cases where the semantic connections between individual summary ACUs conflict with the document. To overcome this limitation, this study proposes a method of extracting Discourse Link-Aware Content Units (DL-ACUs) by converting the summary into an Abstract Meaning Representation (AMR) graph and structuring the discourse link relations between ACUs. Additionally, to align summary ACUs with corresponding document information in a fine-grained manner, we propose the Selective Document-Atomic Content Unit (SD-ACU). For each summary ACU, the SD-ACU retrieves only the most relevant document sentences and then decomposes them into document ACUs. Applying the DL-ACU module to existing hallucination detection systems such as FIZZ and FENICE reduces the discourse link error rate on FRANK. When both modules are combined, the system improves balanced accuracy and ROC-AUC across major benchmarks. This suggests the proposed method effectively captures discourse link errors while enabling ACU-to-ACU alignment.

1. Introduction

Recently, in NLI-based hallucination detection in summarization systems, summaries are divided into Atomic Content Units (ACUs) [1,2,3,4] and then each fact is verified for consistency with the original document. However, such systems do not consider discourse link information [5] such as temporal order, causality, and purpose between summary ACUs, leaving discourse link errors (LinkE) undetected. In this study, we define a discourse link not only as a relation between sentences but, more comprehensively, as a semantic connection between ACUs. We observe that even though each ACU is individually in an entailment relationship with sentences in the document, false positive predictions occur when the temporal order, causality, or purpose between them does not align with the document. For example, consider a document stating that “Pfizer announced the results of its phase 3 trial on Monday…The FDA granted emergency use authorization three days later”. A summary that incorrectly claims “The FDA granted emergency use authorization before Pfizer announced its trial results” would still appear consistent when its individual ACUs (“The FDA granted emergency use authorization” and “Pfizer announced its trial results”) are evaluated separately. However, our Discourse Link-Aware Content Unit, which preserves the temporal relation (“The FDA granted authorization before Pfizer announced its results”), correctly identifies the error. This error does not stem from a limitation of the NLI model. Rather, it arises because the input ACUs do not capture discourse link information. Explicitly modeling these links between ACUs, rather than only verifying isolated units, is therefore essential for reliable factuality evaluation. Prior work underscores this importance. FRANK [5] formalizes the discourse link error and quantifies its prevalence, while FactFT [6] and AMRFact [7] inject temporal and causal perturbations to train stronger evaluators.
In hallucination detection for summarization, a common approach is to decompose the document and the summary and then assess whether document chunks match summary chunks [2,3,4,8,9]. Concretely, consistency is assessed by pairing a summary unit with a candidate document unit and feeding the pair to an NLI model to obtain an entailment decision. However, most prior work extracts ACUs only from the summary while keeping the document at the sentence level. Applying ACU extraction to only one side induces a granularity mismatch that manifests in two cases. First, when the summary is decomposed into ACUs but the document remains at the sentence level, a single document sentence may contain multiple events or facts, making it difficult to find a unit that exactly corresponds to an atomic statement in the summary. Second, an alternative line of research [10] decomposes the premise but not the hypothesis, which also leads to a granularity mismatch. Therefore, a truly precise evaluation requires decomposing both the summary and the document to an atomic level to enable ACU-to-ACU alignment. While ideal, decomposing the entire document to the same extent as the summary would dramatically increase the computational cost of ACU generation models [11,12] and the total number of NLI comparisons required.
To address these two challenges, we propose two complementary modules. An overview of the two pipelines is shown in Figure 1. First, we introduce a new unit called the Discourse Link-Aware Content Unit (DL-ACU), which captures the discourse link relation between ACUs. This unit is constructed by converting summary sentences into AMR graphs [13] and extracting subgraphs centered on discourse link relation markers such as time, cause, and purpose. These subgraphs contain discourse link information between different ACUs, which is then converted to text and filtered using NLI-based filtering to select only high-quality DL-ACUs. DL-ACU can judge the entailment relationship with the document while preserving discourse link information within the summary. Second, we propose the Selective Document-Atomic Content Unit (SD-ACU) to address granularity mismatch and computational cost. For each summary ACU, we retrieve the top-k most relevant document sentences using an entailment score and decompose only those sentences into document-side ACUs. This selective decomposition filters out extraneous information (like attributions or modifiers) from the document sentence, enabling a more precise ACU-to-ACU alignment. For instance, a summary ACU like “the rebels attacked civilians” might initially appear to have weak support when compared against a full document sentence cluttered with attribution, such as “Human-rights monitors stated that rebel groups shot at bystanders…”. By decomposing this source sentence, our method allows for a direct comparison with the relevant document-side ACU, “rebel groups shot at bystanders”, which provides a much clearer and more accurate entailment match. This strategy not only improves ACU-to-ACU alignment but also reduces the number of NLI calls by approximately 30% compared to fully decomposing the document in the FENICE baseline, and reduces Language Model (LM) calls for decomposition by 83% on average in the AggreFact-FtSota benchmark, all while improving balanced accuracy over sentence-level premises. As early quantitative evidence, our best variant improves average balanced accuracy on AggreFact-FtSota by 1.39 percentage points over FIZZ and by 0.99 points over FENICE; on SummaC the gains are 1.78 and 1.36 points, respectively. On DiverSumm, ROC-AUC increases by 0.79 and 2.66 points for FIZZ and FENICE, and on FaithBench by 2.07 and 1.95 points.
Consistent with these gains, the error analysis shows that DL-ACU reduces the LinkE rate by 6.81 percentage points for FIZZ and by 11.36 points for FENICE relative to their family baselines.

2. Related Works

2.1. Hallucination Detection in Summarization

Recent Large Language Models (LLMs) demonstrate outstanding performance in summary generation, but they still suffer seriously from hallucination, that is, including information that does not exist in the document, so that summary sentences are not consistent with the document. Approaches to evaluating the factuality of summaries can be broadly categorized into NLI-based, QA-based, and LLM-based methodologies.
NLI-based methodologies treat the summary as a hypothesis and the document as a premise to evaluate factuality. They generally follow one of two paradigms. The first, and more common, paradigm involves decomposing the summary and document and using a pre-trained NLI model to verify the resulting chunks [2,3,4,8,9]. The second paradigm focuses on synthesizing negative (hallucinated) samples to fine-tune a specialized NLI model for the task [6,7,14,15]. Systems following the first paradigm, such as SummaC-ZS [8], AlignScore [9], InFusE [4], FIZZ [2], and FENICE [3], typically follow a three-step procedure, the general flow of which is illustrated on the left side of Figure 1. This process involves (1) decomposing the document and summary into chunks; (2) for each summary chunk, calculating NLI scores against all document chunks and selecting the highest score as the evidence; and (3) merging the scores for each summary chunk using an average (soft aggregation) or minimum (hard aggregation) to compute a final score. For instance, SummaC-ZS, AlignScore, InFusE, and FENICE use soft aggregation, while FIZZ uses hard aggregation. While effective, these methodologies uniformly decompose the document only at the sentence level, creating a granularity mismatch when the summary is broken into finer-grained ACUs. By contrast, the second paradigm, including models like FactCC [15], FactFT [6], and AMRFact [7], synthesizes hallucinated summaries as negative samples to train NLI models. In particular, FactFT and AMRFact curate negatives that instantiate challenging error types like LinkE and CorefE [5]. However, because these approaches do not decompose the inputs, they offer limited fine-grained error analysis and can suffer from input truncation on long-document benchmarks like DiverSumm [4] and FaithBench [16].
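To make the aggregation step concrete, the following minimal Python sketch scores a summary against a document from a precomputed matrix of chunk-level entailment probabilities; the matrix values and function names are illustrative and not taken from any of the cited systems.

```python
import numpy as np

def aggregate_scores(nli_scores, mode="soft"):
    """Aggregate a (num_summary_chunks x num_document_chunks) matrix of
    entailment probabilities into a single factuality score.

    For each summary chunk, the best-supporting document chunk is kept (max);
    the per-chunk maxima are then combined by mean (soft) or min (hard).
    This is a simplified sketch of the three-step procedure described above,
    not the exact implementation of any cited system.
    """
    scores = np.asarray(nli_scores)
    per_chunk_evidence = scores.max(axis=1)   # step 2: best evidence per summary chunk
    if mode == "soft":                        # e.g., SummaC-ZS, AlignScore, InFusE, FENICE
        return float(per_chunk_evidence.mean())
    if mode == "hard":                        # e.g., FIZZ
        return float(per_chunk_evidence.min())
    raise ValueError("mode must be 'soft' or 'hard'")

# Example: 3 summary chunks scored against 4 document chunks.
matrix = [[0.91, 0.10, 0.22, 0.05],
          [0.15, 0.88, 0.30, 0.12],
          [0.08, 0.12, 0.20, 0.18]]            # last chunk is poorly supported
print(aggregate_scores(matrix, "soft"))        # ~0.66
print(aggregate_scores(matrix, "hard"))        # 0.20 -> hard aggregation flags the weak chunk
```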
Beyond NLI, QA-based methods (e.g., QuestEval [17], QAFactEval [18]) and LLM-based evaluators (e.g., G-Eval [19], ACUEval [20]) have also been explored. Graph-based methods like FactGraph [21] evaluate semantic overlap using AMR graphs, while DAE [22] performs arc-level entailment over dependency graphs.
While many methods exist, prior work typically omits discourse link relations and compares fine-grained summaries with sentence-level documents, causing a granularity mismatch. Addressing these two issues can improve evidence selection and reduce computation in NLI-based verification, especially for long documents.

2.2. Atomic Content Unit Decomposition

An Atomic Content Unit (ACU) is a verifiable unit of information that cannot be further subdivided and is regarded as the minimal semantic unit for precise factuality assessment of summaries [1,23]. Leveraging ACUs enables localized detection of non-factual spans within a summary; DAE [22] subsequently demonstrated that such decomposition effectively identifies which part of the generation is non-factual.
Early approaches primarily operated at the sentence level. SummaC [8] and AlignScore [9] decompose summaries into sentences. More recent studies pursue finer evaluation by decomposing summaries at the ACU level. For instance, InFusE [4], FENICE [3], and FIZZ [2] employ generative models such as T5-base [12] and Orca-2 [11] to extract ACUs from summaries, following the definition proposed in ROSE [1]. ROSE defines an ACU as the smallest verifiable unit; FENICE describes an ACU as an elementary information unit found in the summary, and FIZZ characterizes an ACU as a short and concise information unit. Fundamentally, the definitions from FIZZ and FENICE align with that of ROSE, and both studies use the ROSE benchmark to evaluate the ACU decomposition capabilities of their respective models (T5-base for FENICE and ORCA-2 for FIZZ).
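As an illustration of prompt-based ACU extraction, the sketch below queries an off-the-shelf instruction-tuned model with a hypothetical decomposition prompt; FIZZ and FENICE use their own fine-tuned ORCA-2 and T5-based decomposers, so both the model checkpoint and the prompt here are stand-ins rather than the actual setups.

```python
from transformers import pipeline

# Minimal sketch of prompt-based ACU decomposition. The model name and prompt are
# illustrative placeholders, not the checkpoints or prompts used by FIZZ or FENICE.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

summary_sentence = ("Pfizer announced the results of its phase 3 trial on Monday, "
                    "and the FDA granted emergency use authorization three days later.")
prompt = ("Break the following sentence into short, independently verifiable facts, "
          "one per line:\n" + summary_sentence)

output = generator(prompt, max_new_tokens=128)[0]["generated_text"]
acus = [line.strip() for line in output.split("\n") if line.strip()]
print(acus)  # e.g., ["Pfizer announced trial results on Monday.", "The FDA granted ..."]
```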

2.3. AMR Graph

Abstract Meaning Representation (AMR) is a predicate-centered semantic graph formalism introduced in early work [13]. It abstracts sentence meaning to capture deep semantic structure. AMR graphs are commonly rendered in PENMAN notation [24] and encode relations among predicates, core roles (:ARG0, :ARG1, etc.), and non-core roles (:time, :location, etc.). For instance, “He studied to pass the exam” expresses purpose via the :purpose relation, while “Alice gave a present to Bob on Monday” makes the action, participants, and time explicit via the :ARG0, :ARG1, :ARG2, and :time relations. Such representations are advantageous for modeling semantic structures that simple surface cues cannot capture. The performance of sequence-to-graph (STOG) and graph-to-sequence (GTOS) models has improved dramatically with the advent of Transformer-based architectures [12,25], enabling their use in a wide range of downstream AMR applications [26].
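The sketch below shows the second example as a hand-written AMR in PENMAN notation and reads off its relation triples with the penman library; the graph is an illustrative parse rather than the output of a specific STOG model.

```python
import penman  # pip install penman

# A hand-written AMR for "Alice gave a present to Bob on Monday" in PENMAN notation.
amr = """
(g / give-01
   :ARG0 (p / person :name (n / name :op1 "Alice"))
   :ARG1 (p2 / present)
   :ARG2 (p3 / person :name (n2 / name :op1 "Bob"))
   :time (d / date-entity :weekday (m / monday)))
"""

graph = penman.decode(amr)
for source, role, target in graph.triples:
    if role != ":instance":          # skip node-label triples, keep relation edges
        print(source, role, target)
# Prints (g :ARG0 p), (g :ARG1 p2), (g :ARG2 p3), (g :time d), plus :name/:op edges.
```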
Our ACU decomposition follows a recent baseline [27]. The baseline first parses each sentence into an AMR graph using a sequence-to-graph model. From the graph, it selects subgraphs that connect a predicate with its core roles. Each subgraph is then turned into an ACU by the graph-to-sequence model.
We extend this pipeline to capture discourse link information. We form subgraphs that make the links explicit, such as temporal order (:time), causality (:cause), and purpose (:purpose). From these, we extract the Discourse Link-Aware Content Unit (DL-ACU). This helps reveal discourse link errors that recent ACU-based verifications [2,3,4] often miss.

3. Methods

As illustrated in Figure 1, the right panel (Steps 1–4) shows the Discourse Link-Aware Content Unit (DL-ACU) module, and the left panel (Steps 5 and 6) shows the Selective Document-ACU (SD-ACU) module. Step 7 performs NLI scoring and aggregation shared by both families. We explicitly reference these numbered steps below to align the text with the figure.

3.1. Discourse Link-Aware Content Unit Decomposition Using AMR Graphs

When a summary S is given, we aim to obtain the set F of ACUs contained in it. In particular, our objective is to account for discourse links that prior ACU decomposition [2,3,4] methods have overlooked. Our method consists of five steps. An overview of this process is visualized on the right side of Figure 1.

3.1.1. Subgraph Extraction Based on Discourse Link

As shown in Step 1 of Figure 1, we first build an AMR graph for the entire summary using a sequence-to-graph (STOG) model.
G = STOG(S)
Let this AMR be a labeled directed graph G = (V, E, ℓ). In this structure, V is the set of vertices (or nodes), which represent the core concepts and entities in the summary (e.g., actions like attack-01 or entities like rebel). E is the set of edges, representing all the connections between the nodes in V. An edge is a triple that defines a directed link: (source_node, relation_label, target_node); for instance, (attack-01, :ARG0, rebel) is one complete edge in the set E. Finally, ℓ is the labeling function, which assigns a name to each node, such that ℓ(v) is the node label.
We define the set of predicate nodes (e.g., impose-01, attack, capture).
P = { v ∈ V | isPredicate(ℓ(v)) }
and the set of discourse link relations.
L_dl = { :cause, :purpose, :time }
In Step 2 of Figure 1, we extract discourse link triples between predicates.
T = { (h, r, t) ∈ E | h, t ∈ P, r ∈ L_dl }
In Figure 1, for example, (impose-01, :cause, attack) and (attack, :time, capture) are such triples. For each τ = (h, r, t) ∈ T, we build a subgraph by keeping the discourse link information and expanding the local predicate–argument structure around its endpoints.
G_τ = { (h, r, t) } ∪ ARG_NEIGH(h) ∪ ARG_NEIGH(t)
where the one-hop argument neighborhood is defined as the set of outgoing edges from a node whose relations are in a predefined set A .
ARG_NEIGH(v) = { (v, a, u) ∈ E | a ∈ A }
Here, the rule set A is configured to retain core semantic roles:
A = { :ARG0, :ARG1, :ARG2, :ARG3, :ARG4, :ARG5 }
We exclude name, metadata and list-internal edges via
L_skip = { :wiki, :name, :op }
The :ARGx relations capture “who did what to whom”; they designate the core roles of an event and are therefore preserved as essential semantics. In contrast, :wiki, :name, and :op are representational metadata that do not affect propositional content and only introduce noise. For the discourse link triple (impose-01, :cause, attack), this yields a subgraph that retains impose-01 with :ARG0 = government and :ARG1 = sanction, the discourse link triple itself, and attack with :ARG0 = rebel and :ARG1 = civilian, matching the subgraphs in Figure 1.
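A minimal Python sketch of this subgraph construction is given below; it operates on a toy list of edge triples and node labels matching the Figure 1 example, and the predicate test is a simplified heuristic rather than the exact detector used in our pipeline.

```python
# Sketch of the subgraph construction in Section 3.1.1, operating on the edge
# triples of an AMR graph (e.g., penman.Graph.edges()). Node labels stand in for
# the labeling function ℓ; isPredicate is approximated by a sense-suffix heuristic.

DISCOURSE_LINKS = {":cause", ":purpose", ":time"}
CORE_ROLES = {":ARG0", ":ARG1", ":ARG2", ":ARG3", ":ARG4", ":ARG5"}
SKIP_ROLES = {":wiki", ":name", ":op"}  # representational metadata, excluded

def is_predicate(label):
    # AMR predicates carry a sense suffix such as "impose-01"; a simple heuristic.
    return label.rsplit("-", 1)[-1].isdigit()

def arg_neighborhood(node, edges):
    """One-hop outgoing edges whose role is a retained core role (and not skipped)."""
    return {(h, r, t) for (h, r, t) in edges
            if h == node and r in CORE_ROLES and r not in SKIP_ROLES}

def discourse_subgraphs(edges, labels):
    """Yield one subgraph (set of triples) per discourse link between predicates."""
    for (h, r, t) in edges:
        if r in DISCOURSE_LINKS and is_predicate(labels[h]) and is_predicate(labels[t]):
            yield {(h, r, t)} | arg_neighborhood(h, edges) | arg_neighborhood(t, edges)

# Toy example: "The government imposed sanctions because rebels attacked civilians."
labels = {"i": "impose-01", "a": "attack-01", "g": "government",
          "s": "sanction", "r": "rebel", "c": "civilian"}
edges = [("i", ":ARG0", "g"), ("i", ":ARG1", "s"), ("i", ":cause", "a"),
         ("a", ":ARG0", "r"), ("a", ":ARG1", "c")]
for sub in discourse_subgraphs(edges, labels):
    print(sorted(sub))  # one subgraph containing the :cause link and both argument sets
```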

3.1.2. Post-Processing

In Step 3 of Figure 1, each subgraph G_τ is first transformed into a candidate atomic content unit text â_t using a graph-to-sequence (GTOS) model.
â_t = GTOS(G_τ)
In Step 4 of Figure 1, the resulting candidate texts â_t undergo a three-step post-processing pipeline to ensure quality and relevance. First, a Coreference Resolution step is applied, where pronouns or other referential expressions in â_t are resolved in the context of the full summary S to obtain the final expression a_t. Second, to ensure factual consistency with the summary, we apply NLI-based Factuality Filtering, keeping only those units a_t whose entailment score meets or exceeds a threshold τ.
F_DL = { a_t | f_NLI_ENTAIL(premise = S, hypothesis = a_t) ≥ τ }
Finally, an Overlap Filtering and Merging step is performed to avoid redundancy with the ACUs already produced by a baseline system (F_base). A new DL-ACU is merged only if its BERTScore similarity to every existing baseline ACU is below a threshold δ.
F = F_base ∪ { a_t ∈ F_DL | ∀ a_b ∈ F_base, BERTScore(a_t, a_b) < δ }
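The following sketch illustrates these two filtering steps on candidate texts that have already been coreference-resolved; entailment_prob is a placeholder for the entailment score of the host system's NLI model (Section 4.1 uses tals/albert-xlarge-vitaminc-mnli), and the default thresholds mirror the values used in our experiments.

```python
from bert_score import score as bertscore  # pip install bert-score

# Sketch of the NLI filtering (threshold tau) and BERTScore-based overlap removal
# (threshold delta) in Section 3.1.2. `entailment_prob(premise, hypothesis)` is a
# placeholder callable, not a specific library API.

def filter_dl_acus(summary, candidates, baseline_acus, entailment_prob,
                   tau=0.4, delta=0.7):
    # Keep only candidates that are entailed by the summary itself.
    entailed = [a for a in candidates if entailment_prob(summary, a) >= tau]
    if not entailed or not baseline_acus:
        return baseline_acus + entailed
    # Drop candidates that are near-duplicates of an existing baseline ACU.
    kept = []
    for a in entailed:
        # Compare one candidate against every baseline ACU (bert-score downloads
        # a default English model on first use).
        _, _, f1 = bertscore([a] * len(baseline_acus), baseline_acus, lang="en")
        if float(f1.max()) < delta:
            kept.append(a)
    return baseline_acus + kept
```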
For completeness, we summarize the entire DL-ACU pipeline in Algorithm 1, aligning the procedural flow with Figure 1 and making the subgraph extraction and post-processing reproducible.
Algorithm 1 DL-ACU: Discourse Link-Aware ACU Decomposition
Require: Summary S; STOG; GTOS; NLI scorer f_NLI_ENTAIL; coreference resolver Coref; thresholds τ, δ
Require: AMR role sets: predicate detector isPredicate(·); discourse labels L_dl = { :cause, :purpose, :time }; rule set A; skip set L_skip = { :wiki, :name, :op }
Require: Baseline ACUs F_base (from the host hallucination detection system)
 1: G = (V, E, ℓ) ← STOG(S)
 2: P ← { v ∈ V | isPredicate(ℓ(v)) }
 3: T ← { (h, r, t) ∈ E | h, t ∈ P, r ∈ L_dl }
 4: function ARG_NEIGH(v)
 5:   N ← ∅
 6:   for all (v, a, u) ∈ E do
 7:     if a ∈ A and a ∉ L_skip then
 8:       N ← N ∪ { (v, a, u) }
 9:     end if
10:   end for
11:   return N
12: end function
13: F̂ ← ∅
14: for all τ = (h, r, t) ∈ T do
15:   G_τ ← { (h, r, t) } ∪ ARG_NEIGH(h) ∪ ARG_NEIGH(t)
16:   â ← GTOS(G_τ)
17:   a ← Coref(â, context = S)
18:   if f_NLI_ENTAIL(premise = S, hypothesis = a) ≥ τ then
19:     F̂ ← F̂ ∪ { a }
20:   end if
21: end for
22: F_DL ← { a ∈ F̂ | ∀ a_b ∈ F_base, BERTScore(a, a_b) < δ }
23: return F ← F_base ∪ F_DL

3.1.3. Discussion and Practical Considerations

Through this process, we can extract information that explicitly reflects the discourse link between facts in addition to existing atomic content unit information. This is thanks to the structural characteristics of the AMR graph. For example, time expressions such as ‘before’ and ‘after’ are recognized as :time, and ambiguous expressions such as ‘to’ are recognized as :purpose, allowing for more accurate decomposition of semantic connections than simple text-matching-based techniques.
Several practical considerations are also addressed in our pipeline. The generated text expressions often include pronouns or demonstrative expressions, which must be appropriately replaced with proper nouns; in this study, we use the Maverick coreference resolution model [28]. Furthermore, when the performance of the STOG or GTOS models is low, the generated ACUs may be logically inconsistent, a correctness issue noted in WiCE [29]. To prevent such errors, our post-processing step (Section 3.1.2) filters the extracted ACUs with an NLI model to ensure they are entailed by the original summary. Additionally, to avoid redundancy between DL-ACUs and ACUs already generated by a baseline system, we compute BERTScore-based [30] similarity and filter out duplicates. This is particularly important for soft aggregation systems such as FENICE [3], whereas hard aggregation systems such as FIZZ are more robust to such duplication.

3.2. Selective Document-Atomic Content Unit Decomposition

When verifying summaries, existing NLI-based hallucination detection systems [2,3,4] decompose summaries into ACUs for verification, but documents are usually compared only at the sentence level. This makes it difficult to achieve fine-grained alignment. This study decomposes sentences in the document into ACUs, enabling fine-grained alignment between the ACUs in the summary and document. Our method consists of three steps. An overview of this process is shown on the left side of Figure 1.

3.2.1. Entailment-Based Selection

Let the input document and summary be
D = { d_1, d_2, …, d_n },  S = { s_1, s_2, …, s_m }
We first obtain summary-side atomic content units using an LM [11,12] (as in the FIZZ and FENICE baselines).
A = ⋃_{k=1}^{m} LM(s_k) = { a_1, a_2, …, a_t }
As shown in Figure 1, we perform NLI scoring between document sentences and summary ACUs. For each summary atomic content unit a i A and each document sentence d j D , we compute an entailment probability using an NLI model by treating d j as the premise and a i as the hypothesis.
Ent(d_j, a_i) = f_NLI_ENTAIL(premise = d_j, hypothesis = a_i) ∈ [0, 1]
These initial entailment scores are cached for reuse in the NLI scoring phase. We then select the top-k sentences with the highest scores.
D_i = Top-k{ d_j ∈ D | Ent(d_j, a_i) }
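A minimal sketch of this selection step is shown below; entailment_prob again stands in for the NLI model's entailment score (premise = document sentence, hypothesis = summary ACU), and the score cache mirrors the reuse described above.

```python
import heapq

# Sketch of entailment-based selection (Section 3.2.1). K = 1 in the main experiments.

def select_sentences(doc_sentences, summary_acus, entailment_prob, k=1):
    """Return, for each summary ACU, the top-k document sentences and a cache
    of all entailment scores for reuse in the final scoring phase."""
    cache = {}
    selected = {}
    for acu in summary_acus:
        scored = []
        for sent in doc_sentences:
            p = entailment_prob(sent, acu)
            cache[(sent, acu)] = p           # reused later, no recomputation
            scored.append((p, sent))
        selected[acu] = [s for _, s in heapq.nlargest(k, scored)]
    return selected, cache
```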

3.2.2. Atomic Content Unit Decomposition

In Step 5 of Figure 1, for each selected sentence d u D i , we extract document side atomic content units with the language model (LM). In Step 6 of Figure 1, we construct the Document-Atomic Content Unit set.
D_chunked = ⋃_{i=1}^{t} { LM(d_u) | d_u ∈ D_i }
Finally, in Step 7 of Figure 1, we perform NLI scoring and aggregation between the document-ACU set and the summary-ACU set, following the baseline systems (hard aggregation in FIZZ; soft aggregation in FENICE).
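The sketch below ties Steps 5–7 together for illustration; decompose and entailment_prob are placeholders for the host system's LM decomposer and NLI model, and the aggregation switch mirrors the hard (FIZZ) and soft (FENICE) variants.

```python
# Sketch of Steps 5-7 in Section 3.2: decompose only the selected document sentences
# into document-side ACUs, then score each summary ACU against them and aggregate.

def sd_acu_score(selected, decompose, entailment_prob, aggregation="hard"):
    per_acu = []
    for summary_acu, sentences in selected.items():
        doc_acus = [u for s in sentences for u in decompose(s)]        # Steps 5-6
        best = max((entailment_prob(d, summary_acu) for d in doc_acus),
                   default=0.0)                                        # Step 7: best evidence
        per_acu.append(best)
    if not per_acu:
        return 0.0
    if aggregation == "hard":          # FIZZ-style minimum over summary ACUs
        return min(per_acu)
    return sum(per_acu) / len(per_acu)  # FENICE-style mean over summary ACUs
```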

3.2.3. Discussion and Practical Considerations

This method enables ACU-to-ACU alignment at the document level, allowing for precise comparison with the summary. In particular, it improves interpretability by identifying the facts in the document that are the basis for errors in the summary. Furthermore, decomposing all sentences in the entire document unnecessarily increases computational cost. Therefore, rather than decomposing the entire document, we selectively decompose only the document sentences most relevant for verifying each summary ACU. This reduces the cost of generating ACU with LM and the cost of calculating the entailment score between document chunks and summary chunks with the NLI model.

4. Experiments

4.1. Experimental Setup

We evaluate two modules for NLI-based hallucination detection in summarization: a Discourse Link-Aware Content Unit module and a Selective Document-Atomic Content Unit module. To assess the contribution of each component, the modules are integrated into recent baselines, FIZZ [2] and FENICE [3], both individually and in combination. All experiments are run on a single NVIDIA GeForce RTX 3090 GPU, and the implementation is publicly available at https://github.com/leadawon/DLSD-ACU (accessed on 6 September 2025). For the Discourse Link-Aware Content Unit module, AMR parsing (sequence-to-graph, STOG) and generation (graph-to-sequence, GTOS) rely on publicly available models at https://github.com/bjascob/amrlib-models (accessed on 15 March 2025). For entailment scoring in both modules, we adopt the same off-the-shelf NLI model [31] used by FIZZ, namely tals/albert-xlarge-vitaminc-mnli.
We report balanced accuracy and ROC-AUC depending on the benchmark. For FRANK [5] in the SummaC benchmark [8], we additionally measure the discourse link error rate following prior work FactFT [6], which enables a focused assessment of discourse link error. This metric is calculated as the proportion of false positives within a given error category.
Error Rate = FP / N,
where FP is the number of false-positive predictions and N is the total number of examples belonging to that specific error category. Lower values indicate better performance.
In our main experiments, we set the NLI filtering threshold to τ = 0.4 to decide whether a summary ACU is entailed by the original text, and we apply BERTScore-based overlap removal with a threshold δ = 0.7 to reduce semantic redundancy. For top-k retrieval on the document side, we use K = 1, selecting the single most relevant sentence per summary ACU before document-side ACU decomposition. We find performance to be relatively insensitive to small variations around these values. Concretely, sweeping τ by ±0.2 around 0.4, δ by ±0.1 around 0.7, and K between 1 and 3 yields only marginal changes in accuracy. For coreference resolution, we use sapienzanlp/maverick-mes-ontonotes [28] (author-reported OntoNotes 5.0 test CoNLL F1 [32] of 83.6). We adopt the released weights and tokenizer without task-specific fine-tuning.
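For reference, the metrics described above can be computed with scikit-learn as in the toy example below; the labels, scores, and threshold are illustrative and unrelated to the reported results.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy illustration of the evaluation metrics in Section 4.1. A detector outputs a
# continuous factuality score per (document, summary) pair; balanced accuracy needs
# a threshold (tuned on a validation split), while ROC-AUC is threshold-free.
labels = [1, 1, 0, 0, 1, 0]                       # 1 = factually consistent
scores = [0.92, 0.80, 0.35, 0.85, 0.75, 0.20]
threshold = 0.5
preds = [int(s >= threshold) for s in scores]

print(balanced_accuracy_score(labels, preds))     # 0.83
print(roc_auc_score(labels, scores))              # ~0.78

# Error rate for one FRANK error category (e.g., LinkE): false positives / category size.
category_labels = [0, 0, 0]                       # all inconsistent, all exhibit LinkE
category_preds = [1, 0, 1]                        # detector wrongly accepts two of them
error_rate = sum(p == 1 and y == 0
                 for p, y in zip(category_preds, category_labels)) / len(category_labels)
print(error_rate)                                 # ~0.67
```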

4.2. Evaluation Benchmarks

AggreFact [33] integrates nine existing datasets and consists of summary and document sentence pairs with human-annotated binary labels. The included sources are FRANK [5], SummEval [34], CLIFF [35], XSumFaith [36], Polytope [37], FactCC [15], Goyal’21 [38], Cao’22 [39], and Wang’20 [40]. Here, OLD denotes old summarizers such as Pointer-Generator [41] and BottomUp [42] models, EXFORMER denotes early Transformer-based summarizers (e.g., BERTSum [43] or GPT-2 [44] style systems), and FTSOTA denotes fine-tuned state-of-the-art pretrained Transformer summarizers (e.g., BART [25], PEGASUS [45], T5 [12]). The benchmark is organized by the family of the summarization model that produced the summaries (OLD, EXFORMER, FTSOTA) and by domain (CNN/DM [46], XSum [47]). CNN/DM typically supports more extractive, multi-sentence summaries, whereas XSum is highly abstractive and usually single-sentence. We focus on the AggreFact-FtSota subset, which reflects recent systems that generate fluent text while often exhibiting subtle and hard-to-detect factual hallucinations. For AggreFact-FtSota, we tune thresholds on the validation split and report balanced accuracy on the test split.
The SummaC [8] benchmark aggregates six datasets for factuality evaluation, each providing human binary labels for summary and document pairs. The constituent datasets are FactCC [15], FRANK [5], XSumFaith [36], CoGenSumm [48], Polytope [37], and SummEval [34]. As with AggreFact-FtSota, we use the validation split for threshold tuning and report balanced accuracy on the test split. For the FRANK subset, we additionally measure the LinkE error rate [6]. FRANK defines several error categories used in our analysis. LinkE occurs when a summary misstates the relation between otherwise correct facts, such as reversing event order or implying a false cause–effect. CorefE arises when referential expressions point to the wrong entity or are too ambiguous to resolve. GramE covers sentences whose grammar is so broken that factuality cannot be assessed. EntE denotes errors where the action is correct but the participants or their roles are wrong. CircE involves incorrect supporting details such as time, place, or manner. RelE indicates a mismatch in the core predicate or event. OutE refers to content introduced by the summary with no support in the source document.
DiverSumm [4] includes ChemSumm [49], ArXiv [50], GovReport [51], MultiNews [52], and QMSUM [53], covering scientific articles, meeting transcripts, government reports, and multi-document news. These datasets represent a mix of summarization styles; ChemSumm, QMSUM, and ArXiv feature primarily abstractive summaries, whereas GovReport and MultiNews contain more extractive summaries. Documents and summaries in this benchmark are also longer than in prior benchmarks.
FaithBench [16] targets hallucinations produced by recent LLM-based summarizers. It provides fine-grained hallucination labels created by human experts for summaries generated by ten modern LLMs. The dataset addresses the limitations of older benchmarks by focusing on challenging cases. Each summary span is annotated as consistent, benign, questionable, or unwanted. We use the official binary version released by the authors for direct evaluation. Because FaithBench does not include a validation split, we report ROC-AUC.

5. Results

Table 1, Table 2, Table 3, Table 4 and Table 5 present the performance of our proposed modules, DL-ACU and SD-ACU, when integrated into the FIZZ [2] and FENICE [3] baselines. The results indicate that our modules provide targeted improvements, with each showing distinct strengths on different types of benchmarks and error categories. The following subsections detail the specific contributions of each module.

5.1. Discourse Link-Aware Content Unit

We observe that DL-ACU performs well on its intended error type. As Table 2 shows, the integration of DL-ACU significantly reduces the error rate for discourse link errors (LinkE) on FRANK, with reductions of 6.81 percentage points for FIZZ and 11.36 percentage points for FENICE. This reduction is substantially greater than those for other error types, confirming that our AMR-based approach successfully captures discourse link errors that standard ACU methods [2,3,4] often overlook. Interestingly, while the LinkE error rate decreases notably, the overall balanced accuracy on the FRANK dataset improves only slightly, by 0.05 percentage points for FIZZ, and decreases by 0.71 percentage points for FENICE (Table 1). This outcome is expected, as the DL-ACU module is specifically designed to address discourse link errors (LinkE) and does not inherently target other prevalent error types.
Beyond its primary target, the DL-ACU module’s broader impact varies depending on the baseline. When applied to FIZZ, it reduces the error rate across nearly all error categories shown in Table 2, suggesting a general improvement. In contrast, with the stronger FENICE baseline, while the improvement on LinkE is even more pronounced (a reduction of 11.36 percentage points), we observe an increase in the error rate of 5.13 percentage points for Circumstance errors and 2.24 points for Entity errors. We interpret this mainly as an aggregation effect. FIZZ uses hard aggregation based on the minimum, so the single most diagnostic ACU tends to decide the label, and extra DL-ACUs rarely hurt entity or circumstance cases. FENICE uses soft aggregation based on the mean, so DL-ACUs that are only partially entailed or close to neutral still enter the average and can nudge borderline entity or circumstance instances toward false positives.
Table 6 provides a qualitative comparison of how different decomposition models handle summaries containing explicit discourse links. These summaries from the FRANK dataset contain markers such as before, after, to, and because of. Such markers signify that, for accurate factuality assessment, it is not enough to verify individual ACUs in isolation; the semantic connection between those ACUs must also be validated. In Examples 1 and 2, we observe that the baseline decomposers used by FIZZ and FENICE fail to capture discourse link relations, instead decomposing the summary into ACUs and losing the discourse link information. Our DL-ACU approach, in contrast, preserves the temporal order information. In Example 3, the attempt of ORCA-2 (used by FIZZ) to capture the purpose results in the awkward phrase (“the way is to”), which is not a well-formed atomic content unit, as the referent of “the way” is not specified. Many of the extracted ACUs also give the impression of being naively segmented triplets rather than genuinely considering discourse link relations. In contrast, T5-base (used by FENICE) successfully captures the purpose relation by producing a fluent and coherent DL-ACU in this example. These examples also highlight secondary effects of our post-processing pipeline. During AMR graph construction and graph-to-sequence generation, entities are occasionally replaced by pronouns such as ‘he’ or ‘they’, potentially introducing referential ambiguity. However, as Example 2 shows, our Coreference Resolution module (see Section 3.1.2) successfully restores the original mentions (‘police’ and ‘shooter’). Finally, in Examples 3 and 4, all methods manage to correctly extract discourse link-aware content units; we observe that our Overlap Filtering module (see Section 3.1.2) effectively eliminates redundant or overlapping DL-ACUs, preventing distortion in the NLI scoring and aggregation phases.

5.2. Selective Document-Atomic Content Unit

The Selective Document-Atomic Content Unit (SD-ACU) module also shows notable performance gains, though the impact was not uniform across all datasets. We observe a more pronounced trend of improvement on benchmarks rich in extractive summaries. For instance, the gains are more evident on the CNN/DM portion of the AggreFact benchmark (Table 3), the MultiNews and GovReport subsets within DiverSumm (Table 4), and most constituent datasets of the SummaC benchmark, with the exception of the highly abstractive XSumFaith (Table 1).
The quantitative results further substantiate this trend. On the largely extractive SummaC benchmark (Table 1), the ‘FENICE + SD-ACU’ configuration achieves the highest average balanced accuracy of 77.12%, which is 1.36 percentage points higher than the baseline FENICE. This strong performance highlights the module’s effectiveness in contexts where direct, fine-grained alignment between summary and document facts is most beneficial.
The effectiveness of this module is illustrated with an example from the MultiNews dataset, where a summary sentence appears verbatim in the source document: “Action News Jax spoke to Professor Tiri Fellows, who is deaf and teaches American Sign Language.” Baseline decomposers such as ORCA-2 (used by FIZZ) and T5-base (used by FENICE) extract multiple ACUs from this summary, including the ACU “Tiri Fellows is deaf.” In the NLI scoring phase in Figure 1, this simple ACU (as the hypothesis) is compared against the entire, complex document sentence as the premise, which contains other potentially distracting facts such as “Action News Jax spoke to Professor Tiri Fellows” or “Professor Tiri Fellows teaches American Sign Language.” In contrast, our SD-ACU method decomposes the document sentence into its own atomic content units. This enables a far more precise ACU-to-ACU alignment by creating a perfectly matched premise; in this case, the premise also becomes “Tiri Fellows is deaf.” This symmetrical alignment provides a much cleaner and more robust signal to the NLI model. We hypothesize that this is because symmetrically matching the granularity of both the document and the summary is particularly beneficial when summary facts have direct counterparts in the source text, simplifying the factuality verification task.

5.3. Synergistic Effects and Overall Performance

When both modules are combined, they often exhibit a synergistic effect, addressing a wider range of factual inconsistencies. This is particularly evident on the FaithBench benchmark (Table 5), where the full ‘DL-ACU + SD-ACU’ model achieves the highest ROC-AUC for both the FIZZ and FENICE families, demonstrating consistent, incremental improvements. Similarly, on the AggreFact-FtSota benchmark (Table 3), the combined ‘FENICE + DL-ACU + SD-ACU’ model obtains the best overall average balanced accuracy among all systems. These results suggest that the two modules are complementary. The DL-ACU captures complex discourse link errors, while the SD-ACU improves the precision of atomic fact verification, especially in extractive contexts.

6. Conclusions

Recent research on hallucination detection in summarization involves decomposing the summary into atomic content units (ACUs) and then verifying the factuality of each ACU in relation to the original text. However, as the ACUs generated in this process become increasingly shorter, a limitation arises in that discourse link information such as temporal order, causality, and purpose is not taken into consideration. To address this limitation, this study converts the summary into an AMR graph and extracts Discourse Link-Aware Content Units by constructing subgraphs centered on discourse link relations. In addition, unlike previous studies that decomposed only the summary into ACUs, this study selectively decomposes only the specific parts of the document corresponding to the summary ACUs, enabling precise entailment evaluation at the ACU-to-ACU level. As a result, the accuracy of NLI-based hallucination detection systems for summarization is improved.
The experiments are conducted using the AggreFact-FtSota, DiverSumm, FaithBench, and SummaC benchmarks, showing performance improvements over the FIZZ and FENICE baselines in most cases. In particular, for the discourse link error type LinkE in the FRANK dataset, a significantly larger reduction in the error rate is achieved compared to other error types. This study is inspired by the requirement that an ACU must not deviate from the summary’s content while fully incorporating all conveyed meaning. We propose a Discourse Link-Aware Content Unit based on the AMR graph, complemented by our Selective Document-Atomic Content Unit approach for precise and efficient alignment, and demonstrate that these methods together can address the fundamental limitations of recent NLI-based detection methods that decompose the summary into factually incomplete ACUs.

7. Limitations

Our approach yields consistent gains on benchmarks where discourse link errors are explicitly present, but the improvements are less stable on benchmarks that do not intentionally inject such errors. This suggests that the method is relatively specialized for discourse link error. The contribution is a simple augmentation that adds atomic content units carrying discourse links to the pipeline, and it primarily targets the LinkE category in FRANK, which limits coverage of other types of factual errors.
Given that recent ACU decomposition typically relies on generative models, it is important to ensure both conciseness and completeness between the summary and its ACUs. Here, conciseness means that each summary ACU must be supported by the summary itself without introducing new information or over-extending the interpretation. We enforce this criterion by applying an NLI-based filter that checks whether each ACU is entailed by the summary. In contrast, completeness means that the set of ACUs should collectively capture all of the meaning conveyed by the summary. One frequent omission is the discourse link relation addressed in this work, but other information can also be missed. In particular, modality and hedging (e.g., uncertainty markers), attribution and perspective, and non-core semantic roles (e.g., manner, condition, concession) may still be lost by decomposition. While our module restores discourse links, an ACU-free approach could further mitigate such losses. A more systematic investigation of such omissions, together with principled methods to assess and mitigate them, remains necessary.
Finally, because a modular pipeline (AMR parsing → coreference resolution → NLI filtering) can accumulate mismatches and propagate errors across components, an end-to-end verifier that jointly parses, aligns, and judges factuality is a compelling future direction. Such a model could reduce threshold sensitivity and component-level brittleness.

Author Contributions

Conceptualization, D.L. and H.J.; methodology, D.L., H.J. and Y.S.C.; software, D.L.; validation, D.L. and Y.S.C.; formal analysis, D.L. and Y.S.C.; investigation, D.L.; resources, D.L., H.J. and Y.S.C.; data curation, D.L.; writing—original draft preparation, D.L. and Y.S.C.; writing—review and editing, D.L., H.J. and Y.S.C.; visualization, D.L.; supervision, H.J. and Y.S.C.; project administration, Y.S.C.; funding acquisition, Y.S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant (No. RS-2025-25422680, No. RS-2020-II201373), and the National Research Foundation of Korea (NRF) grant (No. RS-2025-00520618) funded by the Korean Government (MSIT).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The SummaC dataset can be downloaded from https://github.com/tingofurro/summac (accessed on 25 March 2025). The AggreFact-FtSota dataset can be downloaded from https://github.com/Liyan06/AggreFact (accessed on 10 April 2025). The DiverSumm dataset can be downloaded from https://github.com/HJZnlp/Infuse (accessed on 15 May 2025). The FaithBench dataset can be downloaded from https://github.com/vectara/FaithBench (accessed on 19 June 2025).

Conflicts of Interest

Author Hyuckchul Jung was employed by the company Meta. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ACU: Atomic Content Unit
DL-ACU: Discourse Link-Aware Content Unit
SD-ACU: Selective Document-Atomic Content Unit

References

  1. Liu, Y.; Fabbri, A.; Liu, P.; Zhao, Y.; Nan, L.; Han, R.; Han, S.; Joty, S.; Wu, C.S.; Xiong, C.; et al. Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 4140–4170. [Google Scholar]
  2. Yang, J.; Yoon, S.; Kim, B.; Lee, H. FIZZ: Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 30–45. [Google Scholar]
  3. Scirè, A.; Ghonim, K.; Navigli, R. FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 14148–14161. [Google Scholar]
  4. Zhang, H.; Xu, Y.; Perez-Beltrachini, L. Fine-Grained Natural Language Inference Based Faithfulness Evaluation for Diverse Summarisation Tasks. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), St. Julian’s, Malta, 17–22 March 2024; pp. 1701–1722. [Google Scholar]
  5. Pagnoni, A.; Balachandran, V.; Tsvetkov, Y. Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 4812–4829. [Google Scholar]
  6. Luo, G.; Fan, W.; Li, M.; He, Y.; Yang, Y.; Bao, F. On the Intractability to Synthesize Factual Inconsistencies in Summarization. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, 17–22 March 2024; pp. 1026–1037. [Google Scholar]
  7. Qiu, H.; Huang, K.H.; Qu, J.; Peng, N. AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; pp. 594–608. [Google Scholar]
  8. Laban, P.; Schnabel, T.; Bennett, P.N.; Hearst, M.A. SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization. Trans. Assoc. Comput. Linguist. 2022, 10, 163–177. [Google Scholar] [CrossRef]
  9. Zha, Y.; Yang, Y.; Li, R.; Hu, Z. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 11328–11348. [Google Scholar]
  10. Stacey, J.; Minervini, P.; Dubossarsky, H.; Camburu, O.M.; Rei, M. Atomic Inference for NLI with Generated Facts as Atoms. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 10188–10204. [Google Scholar]
  11. Mitra, A.; Corro, L.D.; Mahajan, S.; Codas, A.; Simoes, C.; Agarwal, S.; Chen, X.; Razdaibiedina, A.; Jones, E.; Aggarwal, K.; et al. Orca 2: Teaching Small Language Models How to Reason. arXiv 2023, arXiv:2311.11045. [Google Scholar] [CrossRef]
  12. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
  13. Banarescu, L.; Bonial, C.; Cai, S.; Georgescu, M.; Griffitt, K.; Hermjakob, U.; Knight, K.; Koehn, P.; Palmer, M.; Schneider, N. Abstract Meaning Representation for Sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, 8–9 August 2013; pp. 178–186. [Google Scholar]
  14. Chen, Y.; Eger, S. MENLI: Robust Evaluation Metrics from Natural Language Inference. Trans. Assoc. Comput. Linguist. 2023, 11, 804–825. [Google Scholar] [CrossRef]
  15. Kryscinski, W.; McCann, B.; Xiong, C.; Socher, R. Evaluating the Factual Consistency of Abstractive Text Summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 9332–9346. [Google Scholar]
  16. Bao, F.S.; Li, M.; Qu, R.; Luo, G.; Wan, E.; Tang, Y.; Fan, W.; Tamber, M.S.; Kazi, S.; Sourabh, V.; et al. FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Albuquerque, Mexico, 29 April–4 May 2025; pp. 448–461. [Google Scholar]
  17. Scialom, T.; Dray, P.A.; Lamprier, S.; Piwowarski, B.; Staiano, J.; Wang, A.; Gallinari, P. QuestEval: Summarization Asks for Fact-based Evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6594–6604. [Google Scholar]
  18. Fabbri, A.; Wu, C.S.; Liu, W.; Xiong, C. QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 2587–2601. [Google Scholar]
  19. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 2511–2522. [Google Scholar]
  20. Wan, D.; Sinha, K.; Iyer, S.; Celikyilmaz, A.; Bansal, M.; Pasunuru, R. ACUEval: Fine-grained Hallucination Evaluation and Correction for Abstractive Summarization. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10036–10056. [Google Scholar]
  21. Ribeiro, L.F.R.; Liu, M.; Gurevych, I.; Dreyer, M.; Bansal, M. FactGraph: Evaluating Factuality in Summarization with Semantic Graph Representations. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 3238–3253. [Google Scholar]
  22. Goyal, T.; Durrett, G. Evaluating Factuality in Generation with Dependency-level Entailment. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3592–3603. [Google Scholar]
  23. Nenkova, A.; Passonneau, R. Evaluating Content Selection in Summarization: The Pyramid Method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston, MA, USA, 2–7 April 2004; pp. 145–152. [Google Scholar]
  24. Goodman, M.W. Penman: An Open-Source Library and Tool for AMR Graphs. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, 5–10 July 2020; pp. 312–319. [Google Scholar]
  25. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7871–7880. [Google Scholar]
  26. Wein, S.; Opitz, J. A Survey of AMR Applications. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 6856–6875. [Google Scholar]
  27. Nawrath, M.; Nowak, A.; Ratz, T.; Walenta, D.; Opitz, J.; Ribeiro, L.; Sedoc, J.; Deutsch, D.; Mille, S.; Liu, Y.; et al. On the Role of Summary Content Units in Text Summarization Evaluation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), Mexico City, Mexico, 16–21 June 2024; pp. 272–281. [Google Scholar]
  28. Martinelli, G.; Barba, E.; Navigli, R. Maverick: Efficient and Accurate Coreference Resolution Defying Recent Trends. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 13380–13394. [Google Scholar]
  29. Kamoi, R.; Goyal, T.; Diego Rodriguez, J.; Durrett, G. WiCE: Real-World Entailment for Claims in Wikipedia. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 7561–7583. [Google Scholar]
  30. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  31. Schuster, T.; Fisch, A.; Barzilay, R. Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 624–643. [Google Scholar]
  32. Pradhan, S.; Moschitti, A.; Xue, N.; Uryupina, O.; Zhang, Y. CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In Proceedings of the Joint Conference on EMNLP and CoNLL, Shared Task, Jeju Island, Republic of Korea, 12–14 July 2012; pp. 1–40. [Google Scholar]
  33. Tang, L.; Goyal, T.; Fabbri, A.; Laban, P.; Xu, J.; Yavuz, S.; Kryscinski, W.; Rousseau, J.; Durrett, G. Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 11626–11644. [Google Scholar]
  34. Fabbri, A.R.; Kryściński, W.; McCann, B.; Xiong, C.; Socher, R.; Radev, D. SummEval: Re-evaluating Summarization Evaluation. Trans. Assoc. Comput. Linguist. 2021, 9, 391–409. [Google Scholar] [CrossRef]
  35. Cao, S.; Wang, L. CLIFF: Contrastive Learning for Improving Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6633–6649. [Google Scholar]
  36. Maynez, J.; Narayan, S.; Bohnet, B.; McDonald, R. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1906–1919. [Google Scholar]
  37. Huang, D.; Cui, L.; Yang, S.; Bao, G.; Wang, K.; Xie, J.; Zhang, Y. What Have We Achieved on Text Summarization? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 446–469. [Google Scholar]
  38. Goyal, T.; Durrett, G. Annotating and Modeling Fine-grained Factuality in Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1449–1462. [Google Scholar]
  39. Cao, M.; Dong, Y.; Cheung, J. Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 3340–3354. [Google Scholar]
  40. Wang, A.; Cho, K.; Lewis, M. Asking and Answering Questions to Evaluate the Factual Consistency of Summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 5008–5020. [Google Scholar]
  41. See, A.; Liu, P.J.; Manning, C.D. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1073–1083. [Google Scholar]
  42. Gehrmann, S.; Deng, Y.; Rush, A. Bottom-Up Abstractive Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4098–4109. [Google Scholar]
  43. Liu, Y.; Lapata, M. Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3730–3740. [Google Scholar]
  44. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  45. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P.J. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. In Proceedings of the 37th International Conference on Machine Learning, Online, 13–18 July 2020. [Google Scholar]
  46. Hermann, K.M.; Kočiský, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read and Comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 1693–1701. [Google Scholar]
  47. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1797–1807. [Google Scholar]
  48. Falke, T.; Ribeiro, L.F.R.; Utama, P.A.; Dagan, I.; Gurevych, I. Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2214–2220. [Google Scholar]
  49. Adams, G.; Nguyen, B.; Smith, J.; Xia, Y.; Xie, S.; Ostropolets, A.; Deb, B.; Chen, Y.J.; Naumann, T.; Elhadad, N. What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 10520–10542. [Google Scholar]
  50. Cohan, A.; Dernoncourt, F.; Kim, D.S.; Bui, T.; Kim, S.; Chang, W.; Goharian, N. A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; pp. 615–621. [Google Scholar]
  51. Huang, L.; Cao, S.; Parulian, N.; Ji, H.; Wang, L. Efficient Attentions for Long Document Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 1419–1436. [Google Scholar]
  52. Fabbri, A.; Li, I.; She, T.; Li, S.; Radev, D. Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1074–1084. [Google Scholar]
  53. Zhong, M.; Yin, D.; Yu, T.; Zaidi, A.; Mutuma, M.; Jha, R.; Awadallah, A.H.; Celikyilmaz, A.; Liu, Y.; Qiu, X.; et al. QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5905–5921. [Google Scholar]
Figure 1. Overview of our two complementary components. (Left) The pipeline for Selective Document-Atomic Content Unit (SD-ACU) Decomposition. (Right) The pipeline for Discourse Link-Aware Content Unit (DL-ACU) Decomposition. Numbered steps (1–7) align with the pipeline described in Section 3.
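To make the per-unit document-sentence selection in the SD-ACU pipeline (Figure 1, left) concrete, the following is a minimal sketch of one way such a selection step could be realized with cosine similarity over sentence embeddings. The embedding model, the top-k value, and all variable names are illustrative assumptions rather than the configuration used in this work.

```python
# Illustrative sketch only: for a summary ACU, pick the top-k most similar
# document sentences as candidates for document-ACU decomposition.
# Model choice and k are assumptions, not the paper's actual configuration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def select_sentences(summary_acu: str, document_sentences: list[str], k: int = 3) -> list[str]:
    acu_emb = model.encode(summary_acu, convert_to_tensor=True)
    doc_embs = model.encode(document_sentences, convert_to_tensor=True)
    scores = util.cos_sim(acu_emb, doc_embs)[0]                     # similarity to every sentence
    top_idx = scores.topk(min(k, len(document_sentences))).indices  # indices of best matches
    return [document_sentences[i] for i in top_idx.tolist()]
```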
Table 1. Balanced accuracy (%) on SummaC subsets. AVG is the average of the six datasets. Bold denotes the best score in each column.
| Method | CoGenSumm | FactCC | FRANK | SummEval | Polytope | XSumFaith | AVG |
|---|---|---|---|---|---|---|---|
| Other comparators | | | | | | | |
| DAE | 63.40 | 75.90 | 61.70 | 70.30 | 62.80 | 50.80 | 64.15 |
| FactCC | 65.40 | 77.30 | 59.40 | 58.53 | 58.20 | 56.54 | 62.56 |
| SummaC-ZS | 70.40 | 83.81 | 78.51 | 78.70 | 62.00 | 58.40 | 71.97 |
| SummaC-Conv | 64.70 | 89.06 | 81.62 | 81.70 | 62.70 | 65.44 | 74.20 |
| MENLI | 55.20 | 58.17 | 75.11 | 52.73 | 57.07 | 67.15 | 60.91 |
| MFMA | 64.13 | 84.88 | 80.62 | 75.50 | 58.31 | 55.07 | 69.75 |
| AlignScore | 70.16 | 85.88 | 82.06 | 56.42 | 60.64 | 73.44 | 71.43 |
| InFusE | 75.61 | 87.59 | 79.39 | 73.54 | 54.44 | 67.64 | 73.04 |
| FIZZ family | | | | | | | |
| FIZZ (baseline) | 59.18 | 78.66 | 80.44 | 62.92 | 58.09 | 72.33 | 68.60 |
| FIZZ + DL-ACU | 62.43 | 79.46 | 80.49 | 63.09 | 58.47 | 72.72 | 69.44 |
| FIZZ + SD-ACU | 63.23 | 81.50 | 80.70 | 64.17 | 60.14 | 72.55 | 70.38 |
| FIZZ + DL-ACU + SD-ACU | 63.45 | 81.52 | 80.68 | 63.35 | 58.89 | 72.97 | 70.14 |
| FENICE family | | | | | | | |
| FENICE (baseline) | 76.50 | 84.29 | 83.05 | 72.32 | 65.48 | 72.90 | 75.76 |
| FENICE + DL-ACU | 76.43 | 84.29 | 82.34 | 72.29 | 65.49 | 72.62 | 75.58 |
| FENICE + SD-ACU | 78.02 | 85.33 | 83.28 | 76.66 | 66.79 | 72.66 | 77.12 |
| FENICE + DL-ACU + SD-ACU | 77.87 | 84.98 | 83.33 | 74.89 | 66.24 | 73.83 | 76.86 |
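For reference, balanced accuracy as reported in Tables 1 and 3 is the mean of the per-class recalls over the binary consistent/inconsistent labels. Below is a minimal sketch of the standard computation, assuming detector scores have already been thresholded into binary predictions; the labels and predictions are placeholders, and the threshold-selection procedure itself is not shown.

```python
from sklearn.metrics import balanced_accuracy_score

# y_true: 1 = factually consistent summary, 0 = inconsistent (gold labels)
# y_pred: binary predictions obtained by thresholding a detector's scores
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
print(balanced_accuracy_score(y_true, y_pred))  # mean of recall on each class
```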
Table 2. Error Rate (%, smaller is better) by error category. Rows labeled Δ report the difference from the family baseline, computed as (baseline − method) in percentage points (pp); thus, positive Δ indicates a reduction in error rate (better), while negative indicates an increase (worse).
| Method | LinkE | CorefE | GramE | EntE | CircE | RelE | OutE | OtherE |
|---|---|---|---|---|---|---|---|---|
| FIZZ family | | | | | | | | |
| FIZZ (baseline) | 45.45 | 70.75 | 34.69 | 30.10 | 21.79 | 15.13 | 5.35 | 75.00 |
| FIZZ + DL-ACU | 38.64 | 66.98 | 33.33 | 25.37 | 20.51 | 14.47 | 4.71 | 60.00 |
| Δ vs. baseline (pp) | 6.81 | 3.77 | 1.36 | 4.73 | 1.28 | 0.66 | 0.64 | 15.00 |
| FENICE family | | | | | | | | |
| FENICE (baseline) | 90.91 | 94.34 | 72.11 | 65.92 | 55.13 | 49.34 | 32.33 | 90.00 |
| FENICE + DL-ACU | 79.55 | 93.40 | 73.47 | 68.16 | 60.26 | 49.34 | 34.48 | 90.00 |
| Δ vs. baseline (pp) | 11.36 | 0.94 | −1.36 | −2.24 | −5.13 | 0 | −2.15 | 0 |
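The Δ rows are plain percentage-point differences: for the LinkE column, 45.45 − 38.64 = 6.81 pp for FIZZ and 90.91 − 79.55 = 11.36 pp for FENICE. The short sketch below reproduces the same arithmetic over all categories; the variable names are illustrative, and the values are copied directly from Table 2.

```python
# Per-category error rates (%) from Table 2; Δ = baseline − method in pp,
# so a positive value means the DL-ACU variant makes fewer errors of that type.
categories = ["LinkE", "CorefE", "GramE", "EntE", "CircE", "RelE", "OutE", "OtherE"]
fizz_baseline = [45.45, 70.75, 34.69, 30.10, 21.79, 15.13, 5.35, 75.00]
fizz_dl_acu   = [38.64, 66.98, 33.33, 25.37, 20.51, 14.47, 4.71, 60.00]

for cat, base, ours in zip(categories, fizz_baseline, fizz_dl_acu):
    print(f"{cat}: Δ = {base - ours:+.2f} pp")
```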
Table 3. Balanced accuracy (%) on AggreFact-FtSota subsets. AVG is the average of the two columns. Bold denotes the best score in each column.
| Method | AggreFact-Cnn-FtSota | AggreFact-XSum-FtSota | AVG |
|---|---|---|---|
| Other comparators | | | |
| DAE | 59.40 | 73.10 | 66.25 |
| FactCC | 57.60 | 54.88 | 56.24 |
| SummaC-ZS | 65.19 | 54.08 | 59.64 |
| SummaC-Conv | 61.72 | 63.52 | 62.62 |
| MENLI | 62.24 | 65.30 | 63.77 |
| MFMA | 61.86 | 55.00 | 58.43 |
| AlignScore | 62.72 | 69.44 | 66.08 |
| InFusE | 64.51 | 65.82 | 65.16 |
| FIZZ family | | | |
| FIZZ (baseline) | 65.86 | 69.25 | 67.56 |
| FIZZ + DL-ACU | 67.11 | 69.82 | 68.47 |
| FIZZ + SD-ACU | 66.46 | 69.44 | 67.95 |
| FIZZ + DL-ACU + SD-ACU | 67.71 | 70.18 | 68.95 |
| FENICE family | | | |
| FENICE (baseline) | 66.23 | 73.83 | 70.03 |
| FENICE + DL-ACU | 66.29 | 74.08 | 70.19 |
| FENICE + SD-ACU | 68.43 | 73.44 | 70.94 |
| FENICE + DL-ACU + SD-ACU | 67.17 | 74.86 | 71.02 |
Table 4. ROC-AUC (%) on DiverSumm subsets. CSM, MNW, QMS, AXV, and GOV refer to ChemSum, MultiNews, QMSUM, ArXiv, and GovReport, respectively. AVG is the average of the five datasets. Bold denotes the best score in each column.
| Method | CSM | MNW | QMS | AXV | GOV | AVG |
|---|---|---|---|---|---|---|
| Other comparators | | | | | | |
| FactCC | 50.36 | 34.41 | 46.62 | 61.87 | 66.97 | 52.05 |
| SummaC-ZS | 59.36 | 46.72 | 44.92 | 68.16 | 72.58 | 58.75 |
| SummaC-Conv | 53.76 | 52.70 | 49.44 | 61.50 | 71.13 | 57.71 |
| MENLI | 53.19 | 60.53 | 44.45 | 66.85 | 34.39 | 51.48 |
| MFMA | 60.34 | 49.94 | 43.33 | 72.30 | 60.32 | 57.65 |
| AlignScore | 58.59 | 42.69 | 59.21 | 72.77 | 85.25 | 63.30 |
| InFusE | 46.04 | 40.05 | 46.71 | 70.49 | 78.19 | 56.70 |
| FIZZ family | | | | | | |
| FIZZ (baseline) | 54.01 | 38.67 | 51.22 | 62.15 | 64.16 | 54.44 |
| FIZZ + DL-ACU | 53.34 | 38.20 | 49.34 | 61.34 | 64.52 | 53.34 |
| FIZZ + SD-ACU | 54.27 | 39.82 | 51.79 | 62.81 | 66.20 | 54.98 |
| FIZZ + DL-ACU + SD-ACU | 54.89 | 40.16 | 52.44 | 62.62 | 66.02 | 55.23 |
| FENICE family | | | | | | |
| FENICE (baseline) | 54.83 | 43.50 | 55.92 | 75.94 | 74.34 | 60.91 |
| FENICE + DL-ACU | 52.60 | 43.96 | 56.11 | 76.13 | 70.23 | 59.41 |
| FENICE + SD-ACU | 51.57 | 47.18 | 50.00 | 73.83 | 77.47 | 63.57 |
| FENICE + DL-ACU + SD-ACU | 50.60 | 45.45 | 52.16 | 72.27 | 75.84 | 59.26 |
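Unlike the balanced-accuracy results, the ROC-AUC values in Tables 4 and 5 can be computed directly from each detector's continuous consistency scores, so no decision threshold needs to be chosen. A minimal sketch with scikit-learn follows; the labels and scores are placeholders.

```python
from sklearn.metrics import roc_auc_score

# y_true: 1 = consistent, 0 = inconsistent; y_score: raw detector scores
y_true  = [1, 0, 1, 1, 0, 0]
y_score = [0.91, 0.40, 0.77, 0.62, 0.58, 0.13]
print(roc_auc_score(y_true, y_score))  # area under the ROC curve
```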
Table 5. ROC-AUC (%) on FaithBench. Bold denotes the best score.
| Method | FaithBench |
|---|---|
| Other comparators | |
| FactCC | 49.54 |
| SummaC-ZS | 47.52 |
| SummaC-Conv | 52.16 |
| MENLI | 49.78 |
| MFMA | 52.09 |
| AlignScore | 48.33 |
| InFusE | 49.87 |
| FIZZ family | |
| FIZZ (baseline) | 52.50 |
| FIZZ + DL-ACU | 53.32 |
| FIZZ + SD-ACU | 54.27 |
| FIZZ + DL-ACU + SD-ACU | 54.57 |
| FENICE family | |
| FENICE (baseline) | 59.05 |
| FENICE + DL-ACU | 60.20 |
| FENICE + SD-ACU | 60.81 |
| FENICE + DL-ACU + SD-ACU | 61.00 |
Table 6. Examples from FRANK in SummaC. Baselines show ACUs produced separately by FIZZ (paired with ORCA-2) and FENICE (paired with T5-base). Our method yields a single discourse link-aware content unit that preserves temporal order, causality, and purpose, with discourse markers highlighted in red.
Example 1
FRANK Summary: However, the shooter tried to ram the gates before firing at the guard at least once.
FIZZ (ORCA-2):
- A shooter was involved.
- The shooter tried to ram the gates.
- The shooter fired at the guard.
- The guard was fired at least once.
FENICE (T5-base):
- The shooter tried to ram the gates.
- The shooter fired at the guard at least once.
Our Method (DL-ACU):
- The shooter rammed the gates before firing at the guard.

Example 2
FRANK Summary: Police believe the shooter barricaded himself inside after noticing a couple fighting in a car.
FIZZ (ORCA-2):
- Police believe the shooter barricaded himself inside.
- Shooter noticed a couple fighting in a car.
FENICE (T5-base):
- Police believe the shooter barricaded himself inside.
- The shooter noticed a couple fighting in a car.
Our Method (DL-ACU):
- Police believe the shooter barricaded himself inside after noticing a couple fighting in a car.

Example 3
FRANK Summary: French scientists say they have found a way to hide the earth’s vast mountains of Mont Blanc.
FIZZ (ORCA-2):
- French scientists have found a way.
- The way is to hide the earth’s vast mountains.
- The earth’s vast mountains are called Mont Blanc.
FENICE (T5-base):
- French scientists found a way to hide the earth’s vast mountains of Mont Blanc.
Our Method (DL-ACU):
- French scientists found a way in order to hide the earth’s vast mountains of Mont Blanc.

Example 4
FRANK Summary: The banks in the Indian capital, Delhi, have been shut down because of corruption and corruption, the BBC has learned.
FIZZ (ORCA-2):
- Banks in Delhi have been shut down.
- The reason for the shut down is corruption.
- The source of the information is the BBC.
FENICE (T5-base):
- The banks in the Indian capital, Delhi, have been shut down because of corruption.
Our Method (DL-ACU):
- The banks in Delhi were shut down because of corruption, the BBC has learned.
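To illustrate why the merged unit in Example 1 is easier to verify than its isolated ACUs, the sketch below scores a DL-ACU against a document-style premise with an off-the-shelf MNLI model. The premise string, the model choice, and the label handling are illustrative assumptions only; they do not reproduce the NLI components used in the evaluated systems.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"  # illustrative off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Hypothetical document sentence, for illustration only; in practice this would
# be the document text (or document ACUs) retrieved for the summary unit.
premise = "The shooter opened fire at the guard and only afterwards tried to ram the gates."
hypothesis = "The shooter rammed the gates before firing at the guard."  # DL-ACU from Table 6

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# roberta-large-mnli label order: 0 = contradiction, 1 = neutral, 2 = entailment
for label, p in zip(["contradiction", "neutral", "entailment"], probs.tolist()):
    print(f"{label}: {p:.3f}")
```

Because the DL-ACU keeps the temporal marker "before", the premise-hypothesis pair exposes the ordering conflict to the NLI model, whereas the two isolated ACUs from the baselines would each be checked without that link.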