Generating Fluent Fact Checking Explanations with Unsupervised Post-Editing

Abstract: Fact-checking systems have become important tools to verify fake and misleading news. These systems become more trustworthy when human-readable explanations accompany the veracity labels. However, manual collection of such explanations is expensive and time-consuming. Recent work has used extractive summarization to select a sufficient subset of the most important facts from the ruling comments (RCs) of a professional journalist to obtain fact-checking explanations. However, these explanations lack fluency and sentence coherence. In this work, we present an iterative edit-based algorithm that uses only phrase-level edits to perform unsupervised post-editing of disconnected RCs. To regulate our editing algorithm, we use a scoring function with components including fluency and semantic preservation. In addition, we show the applicability of our approach in a completely unsupervised setting. We experiment with two benchmark datasets, LIAR-PLUS and PubHealth, and show that our model generates explanations that are fluent, readable, non-redundant, and cover information important for the fact check.


INTRODUCTION
In today's era of social media, the spread of news is a click away, regardless of whether it is fake or real. However, the quick propagation of fake news has repercussions on people's lives. To alleviate these consequences, independent teams of professional fact checkers manually verify the veracity and credibility of news, which is time- and labor-intensive, making the process expensive and less scalable. Therefore, the need for accurate, scalable, and explainable automatic fact checking systems is inevitable [16].
Current automatic fact checking systems perform veracity prediction for given claims based on evidence documents (Augenstein et al. [6], Thorne et al. [37], inter alia) or based on long lists of supporting ruling comments (RCs; Alhindi et al. [1], Wang [38]). RCs are in-depth explanations for predicted veracity labels, but due to their sizable content they are challenging to read and not useful as explanations for human readers. Recent work [5, 17] has thus proposed to use automatic summarization to select a subset of sentences from long RCs and use them as short layman explanations. However, a purely extractive approach [5] means sentences are cherry-picked from different parts of the corresponding RCs, and as a result, explanations are often disjoint and non-fluent.
While a sequence-to-sequence model trained on parallel data can partially alleviate these problems, as Kotonya and Toni [17] propose, it is an expensive affair in terms of the large amount of data and compute required to train such models. Therefore, in this work, we focus on unsupervised post-editing of explanations extracted from RCs.
Recent studies have addressed unsupervised post-editing to generate paraphrases [20] and sentence simplifications [18]. However, they operate on short single sentences and perform exhaustive word-level edits, or a combination of word- and phrase-level edits, which has limited applicability for longer text inputs with multiple sentences, e.g., veracity explanations, due to prohibitive convergence times.
Claim: EU suspends delivery of 10 million masks over quality issues. Label: False

Explanation from Ruling Comments: After a first batch of 1.5 million masks was shipped to 17 of the 27 member states and Britain, 600,000 items did not have European certificates and medical standards. As part of its efforts to tackle the COVID-19 crisis, this month the EU's executive arm started dispatching the masks to health care workers. (R) It was set to be distributed in weekly installments over six weeks. (D) "We have decided to suspend future deliveries of these masks," Commission health spokesman Stefan De Keersmaecker said. (P)

Post-Edited Explanation: As part of its efforts to tackle the COVID-19 crisis, this month the EU's executive arm started dispatching the masks to health care workers. (R) After a first batch of 1.5 million masks was shipped to 17 of the 27 member states and Britain, 600,000 items did not have European certificates and did not comply with (I) medical standards. The Commission has decided to stop future deliveries of these masks, De Keersmaecker said. (P)

Fig. 1. Example of a post-edited explanation from PubHealth that was initially extracted from RCs. We illustrate four post-editing steps: reordering (R), insertion (I), deletion (D), and paraphrasing (P).
We thus propose an iterative edit-based algorithm that post-edits extractive explanations to make them more concise, readable, and fluent, and to create a coherent story. Our method finds the best post-edited explanation candidate according to a scoring function, ensuring the quality of explanations in terms of fluency and readability, semantic preservation, and conciseness (§3.2.2). To ensure that the sentences of the candidate explanations are grammatically correct, we also perform grammar checking (§3.2.4). As a second step, we apply paraphrasing to further improve the conciseness and human readability of the explanations (§3.2.5).
In summary, our main contributions are as follows:
• To the best of our knowledge, we are the first to explore an iterative unsupervised edit-based algorithm using only phrase-level edits, which leads to feasible solutions for long text inputs.
• We show how combining an iterative algorithm with grammatical correction and paraphrasing-based post-processing leads to fluent and easy-to-read explanations.
• We conduct extensive experiments on the LIAR-PLUS [38] and PubHealth [17] fact checking datasets. Our automated evaluation confirms the success of our approach in preserving the semantics important for the fact check and in enhancing the readability of the generated explanations. Our manual evaluation confirms that our approach improves the fluency and conciseness of the generated explanations.

RELATED WORK
The streams of work most closely related to ours are explainable fact checking, generative approaches to explainability, and post-editing for language generation.

Explainable Fact Checking
Recent work has produced fact-checking explanations by highlighting words in tweets using neural attention [23].
However, their explanations are used only to evaluate and compare the proposed model with other baselines without neural attention. Wu et al. [39] propose to model evidence documents with decision trees, which are inherently interpretable ML models. In a recent study, Atanasova et al. [5] present a multi-task approach to generate free-text explanations for political claims jointly with predicting the veracity of the claims. They formulate an extractive summarization task to select a few important sentences from a long fact checking report. Atanasova et al. [4] also perform extractive explanation generation, guided by a set of diagnostic properties of explanations, and evaluate on the FEVER [37] fact checking dataset, where explanation sentences have to be extracted from Wikipedia documents.
In the domain of public health claims, Kotonya and Toni [17] propose to generate explanations separately from the task of veracity prediction. Mishra et al. [25] generate summaries of evidence documents from the Web using an attention-based mechanism and show that their summaries perform better than using the original evidence documents directly. Similarly to Atanasova et al. [5] and Kotonya and Toni [17], we present a generative approach for creating fact checking explanations. In contrast to related work, we propose an unsupervised post-editing approach to improve the fluency and readability of previously extracted fact checking explanations.

Generative Approaches to Explainability
While most work on explanation generation proposes methods to highlight portions of the inputs (Camburu et al. [10], DeYoung et al. [12], inter alia), some work focuses on generative approaches to explainability. Camburu et al. [10] propose combining an explanation generation model and a target prediction model, in a pipeline or jointly, for Natural Language Inference with abstractive explanations about the entailment of two sentences. Stammbach and Ash [34] propose few-shot training of the GPT-3 [9] model to explain a fact check from retrieved evidence snippets. GPT-3, however, is a limited-access model with high computational costs. As in our work, Kotonya and Toni [17] first extract evidence sentences, which are then summarised by an abstractive summarisation model trained on the PubHealth dataset. In contrast, we are the first to focus on unsupervised post-editing of explanations produced using automatic summarization.

Post-Editing for Language Generation
Previous work has addressed unsupervised post-editing for multiple tasks such as paraphrase generation [20], sentence simplification [18], and sentence summarization [32]. However, all of these tasks handle shorter inputs than the long multi-sentence extractive explanations that we work with. Furthermore, they perform exhaustive edit operations at the word level, and sometimes additionally at the phrase level, both of which increase computation and inference complexity. Therefore, we present a novel approach that performs a fixed number of edits only at the phrase level, followed by grammar correction and paraphrasing.

METHOD
Our method comprises two steps. First, we select sentences from the RCs that serve as extractive explanations for verifying claims (§3.1). We then apply a post-editing algorithm to the extractive explanations in order to improve their fluency and coherence (§3.2).

Selecting Sentences for Post-Editing
Supervised Selection. To produce supervised extractive explanations, we build models based on DistilBERT [31] for LIAR-PLUS and SciBERT [7] for PubHealth, to allow for direct comparison with Atanasova et al. [5], Kotonya and Toni [17]. We supervise explanation generation by the k greedily selected sentences from each claim's RCs that achieve the highest ROUGE-2 F1 score when compared to the gold justification. We choose k = 4 for LIAR-PLUS and k = 3 for PubHealth, the average number of sentences in the veracity justifications of the corresponding datasets. The selected sentences serve as positive gold labels, y^E ∈ {0, 1}^N, where N is the number of sentences in the RCs. We also use the veracity labels y^F for supervision. Following Atanasova et al. [3], we then learn a multi-task model f(X) = (p^E, p^F). Given the input X, comprised of a claim and the RCs, it jointly predicts the veracity explanation p^E ∈ R^N, which selects sentences for the explanation (i.e., maps them to {0, 1}), and the veracity label p^F ∈ R^m, with m = 6 for LIAR-PLUS and m = 4 for PubHealth. Finally, we optimise the joint cross-entropy loss function L_MT = H(p^E, y^E) + H(p^F, y^F).
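The greedy selection of supervision labels can be sketched as follows; the bigram-overlap ROUGE-2 F1 below is a simplified stand-in for a full ROUGE implementation, and all function names are illustrative:

```python
from collections import Counter

def rouge2_f1(candidate, reference):
    """ROUGE-2 F1: bigram overlap between candidate and reference."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    c, r = bigrams(candidate.split()), bigrams(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def greedy_oracle(rc_sentences, gold_justification, k):
    """Greedily add the RC sentence that most improves ROUGE-2 F1
    of the running selection against the gold justification."""
    selected = []
    while len(selected) < k and len(selected) < len(rc_sentences):
        best, best_score = None, -1.0
        for s in rc_sentences:
            if s in selected:
                continue
            score = rouge2_f1(" ".join(selected + [s]), gold_justification)
            if score > best_score:
                best, best_score = s, score
        selected.append(best)
    return selected
```

The k sentences returned by `greedy_oracle` would receive label 1 in y^E, all others 0.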
Unsupervised Selection. We also experiment with unsupervised selection of sentences to test the possibility of constructing fluent fact checking explanations in an entirely unsupervised way. Instead of the sliding-window approach used in Atanasova et al. [3], which lacks cross-window attention, we use a Longformer [8] model, which was introduced for tasks with longer inputs. We train a model h(X) = p^F to predict the veracity of a claim, optimise a cross-entropy loss function L_F = H(p^F, y^F), and select the k sentences with the highest saliency scores.
The saliency score of a sentence is the sum of the saliency scores of its tokens, where the saliency of a token is the gradient of the model's output w.r.t. the input token [33]. We select sentences using the raw gradients, as Atanasova et al. [2] show that different gradient-based methods yield similar results. As the selection can be noisy [15], we consider these experiments as complementary to the main supervised results.
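This sentence scoring can be sketched as follows, assuming the per-token saliencies (e.g., gradient magnitudes from a backward pass) have already been computed; all names are illustrative:

```python
def select_salient_sentences(sentences, token_saliencies, k):
    """sentences: list of token lists; token_saliencies: parallel list of
    per-token saliency values (e.g., gradient magnitudes of the model
    output w.r.t. each input token). A sentence's saliency is the sum of
    its tokens' saliencies; return the k highest-scoring sentences in
    their original order."""
    scores = [sum(sal) for sal in token_saliencies]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```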

Post-Editing
Our post-editing is completely unsupervised and operates on the sentences obtained in Sec. 3.1. It is a search algorithm that evaluates a candidate sequence p^c for a given input sequence, where the input sequence is p^E for supervised selection or p^E' for unsupervised selection. Below, we use p^E as a representative of both p^E and p^E'.
Given p^E, we iteratively generate multiple candidates by performing phrase-level edits (§3.2.1). To evaluate a candidate explanation, we define a scoring function that is a product of multiple scorers, also known as a product-of-experts model [13]. Our scoring function includes fluency and semantic preservation, and controls the length of the candidate explanation (§3.2.2). We repeat the process for T steps and select the best-scoring candidate from the final step as our output. We then use grammar correction (§3.2.4) and paraphrasing (§3.2.5) to further ensure conciseness and human readability.

Candidate sequence generation.
We generate candidate sequences by phrase-level edits. We use the off-the-shelf syntactic parser from CoreNLP [24] to obtain the constituency tree of a candidate sequence p^c. As p^c is long, we perform all operations at the phrase level. At each step t, our algorithm first randomly picks one operation, insertion, deletion, or reordering, and then randomly selects a phrase to apply it to.
For insertion, our algorithm inserts a <MASK> token before the randomly selected phrase and uses RoBERTa to evaluate the posterior probability of a candidate word [19]. This allows us to leverage RoBERTa's pre-training to insert high-quality words that fit the context of the overall explanation. Furthermore, inserting the <MASK> token before a phrase avoids breaking up other phrases within the explanation, thus preserving their fluency.
The deletion operation deletes the randomly selected phrase. For the reorder operation, we randomly select one phrase, which we call the reorder phrase, and randomly select n phrases, which we call anchor phrases. We swap the reorder phrase with each anchor phrase in turn and obtain n candidate sequences. We feed these candidates to GPT2 and select the most fluent candidate based on the fluency score given by Eq. 1.
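A minimal sketch of the three phrase-level operations, operating on a flat list of phrases rather than a full constituency tree; the RoBERTa infilling of the <MASK> slot and the GPT2 fluency ranking of the reorder candidates are left out:

```python
MASK = "<MASK>"

def insert_op(phrases, idx):
    """Insert a <MASK> slot before the phrase at idx; a masked LM such
    as RoBERTa would then be asked to fill the slot with a word."""
    return phrases[:idx] + [MASK] + phrases[idx:]

def delete_op(phrases, idx):
    """Delete the phrase at idx."""
    return phrases[:idx] + phrases[idx + 1:]

def reorder_op(phrases, reorder_idx, anchor_indices):
    """Swap the reorder phrase with each anchor phrase, yielding one
    candidate sequence per anchor; the most fluent candidate (per the
    fluency scorer) would then be kept."""
    candidates = []
    for a in anchor_indices:
        cand = list(phrases)
        cand[reorder_idx], cand[a] = cand[a], cand[reorder_idx]
        candidates.append(cand)
    return candidates
```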

Scoring Functions.
The fluency score (f_flu) measures the language fluency of a candidate sequence. We use a pre-trained GPT2 model [28] and compute the joint likelihood of the candidate p^c: f_flu(p^c) = Π_i P(p^c_i | p^c_1, ..., p^c_{i-1}) (Eq. 1). In Sec. 6.1, we evaluate the fluency of the generated explanations through human evaluation. Additionally, as the fluency score measures the likelihood of the text according to GPT2, which is trained on 40GB of Internet text, we assume that complex text that is uncommon or unlikely to appear on the Internet would also receive a lower fluency score. Hence, we expect that improving the fluency of an explanation leads to explanations that are more easily understood. We evaluate the latter in Sec. 5.1 through automated readability scores.
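A sketch of the fluency scorer, where a toy conditional log-probability function stands in for GPT2's next-token distribution; the length normalisation is our own assumption, added so that candidates of different lengths are comparable:

```python
import math

def fluency_score(tokens, cond_logprob):
    """Joint likelihood of a candidate under a left-to-right LM (Eq. 1),
    geometric-mean normalised by length. cond_logprob(prev, tok) is a
    stand-in for GPT2's conditional log-probabilities."""
    logp = sum(cond_logprob(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return math.exp(logp / max(len(tokens) - 1, 1))
```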
Length score (f_len). This score encourages the generation of shorter sequences and is proportional to the inverse of the sequence length: the longer a candidate, the lower its score. We assume that reducing the length of the generated explanation also benefits its readability, as it promotes shorter sentences, which are easier to read. To prevent over-shortening, we reject explanations with fewer than 40 tokens; this minimum number of tokens is a hyperparameter that we chose by tuning on the validation split.
For semantic preservation, we compute similarities at both the word and the explanation level between our source explanation p^E and the candidate sequence p^c at time step t. The word-level semantic scorer evaluates how much keyword information is preserved in the candidate sequence. Similarly to Li et al. [19], we use RoBERTa (R) [22], a pre-trained masked language model, to compute a contextual representation of the i-th word in an input sequence of words p = (p_1, ..., p_m) as R(p, p_i). We then extract keywords from p^E using Rake [30] and compute a keyword-level semantic similarity score, f_kw(p^E, p^c) = min_{w ∈ keywords(p^E)} max_j cos(R(p^E, w), R(p^c, p^c_j)), i.e., the lowest cosine similarity among all keywords, corresponding to the least matched keyword of p^E.
The keyword-level semantic similarity accounts for preserving the semantic information of the separate keywords used in the text.It is, thus, not affected by changes in words that do not bear significant meaning for the overall explanation.However, as this semantic similarity is performed at keyword-level it does not account for preserving the overall meaning of the text and the context that the keywords are used in.
Hence, we also employ an explanation-level semantic preservation scorer that measures the cosine similarity of two explanation vectors, i.e., explanation encodings that capture the overall semantic meaning of the explanation: f_expl(p^E, p^c) = cos(E(p^E), E(p^c)), where we use SBERT [29] to obtain the embeddings E(p^E) and E(p^c). Our overall semantic score is the weighted product of the word-level and explanation-level semantic scores, f_sem(p^E, p^c) = f_kw(p^E, p^c)^β · f_expl(p^E, p^c)^γ, where β and γ are hyperparameter weights for the separate scores. We evaluate the semantic preservation of the post-edited explanations with automated ROUGE scores (§5.2) and manual human annotations (§6.1, §6.2).
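A sketch of the combined semantic scorer over toy embedding vectors; in the paper the word vectors come from RoBERTa, the explanation vectors from SBERT, and beta/gamma are the hyperparameter weights:

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semantic_score(keyword_vecs_src, word_vecs_cand,
                   expl_vec_src, expl_vec_cand, beta=1.0, gamma=1.0):
    """Word level: each source keyword is matched to its most similar
    candidate word; the least matched keyword sets the score.
    Explanation level: cosine of the two explanation embeddings.
    Overall: weighted product of the two."""
    kw = min(max(cosine(k, w) for w in word_vecs_cand)
             for k in keyword_vecs_src)
    expl = cosine(expl_vec_src, expl_vec_cand)
    return (kw ** beta) * (expl ** gamma)
```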
Lastly, the Named Entity (NE) score (f_ent) is an additional approximation of meaning preservation that we include, since NEs hold the key information within a sentence. We first identify NEs using an off-the-shelf entity tagger and then count their number in a given explanation.
Our overall scoring function is the product of the individual scores, f(p^c) = f_flu(p^c)^α · f_sem(p^E, p^c) · f_len(p^c)^δ · f_ent(p^c)^ε, where α, δ, and ε are hyperparameter weights for the different scores. For each edit operation op, we use a separate threshold value r_op, which allows controlling the specific operations: r_op < 1 permits the selection of candidates p^c that score lower than the best candidate from the previous step. We tune all hyperparameters, including r_op, α, etc., on the validation split of the LIAR-PLUS dataset.
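A sketch of the overall product-of-experts score and the per-operation acceptance rule; the weight names alpha, delta, eps and the exact placement of the exponents follow our reading of the scoring function above:

```python
def overall_score(f_flu, f_sem, f_len, f_ent, alpha=1.0, delta=1.0, eps=1.0):
    """Product-of-experts score; alpha, delta, eps are hyperparameter
    weights (the semantic score carries its own internal weights)."""
    return (f_flu ** alpha) * f_sem * (f_len ** delta) * (f_ent ** eps)

def accept(candidate_score, previous_score, r_op):
    """A candidate produced by operation op is accepted when its score
    exceeds r_op times the previous best; r_op < 1 also admits slightly
    worse candidates, which helps the search escape local optima."""
    return candidate_score > r_op * previous_score
```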

Grammatical Correction.
Once the best candidate explanation is selected, we feed it to a language toolkit, which detects grammatical errors such as incorrect capitalization and irrelevant punctuation, and returns a corrected version of the explanation. Furthermore, to ensure that we have no incomplete sentences, we remove sentences without verbs from the explanation. These two steps further ensure that the generated explanations are fluent (evaluated in Sec. 6.1).

Paraphrasing.
Finally, to further improve fluency and readability, we use Pegasus [40], a model pre-trained with an abstractive text summarization objective. It focuses on the relevant parts of the input to summarise its semantics in a concise and readable way. Since we want our explanations to be both fluent and human-readable, we leverage Pegasus without fine-tuning on downstream tasks. This way, after applying our iterative edit-based algorithm with grammatical error correction and paraphrasing, we obtain explanations that are fluent, coherent, and human-readable.

Datasets
We use two fact checking datasets, LIAR-PLUS [38] and PubHealth [17]. These are the only two available real-world fact checking datasets that provide short veracity justifications along with claims, RCs, and veracity labels. Table 1 provides the size of each split in the corresponding dataset. The labels used in LIAR-PLUS are {true, false, half-true, barely-true, mostly-true, pants-on-fire}, and in PubHealth, {true, false, mixture, unproven}. While claims in LIAR-PLUS come only from PolitiFact, PubHealth contains claims from eight fact checking sources. PubHealth has also been manually curated, e.g., to exclude poorly defined claims. Finally, the claims in PubHealth are more challenging to read than those in LIAR-PLUS and other real-world fact checking datasets.

Models
Our experiments include the following models; their hyperparameters are given in Appendix C.
(Un)Supervised Top-N extracts sentences from the RCs, which are later used as input to our algorithm. The sentences are extracted in either a supervised or an unsupervised way (§3.1).
(Un)Supervised Top-N+Edits-N generates explanations with the iterative edit-based algorithm (§3.2.3) and grammar correction (§3.2.4). The model is fed with sentences extracted from the RCs in an (un)supervised way.
Atanasova et al. [3] is a reference model that trains a multi-task system to predict veracity labels and extract N explanation sentences, where N is the average number of sentences in the justifications of each dataset. Kotonya and Toni [17] is a baseline model that generates abstractive explanations with an average length of three sentences.
Lead-K [26] is a common lower-bound baseline for summarisation models. It selects the first K sentences of the RCs.

Evaluation Overview
We perform both automatic and manual evaluations of the models above. We include automatic measures for assessing readability (§5.1). While readability was not included in prior work, we consider it an essential quality of an explanation and thus report it. We further include automatic ROUGE F1 scores (overlap of the generated explanations with the gold ones, §5.2) for comparability with prior work and to ensure that our generated explanations do not shift much from the gold ones. In particular, we are interested in whether the ROUGE scores of the post-edited explanations differ significantly from those of the non-edited explanations; no significant difference would indicate preservation of the original content important for the fact check. We note, however, that the employed automatic measures are limited, as they are based on word-level statistics. ROUGE F1 scores especially should be taken with a grain of salt, as only exact word matches are rewarded with higher scores, while paraphrases and synonyms of words in the gold summary are not. Hence, we also conduct a manual evaluation following Atanasova et al. [3] to further assess the quality of the generated explanations with a user study. As manual evaluation is expensive to obtain, it is usually estimated based on small samples.

AUTOMATIC EVALUATION
As mentioned above, we use ROUGE F1 scores to compute overlap between the generated explanations and the gold ones, and compute readability scores to assess how challenging the produced explanations are to read.

Readability Results
Metrics. Readability is a desirable property for fact checking explanations, as explanations that are challenging to read would fail to convey the reasons for the chosen veracity label and would not improve the trust of end users. To evaluate readability, we compute the Flesch Reading Ease [14] and the Dale-Chall Readability Score [27]. The Flesch Reading Ease metric gives a text a score ∈ [1, 100], where a score ∈ [30, 50] requires college education and is difficult to read, a score ∈ (50, 60] requires a 10th to 12th school grade and is still fairly difficult to read, and a score ∈ (60, 70] is regarded as plain English, easily understood by 13- to 15-year-old students. The Dale-Chall Readability Score uses a specially designed list of words familiar to lower-grade students to assess the number of hard words used in a given text. It gives a text a score ∈ [9.0, 9.9] when it is easily understood by a 13th- to 15th-grade (college) student, a score ∈ [8.0, 8.9] when it is easily understood by an 11th- or 12th-grade student, and a score ∈ [7.0, 7.9] when it is easily understood by a 9th- or 10th-grade student. The scores presented are an average over the readability scores of the separate instances in the test split (see the supplemental material for results on the validation split). We additionally provide the 95% confidence interval for the average score based on 1000 random re-samples from the corresponding split.
Results. Table 2 presents the readability results. We find that our iterative edit-based algorithm consistently improves the reading ease of the explanations by up to 5.16 points and reduces the grade requirement by up to 0.30 points.
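The Flesch Reading Ease formula can be computed as follows; the vowel-group syllable counter is a rough approximation of the dictionary-based counters used by standard readability tools:

```python
import re

def count_syllables(word):
    """Crude syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(len(groups), 1)

def flesch_reading_ease(text):
    """Flesch Reading Ease = 206.835 - 1.015 * (words/sentences)
    - 84.6 * (syllables/words); higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short sentences with monosyllabic words score near the top of the scale, while long sentences of polysyllabic words can even score below zero.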

Furthermore, the improvements are statistically significant (p < 0.05) for both supervised and unsupervised explanations, except for the Dale-Chall score of the LIAR-PLUS unsupervised explanations, where the 95% confidence interval is still decreased compared to the non-edited explanations. Paraphrasing further significantly (p < 0.05) improves the reading ease of the text by up to 9.33 points and reduces the grade requirement by up to 0.48 points. It is also worth noting that the explanations produced by Atanasova et al. [3], as well as the gold justifications, are fairly difficult to read and can require college education to grasp, while the explanations generated by our algorithm can be easily understood by 13- to 15-year-old students according to the Flesch Reading Ease score.
Overall observations. Our results show that our method makes fact checking explanations less challenging to read and makes them accessible to a broader audience of up to 10th-grade students.

Automatic ROUGE Scores
Table 3. ROUGE-1/2/L F1 scores (§5.2) of supervised (Sup.) and unsupervised (Unsup.) methods over the test splits (for validation and ablations, see the appendix). In italics, we report results from prior work, for which we do not always have the outputs to compute confidence intervals. Underlined ROUGE scores of Top-N+Edits-N and Top-N+Edits-N+Para are statistically significantly different (p < 0.05) from the input Top-N ROUGE scores, N = {5, 6}.
These account for n-gram (1/2) and longest-common-subsequence (L) overlap between the generated and the gold justification. The scores are recall-oriented, i.e., they calculate how many of the n-grams in the gold text appear in the generated one.
Caveats. Here, ROUGE scores are used to verify that the generated explanations preserve information important for the fact check, as opposed to generating completely unrelated text. Thus, we are interested in whether the ROUGE scores of the post-edited explanations are close to, but not necessarily higher than, those of the sentences selected from the input RCs. It is worth noting that we include paraphrasing and insertion of new words to improve the explanations' readability, which, while preserving the meaning, necessarily results in lower ROUGE scores.
Results. Table 3 presents the ROUGE score results. First, comparing the results for the input Top-N sentences with the intermediate and final explanations generated by our system, we see that, while very close, the ROUGE scores tend to decrease. For PubHealth, we also see that the intermediate explanations always have higher ROUGE scores than the final explanations from our system. These observations corroborate two main assumptions about our system. First, our system preserves a large portion of the information that is important for explaining the veracity label and is also present in the justification. This is further supported by the observation that the decrease in the ROUGE scores is often not statistically significant (p < 0.05), except for some ROUGE-2 and one ROUGE-L score.
Second, the operations in the iterative editing and the subsequent paraphrasing allow for the introduction of novel n-grams, which, while preserving the meaning of the text, are not explicitly present in the gold justification, thus affecting the word-level ROUGE scores. We discuss this further in Sec. 7 and the appendix.
The ROUGE scores of the explanations generated by our post-editing algorithm when fed with sentences selected in an unsupervised way are considerably lower than those of the supervised models. This illustrates that supervision for extracting the most important sentences is important for obtaining explanations close to the gold ones. Finally, the systems' results are mostly above the Lead-N scores, with a few exceptions for the unsupervised explanations on LIAR-PLUS.
Overall observations. We note that while automatic measures can serve as sanity checks and point to major discrepancies between generated and gold explanations, related work on generating fact checking explanations [3] has shown that automatic scores to some extent disagree with human evaluation studies, as they only capture word-level overlap and cannot reflect improvements in explanation quality. Human evaluations are therefore conducted for most summarisation models [11, 35], which we include in Sec. 6.

MANUAL EVALUATION
As automated ROUGE scores only account for word-level similarity between the generated and the gold explanations, and the readability scores account only for surface-level characteristics of the explanations, we further conduct a manual evaluation of the quality of the produced explanations.

Explanation Quality
We manually evaluate two explanation types: the input Top-N sentences and the final explanations produced after paraphrasing (Edits-N+Para). We perform a manual evaluation of the test explanations obtained from supervised selection for both datasets, with two annotators for each. Both annotators have a university-level education in English.
Metrics. We show a claim, a veracity label, and the two explanations to each annotator and ask them to rank the explanations according to the following criteria. Coverage: the explanation contains the important, salient information for the fact check. Non-redundancy: the explanation does not contain information that is redundant, repeated, or irrelevant to the claim and the fact check. Non-contradiction: the explanation does not contain information contradictory to the fact check. Fluency: the explanation is grammatically correct and forms a coherent story. Overall: the overall explanation quality. We allow annotators to give the same rank to both explanations [3]. We randomly sample 40 instances and do not provide the annotators with information about the explanation type.
Results. Table 4 presents the human evaluation results for the first task. Each row indicates an annotator and the number of times they ranked an explanation higher for each criterion. Our system's explanations achieve higher acceptance for non-redundancy and fluency on LIAR-PLUS. The results are more pronounced for the PubHealth dataset, where our system's explanations were preferred on almost all metrics by both annotators. We hypothesise that PubHealth being a manually curated dataset leads to overall cleaner post-edited explanations, which annotators prefer.

Explanation Informativeness
Metrics. We also perform a manual evaluation for veracity prediction. We ask annotators to provide a veracity label for a claim and an explanation, where, as in the evaluation of explanation quality, the explanations are either our system's input or output. The annotators provide a veracity label for three-way classification: true, false, and insufficient (see the mapping to the original labels of both datasets in Appendix 8). We use 30 instances of each explanation type and perform the evaluation for both datasets, with two annotators for each dataset and instance.

Table 4. Manual annotation results of explanation quality with two annotators for both datasets. Each value indicates the relative proportion of times an annotator preferred a justification for a criterion. The preferred method, out of the input Top-N (Supervised) and the output of our method, Top-N+Edits-N+Para, is emboldened; Both indicates no preference.
Results. For LIAR-PLUS, one annotator gave the correct label 80% of the time for the input and 67% of the time for the output explanations. The second annotator chose the correct label 56% of the time using the output explanations and 44% of the time using the input explanations. However, both annotators found at least 16% of the explanations to be insufficient for veracity prediction (Table 5). For PubHealth, both annotators found every explanation to be useful for the task. The first annotator chose the correct label 50% and 40% of the time for the given input and output explanations, respectively. The second annotator chose the correct label in 70% of the cases for both explanation types. This corroborates that, for a clean dataset like PubHealth, our explanations help with the task of veracity prediction.

DISCUSSION
Results from our automatic and manual evaluation suggest two main implications of applying our post-editing algorithm to extracted RCs. First, the automatic ROUGE evaluation confirmed that post-editing preserves a large portion of the information that is contained in the gold explanation and is important for the fact check. This was further supported by our manual evaluation of veracity predictions, where the post-edited explanations were the most useful for predicting the correct label. We conjecture that this indicates our post-editing can be applied more generally to automated summarisation for knowledge-intensive tasks, such as fact checking and question answering, where the information needed for prediction has to be preserved.

Table 6. Example explanations: the extracted Top-5 RCs, the iterative editing (Top-5+Edits-5), and the latter with paraphrasing on top (Top-5+Edits-5+Para), taken from the test split of PubHealth. In the original table, colors designate the edit operations (reordering, deletion, and paraphrasing), and underlining marks the position in the text where the corresponding operation is applied in the next step (post-editing and paraphrasing).

Example 1.
Claim: The media covered up an incident in San Bernardino during which several Muslim men fired upon a number of Californian hikers. Label: False.
Top-5: Heavily-armed Muslims shouting "Allahu Akbar" open fire campers and hikers in a park. A heavily armed group of Middle Eastern looking Muslim men was arrested outside Los Angeles after opening fire upon hikers and campers in a large State Park in the area. There was no evidence found that a crime had been committed by any of the subjects who were detained and they were released. Also, the police report described the men only as "males," not "Middle Eastern males" or "Muslim males." The website that started this rumor was Superstation95, which is not a "superstation" at all but rather a repository of misinformation from Hal Turner, who in 2010 was sentenced to 33 months in prison for making death threats against three federal judges. No credible news reports made any mention of the "Allahu Akbar" claim, and no witnesses stated they had been "shot at" by the men while hiking or camping.
Top-5+Edits-5: Heavily-armed Muslims males shouting "Allahu Akbar" open fire in a park. A heavily armed group of Middle Eastern looking Muslim men was arrested after opening fire upon hikers and campers in a large State Park outside Los Angeles. There was no evidence found that a crime had been committed by any of the subjects on campers and hikers. Also, the police report described the men only as "," not "Middle Eastern" or "Muslim." The website that started this rumor was Superstation95, which is not a "superstation" at all but rather a repository of misinformation from Hal Turner, who in 2010 was sentenced to 33 months in prison for making death threats against three federal judges. No credible news reports made any mention of the "Allahu Akbar" claim, and no witnesses stated they had been "shot at".
Top-5+Edits-5+Para: Muslims shout "Allahu Akbar" open fire in a park. A heavily armed group of Middle Eastern looking Muslim men was arrested after opening fire on hikers and campers in a large State Park outside Los Angeles. There was no evidence that a crime had been committed by any of the campers or hikers. The website that started this rumor was Superstation95, which is not a "superstation" at all but rather a repository of misinformation from Hal Turner, who in 2010 was sentenced to 33 months in prison. There were no credible news reports that mentioned the Allahu Akbar claim, and no witnesses that said they had been shot at.
Original Explanation: Secondary reporting claiming that Muslim men fired upon hikers (and that the media covered it up) appeared on a site that had previously inaccurately claimed Illinois had applied Sharia law to driver's licenses, that Target introduced "Sharia-compliant" checkout lanes, and that Muslims successfully banned Halloween at a New Jersey school.

Example 2.
Claim: The CDC may have to stop calling COVID-19 an 'epidemic' due to a remarkably low death rate. Label: False.
Top-5: The article claims the CDC might have to stop calling COVID-19 an epidemic because the death rate is becoming so low that it wouldn't meet the CDC's definition of epidemic. The latest CDC statement made public when the Facebook post was made said deaths attributed to COVID-19 decreased from the previous week, but remained at the epidemic threshold, and were likely to increase. Moreover, death rates alone do not define an epidemic. Amid news headlines that the United States set a daily record for the number of new coronavirus cases, an article widely shared on Facebook made a contrarian claim. The CDC page says: "Epidemic refers to an increase, often sudden, in the number of cases of a disease above what is normally expected in that population in that area."
Top-5+Edits-5: The article claims the CDC might have to stop calling COVID-19 an epidemic in that population because the death rate is becoming so low that it wouldn't meet the CDC's definition of epidemic. The latest CDC statement an article from the previous week said deaths decreased, but. Moreover, death rates do not define an epidemic. Amid news headlines that the United States on Facebook set a daily record for the number of new coronavirus cases, The CDC page when the Facebook post was made says: "Epidemic refers to an increase, often sudden, in the number of cases of a disease attributed to COVID-19."
Top-5+Edits-5+Para: According to the article, the CDC might have to stop calling COVID-19 an epidemic because the death rate is so low that it wouldn't meet their definition of an epidemic. An article from the previous week said deaths decreased, but that's what the latest CDC statement says. Death rates do not define an epidemic. The CDC's page on Facebook says Epidemic refers to an increase, often sudden, in the number of cases of a disease attributed to COVID-19.
Original Explanation: Despite a dip in death rates, which are expected to rise again, the federal Centers for Disease Control and Prevention still considers COVID-19 an epidemic. Death rates alone don't determine whether an outbreak is an epidemic.
Second, with both the automatic and manual evaluation, we also corroborate that our proposed post-editing method improves several qualities of the generated explanations: fluency, conciseness, and readability. The latter supports the usefulness of the length and fluency scores, as well as of the grammatical correction and paraphrasing steps, in promoting these particular qualities of the generated explanations. Fluency, conciseness, and readability are important prerequisites for building trust in automated fact checking predictions, especially for systems used in practice, as Thagard [36] finds that people generally prefer simpler, more general explanations with fewer causes. They can also contribute to reaching a broader audience when conveying the veracity of a claim. A lack of conciseness and readability is also a downside of the long, in-depth RCs currently produced by professionals, which some leading fact checking organisations, e.g., PolitiFact,4 have slowly started addressing by including short overview sections for the RCs.
Table 6 further presents a case study from the PubHealth dataset. Overall, the initially extracted RC sentences are transformed to be more concise, fluent, and human-readable by applying the iterative post-editing algorithm followed by paraphrasing. We can also see that, compared to the original explanation, the post-edited explanations contain words that do not change the semantics of the explanation but would not be scored as correct by ROUGE. For example, in the second instance, "Death rates do not define an epidemic" in the post-edited explanation and "Death rates alone don't determine whether an outbreak is an epidemic" from the original explanation express the same meaning, but contain paraphrases and filler words that decrease the final ROUGE scores. Finally, compared to the original explanation, the post-edited explanations for both instances preserve the information needed for the fact check.
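This limitation of ROUGE can be illustrated with a minimal unigram-overlap (ROUGE-1) sketch; it is a simplification of the full ROUGE metric used in our evaluation, applied to the paraphrase pair discussed above:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

post_edited = "Death rates do not define an epidemic"
original = "Death rates alone don't determine whether an outbreak is an epidemic"
print(round(rouge1_f1(post_edited, original), 2))  # → 0.44
```

Although the two sentences express the same meaning, the unigram overlap yields a ROUGE-1 F1 of only about 0.44, since paraphrased words ("define" vs. "determine") and filler words are not matched.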

CONCLUSION
In this work, we present an unsupervised post-editing approach to improve extractive explanations for fact-checking.
Our novel approach is based on an iterative edit-based algorithm and rephrasing-based post-processing. In our experiments on two fact checking benchmark datasets, we observe, in both the manual and automatic evaluation, that our approach generates fluent, coherent, and semantics-preserving explanations.

B AUTOMATIC EVALUATION
In Table 11 and Table 10, we provide results over the validation splits of the datasets for the ROUGE and readability automatic evaluation. We additionally provide ablation results for components of our approach. First, applying Pegasus directly on the extracted sentences preserves a slightly larger amount of information than applying Pegasus on top of the iterative editing approach (up to 0.96 ROUGE-L points), but the readability scores are lower (up to 4.28 Flesch Reading Ease points). We also show results for the two parts of the Edits step: the iterative editing and the grammar correction. We find that the grammar correction improves the results by up to 8 ROUGE-L points and up to 8 Flesch Reading Ease points.
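For reference, the Flesch Reading Ease metric used in our readability evaluation combines average sentence length with average syllables per word. The sketch below uses a naive vowel-group heuristic for syllable counting, so its absolute values only approximate those of standard implementations:

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text.
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words).
    Syllables are approximated by counting vowel groups per word."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)

print(flesch_reading_ease("The cat sat on the mat."))
```

Short sentences with short words score high (the example above scores around 116), while long sentences full of polysyllabic words can score far lower, which is why trimming and simplifying the extracted RCs raises the readability scores.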

C EXPERIMENTAL SETUP C.1 Selection of Ruling Comments
For the supervised selection of RCs, as described in Section 3.1, we follow the implementation of the multi-task model of Atanasova et al. [3]. For LIAR-PLUS, we do not conduct fine-tuning, as the model is already optimised for the dataset.
For PubHealth, we change the base model to SciBERT, as the claims in PubHealth are from the health domain and previous work [17] has shown that SciBERT outperforms BERT in this domain. In Table 7, we show the results of the fine-tuning we performed over the multi-task architecture, with a grid search over the maximum text length and the weight of the positive sentences in the explanation extraction training objective. We finally select and use explanations generated with the multi-task model with a maximum text length of 1700 and a positive sentence weight of 5.
For the unsupervised selection of explanation sentences, we employ a Longformer model.We construct the Longformer model with BERT as a base architecture and conduct 2000 additional fine-tuning steps for the newly added cross-attention weights to be optimised.We then train models for both datasets supervised by veracity prediction.The most salient sentences are selected as the sentences that have the highest sum of token saliencies.
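Given per-token saliency scores from such a model, the sentence selection step can be sketched as follows; the interface is illustrative (the saliencies themselves would come from the trained Longformer), but the ranking logic is the sum-of-token-saliencies criterion described above:

```python
from typing import List, Tuple

def select_salient_sentences(
    sentences: List[List[Tuple[str, float]]],  # per sentence: (token, saliency)
    top_n: int,
) -> List[int]:
    """Rank sentences by the sum of their token saliencies and return
    the indices of the top-N sentences, kept in document order."""
    scores = [sum(sal for _, sal in sent) for sent in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_n])

# Toy example with dummy saliencies:
doc = [
    [("the", 0.1), ("claim", 0.2)],        # score 0.3
    [("evidence", 0.9)],                   # score 0.9
    [("no", 0.5), ("crime", 0.5)],         # score 1.0
]
print(select_salient_sentences(doc, top_n=2))  # → [1, 2]
```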
Finally, we remove long sentences and questions from the RCs (the resulting ROUGE score changes are illustrated in Table 8), which yields the Top-N sentences used as input for the post-editing method.
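A minimal sketch of this filtering step; the length threshold below is illustrative, not the value used in our experiments:

```python
from typing import List

def filter_rcs(sentences: List[str], max_tokens: int = 40) -> List[str]:
    """Drop questions and overly long sentences from the RCs.
    The max_tokens threshold here is an illustrative placeholder."""
    kept = []
    for sent in sentences:
        if sent.strip().endswith("?"):   # remove questions
            continue
        if len(sent.split()) > max_tokens:  # remove long sentences
            continue
        kept.append(sent)
    return kept
```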
These experiments were run on a single NVIDIA TitanRTX GPU with 24GB memory and 4 Intel Xeon Silver 4110 CPUs.Model training took ∼ 3 hours.

C.2 Iterative Edit-Based Algorithm
We used the validation split of LIAR-PLUS to select the best hyperparameters for both datasets. We use weights of 1.5, 1.2, 1.4, 0.95, and 1.0 for the respective components of our scoring function. We set the thresholds to 0.94 for reordering, 0.97 for deletion, and 1.10 for insertion. We keep all models (GPT-2, RoBERTa, and Pegasus) fixed and do not fine-tune them on any in-house dataset. We run our search algorithm on a single V100 32GB GPU for 220 steps, which takes around 13 hours per split for both datasets.
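As an illustration, the weighted combination of component scores can be sketched as a product of scores raised to their weights, as in related edit-based search methods; the component names below are illustrative placeholders, not the exact components of our scoring function:

```python
from typing import Dict

def candidate_score(components: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Combine per-component scores (e.g. fluency, semantic similarity,
    length) into one scalar via a weighted product. Components absent
    from the weights dict default to an exponent of 1.0."""
    score = 1.0
    for name, value in components.items():
        score *= value ** weights.get(name, 1.0)
    return score

# Toy example: a fluency weight of 2.0 penalises disfluent candidates harder.
print(candidate_score({"fluency": 0.5, "semantic": 0.8},
                      {"fluency": 2.0, "semantic": 1.0}))
```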

D NOVELTY AND COPY RATE
Table 9 presents additional statistics for the generated explanations from the test sets of both datasets. First, we compute how many of the words from the input Top-N RCs are preserved in the final explanation. We find that, after the final step of the post-editing process, up to 8% of the tokens from the RCs are not found in the final explanation. On the
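A sketch of how copy rate and novelty can be computed over unique whitespace tokens; the exact tokenisation used for Table 9 may differ:

```python
from typing import Tuple

def copy_and_novelty(explanation: str, rcs: str) -> Tuple[float, float]:
    """Copy rate: fraction of the explanation's unique tokens that also
    appear in the RCs. Novelty: the complementary fraction (1 - copy rate)."""
    exp_tokens = set(explanation.lower().split())
    rc_tokens = set(rcs.lower().split())
    if not exp_tokens:
        return 0.0, 0.0
    copy_rate = len(exp_tokens & rc_tokens) / len(exp_tokens)
    return copy_rate, 1.0 - copy_rate
```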

Table 1. Size of the fact checking datasets used in this work (§4.1).

3.2.3 Iterative Edit-based Algorithm. Given input explanations, our algorithm iteratively performs edit operations for N steps to search for a highly scored candidate (p_t). At each search step, it computes scores for the previous sequence (p_{t-1}) and the candidate sequence using Eq. 5. It selects p_t if its score is larger than that of p_{t-1} by a multiplicative factor r_op:
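Writing r_op for the operation-specific factor, the acceptance test at each search step can be sketched as:

```python
def accept_candidate(cand_score: float, prev_score: float, r_op: float) -> bool:
    """Accept the edited sequence if its score exceeds the previous
    sequence's score by the operation-specific multiplicative factor r_op."""
    return cand_score > r_op * prev_score
```

With the thresholds from Appendix C.2, an insertion must improve the score by 10% to be accepted (r_op = 1.10), whereas a reordering may be accepted even with a slight score decrease (r_op = 0.94).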

Table 9. Copy rate from the ruling comments, novelty w.r.t. the ruling comments, and coverage (% of words in the explanation that are found in the justification).