Towards Generative Event Factuality Prediction

We present a novel end-to-end generative task and system for predicting event factuality holders, targets, and their associated factuality values. We perform the first experiments using all sources and targets of factuality statements from the FactBank corpus. We perform multi-task learning with other tasks and event-factuality corpora to improve on the FactBank source and target task. We argue that careful domain-specific formatting of the target text output in generative systems is important, and verify this with multiple experiments on target text output structure. We revisit previous state-of-the-art author-only event factuality experiments and also offer insights towards a generative paradigm for the author-only event factuality prediction task.


Introduction
The term factuality refers to the author's or speaker's presentation of an event as factual, i.e. as an event that has happened, is happening, or will happen. Often, an author does not only talk about what they believe is factual, but also about what others believe is factual. Thus, when a speaker presents an event, they communicate their view of the factuality of the event, and they can also at the same time attribute a factuality judgment about the same event to another source. Over the past 15 years, the task of event factuality prediction has received a lot of attention, but only in predicting the factuality of an event according to the author's presentation. Multiple corpora have been created, alongside multiple machine learning architectures, which solely focus on predicting the author's presentation of factuality.
An exception is the FactBank corpus (Saurí and Pustejovsky, 2009), which not only annotates the author's presentation of factuality, but also annotates the nested sources assigning factuality values to events in text. In this paper, our goal is to predict the presentation of factuality of the nested sources mentioned in a text alongside their target events. We choose the FactBank corpus (Saurí and Pustejovsky, 2009) as it is the only corpus annotating nested source factuality, and it is carefully annotated and constructed. We attempt combinations with other corpora, namely author-only event factuality corpora and source and target cognitive state corpora, to improve on predicting nested source and target factuality. We perform all of these experiments with a novel generative approach and create a new version of the event factuality prediction task.
There are four main contributions of this work: (i) We are the first to present a subset of the FactBank dataset containing nested source and target factuality. This allows us to define two related tasks with associated datasets, source-and-target factuality and author-only factuality. We create a database of the complex FactBank corpus for public release.
(ii) We are the first to present a generative machine learning architecture for the factuality prediction task. We perform multiple experiments with factuality structure and target generated text structure, and offer insights into how to frame the event factuality prediction task as a text generation task.
(iii) We perform multi-task learning to improve on both factuality tasks. We offer a detailed evaluation of which combinations work and why.
(iv) We achieve state-of-the-art results in an end-to-end setting for the FactBank source-and-target and author-only factuality tasks.
We first present the problem we are solving (Section 2). We then present a survey of previous work (Section 3). In Section 4, we present the FactBank database architecture. Section 5 details our experimental setup and generative modeling framework. Finally, in Sections 6 and 7 we report experiments on the FactBank source-and-target and author-only tasks, respectively.

Background and Motivation
To understand the notion of factuality, consider the following sentence from the FactBank corpus (we have replaced a pronoun for clarity in this exposition). This sentence reports on three events: a selling event, a saying event, and a doubling event. Note that, in this paper, we are not interested in temporal relations, and the notion of factuality applies independently of whether an event is in the past, happening at utterance time, or in the future.
(1) Michael Wu sold the property to five buyers and said he'd double his money.
We can identify four different factuality claims in this sentence: 1. The author is presenting the selling event as factual, i.e., they are committed to the selling event having happened.
2. The author is presenting the saying event as factual, i.e., they are committed to the saying event having happened.
3. The author is presenting the doubling event as having an unknown factuality.
4. The author is presenting Michael Wu as presenting the doubling event as factual, i.e., according to the author, Michael Wu is committed to the doubling event happening.
The first three are claims from the author's perspective, while the last one is from Wu's perspective. We refer to the bearer of the perspective as the source, and the event (or state) that the factuality judgment is about as the target. FactBank, following MPQA (Wiebe et al., 2005a; Deng and Wiebe, 2015), represents the source of a factuality judgment as an ordered list of sources, since the sentence does not directly tell us about Michael Wu's factuality judgment, but rather the author's claim about Michael Wu's factuality judgment. In this paper, we do not address the explicit reconstruction of such attribution chains.
In the above example, we have seen two factuality values: certain factual and unknown. We can identify additional values by allowing for non-certain factuality (something may have happened). (FactBank divided this category into the probable and the possible, but this leads to data fragmentation, and it can also be hard for humans to distinguish these two cases.) In NLP, there is a distinct task of determining whether a statement is true or not (fact checking). Unfortunately, this other task is sometimes also called "factuality prediction" (see, for example, Baly et al. (2018)). The difference is that we are interested in how the author presents the event, not in ground truth. So despite the same or similar name, these are two different tasks, and we only deal with the presentation task, not the ground-truth task.

Related Work
Author-Only Factuality Corpora All event-factuality corpora focus on the presentation of factuality according to the author of the text, with the exception of FactBank, which also annotates the factuality of the mentioned sources besides the author. These corpora include LU (Diab et al., 2009), UW (Lee et al., 2015), LDCCB (LDC) (Prabhakaran et al., 2015), MEANTIME (MT) (Minard et al., 2016), MegaVeridicality (MV) (White et al., 2018), UDS-IH2 (UD2) (Rudinger et al., 2018), CommitmentBank (CB) (De Marneffe et al., 2019), and RP (Ross and Pavlick, 2019). These corpora mainly differ as to what is defined as an annotatable event, the genre of the text, the type of annotators, and the annotation scale. These corpora were unified under a continuous annotation scale in the range [-3, 3] by Stanovsky et al. (2017) (though the author-only factuality value in FactBank was misinterpreted; see Murzaku et al. (2022) for details).
FactBank The main focus of this paper is the FactBank corpus, which annotates all events introduced in a corpus of exclusively newswire text. The FactBank corpus not only annotates the factuality presented by the author of a text towards an event, but also the factuality of events according to their presentation by sources mentioned inside of the text. Saurí and Pustejovsky (2012) were the first to investigate and perform experiments on the source and target annotations in FactBank. However, we cannot perform an apples-to-apples comparison, as their system neither recognizes events nor identifies sources mentioned in the text. Rather, in their evaluation, this information was created from manual annotation, fed to the system, and then tested on the whole FactBank corpus.
We choose to focus on FactBank because of its expert-level annotations and its detailed source and target annotations. Because of the complexity of the FactBank corpus, we build a robust and efficient database representation of FactBank, which includes all sources including the author, the targets of the factuality attributions, and their respective relations.
Machine Learning Architectures All previous approaches to the event-factuality prediction task use author-only corpora and predict factuality according to the author of the text. Early approaches to the event factuality prediction task used rule-based systems or lexical and dependency-tree-based features (Nairn et al., 2006; Lotan et al., 2013). Expanding on these rule-based approaches, other work on the event factuality prediction task used SVMs alongside these dependency tree and lexical features (Diab et al., 2009; Prabhakaran et al., 2010; Lee et al., 2015; Stanovsky et al., 2017). Early neural work includes LSTMs with multi-task or single-task approaches (Rudinger et al., 2018) or using BERT representations alongside a graph convolutional neural network (Pouran Ben Veyseh et al., 2019). Jiang and de Marneffe (2021) expand on these previous works by using other event factuality corpora in multiple training paradigms while also introducing a simpler architecture. These previous neural approaches evaluate on Pearson correlation and mean absolute error (MAE). In previous work, we provide the first end-to-end evaluation using F-measure of the author-only event factuality prediction task (Murzaku et al., 2022).
Our work differs from previous work in two major ways: first, we are the first to provide a novel end-to-end generative approach for the event factuality prediction tasks (both author-only and source-and-target). Furthermore, besides our own previous work (Murzaku et al., 2022), all previous works assumed gold event heads. Our system is by default end-to-end, making it usable in real-world applications. Second, we perform experiments on the factuality of nested sources and target events, while other works only focused on the presentation of factuality according to the author.
ABSA and ORL Two tasks close in formulation to our task, and from which we adopt ideas and insights, are the aspect-based sentiment analysis (ABSA) task and the opinion role labelling (ORL) task. Peng et al. (2020) create the aspect sentiment triplet extraction task to predict triplets consisting of aspects, opinions, and sentiment polarity. Zhang et al. (2021) are the first to use a generative approach for ABSA, fine-tuning on T5. Expanding on this, Gao et al. (2022) achieve state-of-the-art results on all ABSA corpora using a multi-task learning approach through task-specific prompts. The ORL task aims to discover opinions, the sources of opinions, and the associated targets of opinions using the MPQA 2.0 corpus (Wiebe et al., 2005b). Xia et al. (2021) build an end-to-end system creating span representations and using a multi-task learning framework. They achieve state-of-the-art results in the end-to-end setting on the exact match F1 metric.

FactBank Database
We present a generalized database structure for capturing cognitive states expressed in language. The goal is to unify multiple annotated corpora in one format, and to make it simple for users to extract the information they need in various formats. In this paper, we describe only how we use it to hold the annotations of event factuality corpora, and of FactBank in particular, whether in the author-only perspective or the source-and-target perspective. However, given the diversity of corpora, with each corpus having its own focus, annotation rules, and annotation styles, our database structure is sufficiently broad and abstract to accommodate various corpora equally well and yet preserve the richness of information that each corpus offers, so as to facilitate combining corpora in future experiments with as little data loss as possible. Our goal of preserving the distinct details of individual corpora serves as a step in the direction of bringing human knowledge to bear upon otherwise black-box machine learning techniques.
As an example, consider the FactBank and LU (Diab et al., 2009) corpora. The LU data was published as GATE-formatted XML files with annotation targets and annotations given in XML elements, whereas FactBank was published as a set of text files, each of which represents a relation in what amounts to a relational database. From both of these data sources, we may want to construct, for each training and testing example, a set of triples (sentence, target-marked-elements, label), where target-marked-elements are the tokens of the sentence that describe the target of the factuality judgment by the author, and to which the label refers. If we used the original FactBank release and created a database from it, eliciting triples satisfactory for machine learning would require a complex query with many joins and filters. This is because the structure of the FactBank (implicit) database is oriented toward event-time relations rather than factuality labeling. Accordingly, we designed a new database structure more amenable to queries supporting machine learning, and developed code to translate corpora, including FactBank, into this database model.
Database Structure To build the unified database, we needed a stable, fast, and lightweight tool. Python's extensive library support for SQLite database interactions fit those requirements. The unified database's schema is composed of four tables: sentences, mentions, sources, and attitudes. We provide a graphic of the database schema in Appendix C.
The sentences table stores each sentence and any relevant identifying metadata. Thus far, we have not encountered any corpora with suprasentential information encoded as labels. In principle, however, this table can be refactored to accommodate possible future suprasentential information.
Elements within each sentence marked for labeling are stored in the mentions table, with an entry being composed of the surface text of the element, which may be one or more tokens, and its character offset within the sentence. Each sentence may contain more than one marked element.
The sources table represents not only sources but also their possible nested relations within sentences. These "according-to" relations form a list, as in Mary said that John said that Jane was coming to dinner. Here, the embedded source for the coming event is (Author → Mary → John). These "according-to" relations may also form a tree, as in Mary said that John said that Jane was coming to dinner, but Bob said that she was not. Here, the coming event has two embedded sources, (Author → Mary → John) and (Author → Bob). The author may have more than one child source, as in Mary said that John was coming to dinner, but Bob said that John was staying home. Here, we have (Author → Mary) as source for the coming event, and (Author → Bob) as source for the staying event.
Each sentence may have more than one source, but each source has at most one mention. The implied author has no mention, and a named source mentioned repeatedly is listed once for each mention, since we do not apply anaphora resolution.
Finally, the attitudes table aggregates a sentence, its marked elements, and the factuality or sentiment label; the table accommodates both label types but could be refactored to support further ones. Each source may have a distinct attitude toward each of several targets, and each target may have more than one source, each with its own attitude toward that target. Thus, each source-target pair drawn from mentions has a single listing in attitudes.
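The four-table layout described above can be sketched as a SQLite schema. This is a minimal illustration, not the released implementation: the table names match the text, but the column names (beyond what the text describes) are our own illustrative assumptions.

```python
import sqlite3

# Illustrative sketch of the four-table schema; column names beyond the
# described content (surface text, offsets, labels) are assumptions.
SCHEMA = """
CREATE TABLE sentences (
    sentence_id INTEGER PRIMARY KEY,
    metadata    TEXT,   -- identifying metadata, e.g. file and sentence index
    text        TEXT
);
CREATE TABLE mentions (
    mention_id  INTEGER PRIMARY KEY,
    sentence_id INTEGER REFERENCES sentences(sentence_id),
    surface     TEXT,   -- one or more tokens
    char_offset INTEGER -- offset within the sentence
);
CREATE TABLE sources (
    source_id   INTEGER PRIMARY KEY,
    sentence_id INTEGER REFERENCES sentences(sentence_id),
    mention_id  INTEGER REFERENCES mentions(mention_id),  -- NULL for implied author
    parent_id   INTEGER REFERENCES sources(source_id)     -- "according-to" chain/tree
);
CREATE TABLE attitudes (
    attitude_id INTEGER PRIMARY KEY,
    source_id   INTEGER REFERENCES sources(source_id),
    target_id   INTEGER REFERENCES mentions(mention_id),
    label       TEXT    -- factuality or sentiment label
);
"""

def build_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The self-referential parent_id column in sources encodes the nested "according-to" chains (e.g. Author → Mary → John) as a tree rooted at the implied author.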
Using event factuality corpora annotated for source-and-target factuality is inherently complex and requires structure induction, source linking, and complex database-like operations. Our database structure is an initial step toward addressing the complexity of these corpora while also providing easy-to-use software for corpus projection and conversion. Our database for FactBank is available at https://github.com/t-oz/FactBankUniDB.

Task Definitions and Data
Source and Target Factuality (STF) We define the source-and-target factuality task conceptually as the task of generating all (source, target, factuality label) triplets for a given input sentence, such that the source is not the author, the factuality label belongs to a categorial scale, and the source views the target with the given factuality label.

Author-Only Factuality (AOF) We define the author-only factuality task conceptually as the task of generating all (event, factuality label) pairs for a given input sentence, such that the factuality label belongs to a categorial scale, and the author views the event with the given factuality label.
For each task, we have created a separate disjoint projection from the full FactBank database.We provide information about these projections in Table 3.

Representation of Factuality
Previous work represented factuality on a continuous [-3, 3] scale or directly used the categorial factuality labels from FactBank. We convert the categorial and numerical representations of FactBank to words. We use the word values shown in Table 1 for all experiments containing factuality values, as using words leads to better task-specific embeddings and therefore to better performance (on average 5% for our baseline FactBank source and target experiments).
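The conversion can be sketched as a simple lookup from FactBank's categorial labels to word values. The word values below are illustrative placeholders only; the actual values used are those in Table 1.

```python
# Hypothetical label-to-word mapping; the real word values are in Table 1.
LABEL_TO_WORDS = {
    "CT+": "certain positive",
    "CT-": "certain negative",
    "PR+": "probable positive",
    "PR-": "probable negative",
    "UU":  "unknown",
}

def label_to_words(label: str) -> str:
    """Map a FactBank categorial label to its word representation."""
    return LABEL_TO_WORDS[label]
```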

Input/Output Formats
We define our input x as the raw text and prepend a task prefix p depending on the task of choice. We use a distinct task prefix for each task so that the backbone language model can distinguish between different tasks. For each sub-task that we perform, we define separate target output formats.
Tuple Representation We represent the target as tuples. We use example (1) above to show how this data is represented. For the STF task, the output is a list of (source, target, factuality label) triplets. Note that an in-line annotation format does not work for the STF task, because it relates two distinct sentence elements to a factuality value.
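The input and target construction can be sketched as follows. The exact prefix string, triplet syntax, and factuality word are our own illustrative assumptions; only the triplet structure (source, target, factuality label) comes from the task definition.

```python
# Sketch of input/target construction for the STF task.
# Prefix text and triplet syntax are hypothetical implementation choices.
def make_input(sentence: str, prefix: str = "source and target factuality: ") -> str:
    """Prepend the task prefix so the model can distinguish tasks."""
    return prefix + sentence

def make_target(triplets) -> str:
    """Serialize (source, target, factuality) triplets as the target text."""
    return "; ".join(f"({s}, {t}, {f})" for s, t, f in triplets)

sent = "Michael Wu sold the property to five buyers and said he'd double his money."
target = make_target([("Michael Wu", "double", "certain positive")])
```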

Flan-T5
For all experiments, we use the encoder-decoder pre-trained Flan-T5 model (Chung et al., 2022). The Flan-T5 model yields significant improvements on many tasks over the T5 model (Raffel et al., 2020) by adopting an instruction fine-tuning methodology. By formulating the STF and AOF tasks as text generation tasks, we can create end-to-end models without a task-specific architecture design.

Multi-task Learning
Models like T5 and Flan-T5 are multi-task in nature due to their pre-training objectives. In the pre-training of T5 (Raffel et al., 2020), T5 was trained with a mixture of tasks separated by task-specific prefixes. We perform multi-task learning experiments by prepending task-specific prefixes for each task, as mentioned in Section 5.1. Furthermore, we also perform proportional mixing, sampling in proportion to dataset size.
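Proportional mixing can be sketched as pooling all prefixed examples and shuffling, so that each dataset contributes examples in proportion to its size. This is a simplified sketch under our own assumptions about the data layout, not the authors' implementation.

```python
import random

# Sketch of proportional mixing: pool all prefixed examples and shuffle,
# so each task is sampled in proportion to its dataset size.
def proportional_mixture(datasets, seed=42):
    """datasets: dict mapping a task prefix to a list of (input, output) pairs."""
    pool = [(prefix + ex_in, ex_out)
            for prefix, examples in datasets.items()
            for ex_in, ex_out in examples]
    random.Random(seed).shuffle(pool)
    return pool
```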

Experiments: Source and Target
In this section, we perform experiments on the STF task. We evaluate exclusively on FBST. Our goal is to achieve the best results on this projection of the corpus.

Experimental Setup
Datasets and Target Structure We first offer baselines on the FactBank source and target projection (FBST henceforth). We then perform experiments on the target output structure to determine how much influence it has on results. Finally, we perform multi-task learning experiments with the author-only projection of FactBank, CB (De Marneffe et al., 2019), MPQA (Wiebe et al., 2005b), and UW (Lee et al., 2015). All experiments are performed using the STF paradigm defined in Section 5.1, where our task is to generate lists of triplets of the format (source, target, factuality label).
Evaluation Our main method of evaluation is the exact match F1 metric. With this metric, a prediction is only correct if all three elements of the triplet match. This metric is directly equivalent to micro-F1, but we refer to it as exact match F1 in this paper. Furthermore, to assess how much each corpus combination contributes to the source and target matching of the triplet, we offer F1 scores for the source, the target, and the source and target combination.
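The exact match F1 described above can be sketched in a few lines: a predicted triplet counts as a true positive only when every element matches a gold triplet.

```python
# Sketch of exact match F1 over triplets: a prediction is correct only if
# all elements (source, target, factuality) match a gold triplet exactly.
def exact_match_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # fully matching triplets
    p = tp / len(pred) if pred else 0.0        # precision
    r = tp / len(gold) if gold else 0.0        # recall
    return 2 * p * r / (p + r) if p + r else 0.0
```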
Experiment Details We use a standard fine-tuning approach on Flan-T5. We fine-tune our models for at most 10 epochs with a learning rate of 3e-4, with early stopping used if the triplet F1 does not increase on the dev set. All experiments are averaged over three runs using fixed seeds (7, 21, and 42). We also report the standard deviation over the three runs. We leave further experimental details to Appendix B.
Text Normalization Following insights and methodology from Zhang et al. (2021), we apply their text normalization strategy in our experiments (denoted NoN for no normalization, N for normalized). Zhang et al. (2021) found that text normalization helps for detecting aspect and opinion phrases in (aspect, opinion, sentiment) triplets, mainly by producing the correct morphology of a word and by addressing orthographic alternatives to words. Their method finds the replacement word from a corresponding vocabulary set using the Levenshtein distance. We note that in our experiments, most of the improvements that normalization yielded were due to correcting morphological errors (e.g., gold is houses, model predicts house) or capitalization errors (gold is Mary, model predicts mary).
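The normalization step can be sketched as follows: any generated word not in the vocabulary is replaced by the vocabulary word at minimal Levenshtein distance. This is a simplified sketch of the strategy described above, not the original implementation.

```python
# Sketch of Levenshtein-based normalization: replace out-of-vocabulary
# generated words with the closest vocabulary word by edit distance.
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize(word: str, vocab: set) -> str:
    """Return word if in vocab, else the closest vocabulary word."""
    if word in vocab:
        return word
    return min(vocab, key=lambda v: levenshtein(word, v))
```

This handles exactly the error types noted above: normalize("mary", vocab) recovers the capitalized Mary, and normalize("house", vocab) recovers houses when only the plural occurs in the source sentence.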

Results: Baseline and Target Output Restructuring
Baselines Table 2 shows our baseline results for the FactBank source and target projection. We notice some particular trends in this task and offer insights. First, we see that normalization helps. For our baseline FBST NoN experiment, we report a triplet F1 of 0.472, whereas after normalization, the triplet F1 increases to 0.512. Intuitively, normalization helps most for sources. One of the main benefits of normalization is producing the correct morphology and orthography. We find that FactBank sources are often nouns or proper nouns, and normalization ensures the correct orthography. Furthermore, we see that source outperforms target in all cases and that labelling the correct source and target pairs is not a trivial task. These results are similar to Xia et al. (2021), who worked on the MPQA corpus, which annotates opinions (i.e., text passages indicating opinions), sources of opinions, and the targets of these opinions. The authors found that matching MPQA sources to opinions is far easier than matching MPQA targets to opinions.

Attribute-Value (AV) Addition
In Table 2, we also report results of experiments where we use the attribute-value (AV) format for the output. This formatting especially helps with disambiguation of the sources, targets, and factuality values, providing our generative framework with deeper contextual understanding and cues for triplet generation. We find that this output format produces large increases in all measures, namely the triplet F1, source F1, and source and target F1. Once again, we see that normalization helps, achieving our highest baseline triplet F1 of 0.535. Because of the success of this target format restructuring (AV) and normalization (N), we perform the remaining experiments in this paper using the AV output format and the normalization step.
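The AV format can be sketched as serializing each triplet with explicit attribute names, which is what gives the decoder its extra cues. The attribute names and delimiter syntax below are illustrative assumptions modeled on the (source = ..., target = ..., factuality label) examples in this paper.

```python
import re

# Sketch of the attribute-value (AV) output format; attribute names and
# separators are hypothetical, modeled on the paper's examples.
def to_av(source: str, target: str, factuality: str) -> str:
    return f"(source = {source}, target = {target}, factuality = {factuality})"

AV_RE = re.compile(r"\(source = (.*?), target = (.*?), factuality = (.*?)\)")

def parse_av(text: str):
    """Recover (source, target, factuality) triplets from generated AV text."""
    return AV_RE.findall(text)
```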

Results: Multi-task learning experiments
We perform multi-task learning (MTL) experiments using author-only factuality corpora, opinion role labelling corpora, and combinations of all of them. Following our approach described in Section 5.4.1, we prepend task-specific prefixes to our tasks, such as author only factuality: or opinion role label:. We mirror the format of our FactBank source and target examples for our MTL experiments. For example, when we add the author-only factuality data, we structure our targets as (target = event, factuality label), mirroring the format of our source and target data. Similarly, for other corpora such as MPQA, which only contain source and target information without any factuality labels, we structure our data as (source = opinion source, target = opinion target). We aim to tackle the following with our MTL experiments: first, we aim to improve target identification. Our FBST-only system performs worse on identifying targets than sources. To address this, we combine with author-only event factuality corpora, namely FactBank (denoted FBAO), and CB and UW, which both annotate events in a similar structure and genre as FactBank. Second, we aim to improve source and target linking, as the FBST-only system struggles with this sub-task. We attempt to address this using the Xia et al.
(2021) projection of the MPQA corpus, which annotates opinion sources and opinion targets. We also attempt an experiment with a direct mirroring of the source and target representation when using the FactBank author-only data (we denote this representation as FBAO*). Here, we explicitly state the author of the text as a source, structuring the target text to be generated as (source = AUTHOR, target = event, factuality label). Results for our MTL experiments are shown in Table 4. We see that all corpus combinations besides MPQA help for the triplet F1 metric. Most notably, we find that adding the FactBank author-only data (FBAO), and in particular the triplet FactBank author-only projection (FBAO*), helps the most, especially for the target and source+target F1. We note, though, that the triplet F1 results for FBST with FBAO and FBAO* both have rather large standard deviations, so the difference may not be significant. Adding other author-only factuality corpora such as UW and CB helps, but not as much as FactBank. We see that CB does not boost performance much on FactBank, and UW actually helps more for the triplet F1 metric. This may be because we are performing a separate task and using a different machine learning paradigm. MPQA does not help for any metric besides the source metric. Opinion role labelling is a separate task and appears to be incompatible with the source and target factuality task. However, we note that MPQA also annotates targets differently from FactBank, which explains why the MTL approach did not help in this case.

Experiments: Author Only
In this section, we perform experiments on the AOF task. We evaluate exclusively on FBAO, performing our experiments with the same model and training paradigm. We use three styles of target representation mentioned in Section 5: one style where we extract event words and their associated factuality values as tuples, an in-line annotation style used by Zhang et al. (2021), and finally an MTL triplet generation task with the source and target projection of FactBank, where we generate triplets of the format (source = AUTHOR, target = target event, factuality label). Furthermore, we also factor polarity into our experiments. Murzaku et al. (2022) found that separately predicting polarity and factuality for the event factuality task can lead to error reductions, since polarity is often expressed independently of the degree of factuality. We treat the addition of polarity as a triplet generation task, generating triplets of the format (target = target event, factuality label, polarity). We reduce the factuality label to the strength of factuality (true, possibly true, unknown), with the polarity being one of (negative, unknown, positive).
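The polarity factoring above can be sketched as decomposing a combined factuality label into a (strength, polarity) pair. Collapsing the probable and possible modalities into "possibly true" is our assumption here, following the merged category discussed earlier; labels outside this small set are not covered by the sketch.

```python
# Sketch of factoring a FactBank-style label into (strength, polarity),
# using the value sets stated in the text: strength in {true, possibly true,
# unknown}, polarity in {positive, negative, unknown}.
def factor(label: str):
    if label == "UU":
        return ("unknown", "unknown")
    modality, sign = label[:-1], label[-1]
    strength = {"CT": "true", "PR": "possibly true", "PS": "possibly true"}[modality]
    polarity = {"+": "positive", "-": "negative"}[sign]
    return (strength, polarity)
```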

End-to-End Author-Only Factuality
We follow the end-to-end evaluation setup on FactBank as we did in (Murzaku et al., 2022), evaluating on per-label F1 and macro-F1. Because our system is end-to-end, we cannot evaluate on Pearson correlation or MAE like previous event factuality papers that assumed gold heads. For an apples-to-apples comparison, we use the same label mappings as Murzaku et al. (2022). We average over three runs and also report the standard deviation, which the previous authors did not report.
Table 5 shows results for our experiments on FactBank author-only (FBAO), FBAO with an in-line annotation target format (FBAO-Anno), FBAO as a triplet generation task that includes polarity (Pol), and finally FBAO as an MTL triplet generation task with the source and target projection of FactBank, tested on FBAO (FBAO*, FBST). We note the very high standard deviations in the PR+ and PR- measurements; these labels are rare even after collapsing them into the same class, especially in the test set, which explains the extreme standard deviation fluctuations. Our baseline system (FBAO) yields a noticeable increase in the CT+, UU, and CT- labels compared to previous results, but performs worse on the PR+ and PR- labels. The in-line annotation text generation task performs better on macro-F1 than the baseline tuple generation task, with a notable increase in CT-. Factoring polarity helps as well: for both configurations, factoring polarity leads to an increase, and achieves a new SOTA for the PR- label in our FBAO-Anno-Pol setup. Our best performing result is our multi-task learning on FBAO and FBST, where we modify FBAO to include the author as a source in its triplet representation. We achieve a new SOTA on macro-F1, a large increase and SOTA on the CT+ label, and SOTA on UU.

FBAO: Exact Match Evaluation
To be able to compare performance on the STF and AOF tasks, we evaluate using the same metrics as in Section 6, specifically tuple/triplet exact match precision, recall, F1, and target F1. This evaluation corresponds to a micro-F1, as it does not depend on the factuality value. In this evaluation, we do not consider source F1 or source and target F1, because the source is the author of the text. We aim to quantify how well our generative system performs at generating author-only structures, and therefore evaluate using an exact match evaluation. We are the first to report results on FactBank using an exact match evaluation.
Table 6 shows results for our exact match evaluation on FBAO. We see two clear trends: first, the in-line annotation generation task does not perform as well in our exact match evaluation compared to our tuple/triplet generation task. This makes sense given that the Anno option performs markedly worse on the most common factuality value, CT+, which in the macro-average is compensated by better performance on other values, but in the exact-match evaluation lowers its overall performance. Our best results are produced by our MTL setup with FBAO and FBST (FBAO*, FBST). Similar to our source and target results in Table 4, we see that the AOF task benefits from the FBST data in an MTL setup, performing the best once again. We also see, as expected, that the AOF task is easier than the STF task, with a result margin of 13.3% absolute, since fewer details need to be predicted and since more data is available.

Conclusion
We provide a new generative framework for the event factuality prediction task using Flan-T5, focusing on output format, individual task prefixes, and multi-task learning. To tackle the complexity of the FactBank corpus, we create a database representation that simplifies extracting sources, targets, and factuality values for all projections of FactBank, which we will publicly release. Our source-and-target experiments show that careful output formatting can yield improvements (Table 2) and that careful attention to multi-task learning mixtures can help (Table 4). We evaluate the author-only event factuality task using both macro-average (Table 5) and exact-match (Table 6) evaluation metrics, with, as expected, different results. We achieve new state-of-the-art results on both source-and-target (for which there are no prior end-to-end results) and author-only (beating existing results) end-to-end factuality prediction.
Finally, we note that these experiments do not account for potential biases prevalent in fine-tuning large language models. We hypothesize that for some sources in text (e.g., power figures, authorities, or specific names), there may be biases towards certain labels. We will investigate these biases in future work, as an event factuality prediction system with inherent bias can have real-world implications.

Ethics Statement
As mentioned in the limitations section, these experiments do not account for potential biases prevalent in fine-tuned large language models. In a real-world deployment of our model, we hypothesize that factuality values could be mislabelled depending on bias towards the sources of utterances. For example, if a power figure states an event, will the event label be biased towards being factual just because of the source of the statement? We will investigate these questions and issues in future work.
We also note that our paper is foundational research and is not tied to any direct applications.

A Distribution of Data Set and Database
We intend to distribute the split of the source and target FactBank dataset. We have included the dataset in this submission for reviewers to inspect, but cannot distribute it for copyright reasons. Instead, we will release a Python script, alongside our SQLite database implementation, which reproduces the files submitted with this paper given the original FactBank corpus as input. The FactBank corpus can be obtained by researchers from the Linguistic Data Consortium, catalog number LDC2009T23.
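The released database can then be queried directly to extract (source, target, factuality) triplets. The sketch below is purely illustrative: the table and column names here are hypothetical stand-ins, not the actual schema of the released database (see Figure 1 for the real entity-relation diagram), and it uses an in-memory database so the example is self-contained.

```python
import sqlite3

# Hypothetical schema for illustration only; the released database
# will have its own table and column names.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sentences (sent_id INTEGER PRIMARY KEY, file TEXT, text TEXT);
CREATE TABLE annotations (sent_id INTEGER, source TEXT, target TEXT, factuality TEXT);
""")
con.execute("INSERT INTO sentences VALUES (1, 'wsj_0001', "
            "'Michael Wu said he''d double his money.')")
con.execute("INSERT INTO annotations VALUES (1, 'Wu', 'double', 'PR+')")

# Join annotations back to their sentence text.
rows = con.execute("""
    SELECT s.text, a.source, a.target, a.factuality
    FROM annotations a JOIN sentences s ON a.sent_id = s.sent_id
""").fetchall()
```

A query of this shape is what makes generating the model's input/output pairs from the corpus straightforward.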
Our dataset split is detailed in Table 3. We split our corpus using the same methods as Murzaku et al. (2022), which includes splitting by article.
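Splitting by article (so that no article contributes sentences to more than one partition) can be sketched as below. This is a generic illustration, not the Murzaku et al. (2022) script; the sentence-id format `article:index` and the 80/10/10 proportions are assumptions for the example.

```python
import random

def split_by_article(sent_ids, train=0.8, dev=0.1, seed=7):
    """Assign each sentence to train/dev/test so that no article is split
    across partitions. Assumes sentence ids look like 'article:index'."""
    articles = sorted({sid.split(":")[0] for sid in sent_ids})
    rng = random.Random(seed)       # fixed seed for a reproducible split
    rng.shuffle(articles)
    n_train = int(train * len(articles))
    n_dev = int(dev * len(articles))
    train_a = set(articles[:n_train])
    dev_a = set(articles[n_train:n_train + n_dev])

    def bucket(sid):
        art = sid.split(":")[0]
        return "train" if art in train_a else "dev" if art in dev_a else "test"

    return {sid: bucket(sid) for sid in sent_ids}

# 10 hypothetical articles with 3 sentences each.
ids = [f"wsj_{a:02d}:{s}" for a in range(10) for s in range(3)]
splits = split_by_article(ids)
```

Grouping at the article level avoids leaking near-duplicate context between train and test, which matters for document-level phenomena like nested sources.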

B Details on Experiments
We use a standard fine-tuning approach on the Flan-T5-base model with 247 million parameters. For computing, we used our employer's GPU cluster and performed experiments on a Tesla V100-SXM2 GPU. Compute jobs typically ranged from 10 minutes for small single-corpus combinations to 30 minutes for larger multi-task learning corpus combinations. We did not perform any hyperparameter search or tuning.
We fine-tuned our models for at most 10 epochs with a learning rate of 3e-4, using early stopping if neither the triplet F1 nor the factuality macro-F1 increased. All metrics were averaged over three runs using fixed seeds (7, 21, and 42); we report the average and the standard deviation over the three runs.
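The reported aggregation over the three fixed-seed runs amounts to a mean and sample standard deviation per metric; a minimal sketch (the example scores are made up, not results from the paper):

```python
from statistics import mean, stdev

def aggregate_runs(scores, ndigits=2):
    """Mean and sample standard deviation of one metric over fixed-seed runs."""
    return round(mean(scores), ndigits), round(stdev(scores), ndigits)

# e.g. a hypothetical triplet-F1 from the three seeds 7, 21, 42
avg, sd = aggregate_runs([61.2, 60.4, 62.0])
```

Reporting the standard deviation alongside the mean makes the seed sensitivity of generative fine-tuning visible, which a single-run number would hide.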
For prediction normalization in our fixed experimental setting, we use the editdistance Python package. We provide scripts for our prediction normalization and full evaluation, which will be made publicly available.
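The idea behind edit-distance-based normalization is to snap a generated token to the closest token actually present in the input sentence. The paper uses the third-party editdistance package; the sketch below instead inlines a plain Levenshtein implementation so the example is dependency-free, and the candidate vocabulary is assumed to be the input's whitespace tokens.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def normalize(token, input_tokens):
    """Snap a generated token to the nearest token from the input sentence."""
    return min(input_tokens, key=lambda w: levenshtein(token, w))

# A slightly garbled generation is mapped back onto an input token.
fixed = normalize("doubble", "Michael Wu sold the property double".split())
```

This guards the exact-match evaluation against near-miss generations (typos, dropped characters) that a strict string comparison would count as wrong.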
To fine-tune our models and run experiments, we used PyTorch Lightning (Falcon et al., 2019) and the transformers library provided by HuggingFace (Wolf et al., 2019). All code for fine-tuning, modelling, and preprocessing will be made available.

For the STF task, the output is a list of triplets:

Input: source target factuality: Michael Wu sold the property to five buyers and said he'd double his money.
Output: (Wu, double, true)

For the AOF task, the output is a list of pairs:

Input: author only factuality: Michael Wu sold the property to five buyers and said he'd double his money.
Output: (sold, true); (said, true); (double, unknown)

Attribute-Value Representation (AV) As an alternative, we structure our target text in an attribute-value pair format. For the STF task, we get:

Input: source target factuality: Michael Wu sold the property to five buyers and said he'd double his money.
Output: (source = Wu, target = double, true)

For the AOF task, we get:

Input: author only factuality: Michael Wu sold the property to five buyers and said he'd double his money.
Output: (target = sold, true); (target = said, true); (target = double, unknown)

Inline Representation (Anno) We also represent the AOF task as in-line annotations in the target text, since we can anchor the factuality on the target head word. We follow the same annotation format style as Zhang et al. (2021), who found that this text generation target performs well for tuple data representations. We repeat the example from above in this format:

Input: author only factuality: Michael Wu sold the property to five buyers and said he'd double his money.
Output: Michael Wu [sold | true] the property to five buyers and [said | true] he'd [double | unknown] his money.
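Recovering structured tuples from the generated target text can be sketched with a simple parser. This is an illustrative regex-based sketch, not the authors' released code; it handles the plain tuple format shown above, and AV-style outputs would additionally need their "key =" prefixes stripped.

```python
import re

def parse_tuples(output_text):
    """Parse generated text like '(Wu, double, true); (sold, true)'
    into tuples of stripped fields."""
    tuples = []
    for m in re.finditer(r"\(([^()]*)\)", output_text):
        fields = tuple(p.strip() for p in m.group(1).split(","))
        tuples.append(fields)
    return tuples

# STF-style triplets and AOF-style pairs both parse the same way.
stf = parse_tuples("(Wu, double, true)")
aof = parse_tuples("(sold, true); (said, true); (double, unknown)")
```

Because the generated text is the only interface to the model, a parser of this shape is the bridge between free-form generation and the tuple-level exact-match evaluation.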

Figure 1 :
Figure 1: Entity-Relation Diagram of the FactBank Database. Note that the one-to-one notation between mentions and sources only applies to source mentions, not target mentions, which are one-to-many.

Table 2 :
Results on triplet generation for the FactBank source and target projection (FBST), evaluated on triplet precision, recall, exact-match F1, source F1, target F1, and source-and-target F1. NoN denotes no normalization, N denotes normalization, and AV denotes the attribute-value structure. A shaded cell indicates the best-performing combination; a light shade means only a slight improvement.

Table 3 :
Information on data set sizes

Table 4 :
Results on triplet precision, recall, exact-match F1, source F1, target F1, and source-and-target F1 for the MTL experiments on generating factuality triplets for the FactBank source and target projection (FBST). A shaded cell indicates state-of-the-art; a light shade means only a slight improvement.