<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W18">
  <paper id="5500">
    <title>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</title>
    <editor>James Thorne</editor>
    <editor>Andreas Vlachos</editor>
    <editor>Oana Cocarascu</editor>
    <editor>Christos Christodoulopoulos</editor>
    <editor>Arpit Mittal</editor>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W18-55</url>
    <bibtype>book</bibtype>
    <bibkey>FEVER:2018</bibkey>
  </paper>

  <paper id="5501">
    <title>The Fact Extraction and VERification (FEVER) Shared Task</title>
    <author><first>James</first><last>Thorne</last></author>
    <author><first>Andreas</first><last>Vlachos</last></author>
    <author><first>Oana</first><last>Cocarascu</last></author>
    <author><first>Christos</first><last>Christodoulopoulos</last></author>
    <author><first>Arpit</first><last>Mittal</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;9</pages>
    <url>http://www.aclweb.org/anthology/W18-5501</url>
    <abstract>We present the results of the first Fact Extraction and VERification (FEVER) Shared Task. The task challenged participants to classify whether human-written factoid claims could be Supported or Refuted using evidence retrieved from Wikipedia. We received entries from 23 competing teams, 19 of which scored higher than the previously published baseline. The best performing system achieved a FEVER score of 64.21%. In this paper, we present the results of the shared task and a summary of the systems, highlighting commonalities and innovations among participating systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>thorne-EtAl:2018:FEVER</bibkey>
  </paper>

  <paper id="5502">
    <title>The Data Challenge in Misinformation Detection: Source Reputation vs. Content Veracity</title>
    <author><first>Fatemeh</first><last>Torabi Asr</last></author>
    <author><first>Maite</first><last>Taboada</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>10&#8211;15</pages>
    <url>http://www.aclweb.org/anthology/W18-5502</url>
    <abstract>Misinformation detection at the level of full news articles is a text classification problem. Reliably labeled data in this domain is rare. Previous work relied on news articles collected from so-called &#x201c;reputable&#x201d; and &#x201c;suspicious&#x201d; websites and labeled accordingly. We leverage fact-checking websites to collect individually-labeled news articles with regard to the veracity of their content and use this data to test the cross-domain generalization of a classifier trained on bigger text collections but labeled according to source reputation. Our results suggest that reputation-based classification is not sufficient for predicting the veracity level of the majority of news articles, and that the system performance on different test datasets depends on topic distribution. Therefore collecting well-balanced and carefully-assessed training data is a priority for developing robust misinformation detection systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>torabiasr-taboada:2018:FEVER</bibkey>
  </paper>

  <paper id="5503">
    <title>Crowdsourcing Semantic Label Propagation in Relation Classification</title>
    <author><first>Anca</first><last>Dumitrache</last></author>
    <author><first>Lora</first><last>Aroyo</last></author>
    <author><first>Chris</first><last>Welty</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>16&#8211;21</pages>
    <url>http://www.aclweb.org/anthology/W18-5503</url>
    <abstract>Distant supervision is a popular method for performing relation extraction from text that is known to produce noisy labels. Most progress in relation extraction and classification has been made with crowdsourced corrections to distant-supervised labels, and there is evidence that indicates still more would be better. In this paper, we explore the problem of propagating human annotation signals gathered for open-domain relation classification through the CrowdTruth methodology for crowdsourcing, that captures ambiguity in annotations by measuring inter-annotator disagreement. Our approach propagates annotations to sentences that are similar in a low dimensional embedding space, expanding the number of labels by two orders of magnitude. Our experiments show significant improvement in a sentence-level multi-class relation classifier.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>dumitrache-aroyo-welty:2018:FEVER</bibkey>
  </paper>

  <paper id="5504">
    <title>Retrieve and Re-rank: A Simple and Effective IR Approach to Simple Question Answering over Knowledge Graphs</title>
    <author><first>Vishal</first><last>Gupta</last></author>
    <author><first>Manoj</first><last>Chinnakotla</last></author>
    <author><first>Manish</first><last>Shrivastava</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>22&#8211;27</pages>
    <url>http://www.aclweb.org/anthology/W18-5504</url>
    <abstract>SimpleQuestions is a commonly used benchmark</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gupta-chinnakotla-shrivastava:2018:FEVER</bibkey>
  </paper>

  <paper id="5505">
    <title>Information Nutrition Labels: A Plugin for Online News Evaluation</title>
    <author><first>Vincentius</first><last>Kevin</last></author>
    <author><first>Birte</first><last>Högden</last></author>
    <author><first>Claudia</first><last>Schwenger</last></author>
    <author><first>Ali</first><last>Sahan</last></author>
    <author><first>Neelu</first><last>Madan</last></author>
    <author><first>Piush</first><last>Aggarwal</last></author>
    <author><first>Anusha</first><last>Bangaru</last></author>
    <author><first>Farid</first><last>Muradov</last></author>
    <author><first>Ahmet</first><last>Aker</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>28&#8211;33</pages>
    <url>http://www.aclweb.org/anthology/W18-5505</url>
    <abstract>In this paper we present a browser plugin NewsScan that assists online news readers in evaluating the quality of online content they read by providing information nutrition labels for online news articles. In analogy to groceries, where nutrition labels help consumers make choices that they consider best for themselves, information nutrition labels tag online news articles with data that help readers judge the articles they engage with. This paper discusses the choice of the labels, their implementation and visualization.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kevin-EtAl:2018:FEVER</bibkey>
  </paper>

  <paper id="5506">
    <title>Joint Modeling for Query Expansion and Information Extraction with Reinforcement Learning</title>
    <author><first>Motoki</first><last>Taniguchi</last></author>
    <author><first>Yasuhide</first><last>Miura</last></author>
    <author><first>Tomoko</first><last>Ohkuma</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>34&#8211;39</pages>
    <url>http://www.aclweb.org/anthology/W18-5506</url>
    <abstract>Information extraction about an event can be improved by incorporating external evidence.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>taniguchi-miura-ohkuma:2018:FEVER</bibkey>
  </paper>

  <paper id="5507">
    <title>Towards Automatic Fake News Detection: Cross-Level Stance Detection in News Articles</title>
    <author><first>Costanza</first><last>Conforti</last></author>
    <author><first>Mohammad Taher</first><last>Pilehvar</last></author>
    <author><first>Nigel</first><last>Collier</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>40&#8211;49</pages>
    <url>http://www.aclweb.org/anthology/W18-5507</url>
    <abstract>In this paper, we propose to adapt the four-staged pipeline proposed by Zubiaga et al. (2018) for the Rumor Verification task to the problem of Fake News Detection. We show that the recently released Fnc-1 corpus covers two of its steps, namely the Tracking and the Stance Detection task. We identify asymmetry in length to be a key characteristic of the latter step, when adapted to the framework of Fake News Detection and propose to handle it as a specific type of Cross-Level Stance Detection. Inspired by theories from the field of Journalism Studies, we implement and test two architectures to successfully model the internal structure of an article and its interactions with a claim.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>conforti-pilehvar-collier:2018:FEVER</bibkey>
  </paper>

  <paper id="5508">
    <title>Belittling the Source: Trustworthiness Indicators to Obfuscate Fake News on the Web</title>
    <author><first>Diego</first><last>Esteves</last></author>
    <author><first>Aniketh Janardhan</first><last>Reddy</last></author>
    <author><first>Piyush</first><last>Chawla</last></author>
    <author><first>Jens</first><last>Lehmann</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>50&#8211;59</pages>
    <url>http://www.aclweb.org/anthology/W18-5508</url>
    <abstract>With the growth of the internet, the number of fake-news online has been proliferating every year. The consequences of such phenomena are manifold, ranging from lousy decision-making process to bullying and violence episodes. Therefore, fact-checking algorithms became a valuable asset. To this aim, an important step to detect fake-news is to have access to a credibility score for a given information source. However, most of the widely used Web indicators have either been shut-down to the public (e.g., Google PageRank) or are not free for use (Alexa Rank). Further existing databases are short-manually curated lists of online sources, which do not scale. Finally, most of the research on the topic is theoretical-based or explore confidential data in a restricted simulation environment. In this paper we explore current research, highlight the challenges and propose solutions to tackle the problem of classifying websites into a credibility scale. The proposed model automatically extracts source reputation cues and computes a credibility factor, providing valuable insights which can help in belittling dubious and confirming trustful unknown websites. Experimental results outperform state of the art in the 2-classes and 5-classes setting.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>esteves-EtAl:2018:FEVER</bibkey>
  </paper>

  <paper id="5509">
    <title>Automated Fact-Checking of Claims in Argumentative Parliamentary Debates</title>
    <author><first>Nona</first><last>Naderi</last></author>
    <author><first>Graeme</first><last>Hirst</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>60&#8211;65</pages>
    <url>http://www.aclweb.org/anthology/W18-5509</url>
    <abstract>We present an automated approach to distinguish true, false, stretch, and dodge statements in questions and answers in the Canadian Parliament. We leverage the truthfulness annotations of a U.S. fact-checking corpus by training a neural net model and incorporating the prediction probabilities into our models.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>naderi-hirst:2018:FEVER</bibkey>
  </paper>

  <paper id="5510">
    <title>Stance Detection in Fake News: A Combined Feature Representation</title>
    <author><first>Bilal</first><last>Ghanem</last></author>
    <author><first>Paolo</first><last>Rosso</last></author>
    <author><first>Francisco</first><last>Rangel</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>66&#8211;71</pages>
    <url>http://www.aclweb.org/anthology/W18-5510</url>
    <abstract>With the uncontrolled increasing of fake news and rumors over the Web, different approaches have been proposed to address the problem. In this paper, we present an approach that combines lexical, word embeddings and n-gram features to detect the stance in fake news. Our approach has been tested on the Fake News Challenge (FNC-1) dataset. Given a news title-article pair, the FNC-1 task aims at determining the relevance of the article and the title. Our proposed approach has achieved an accurate result (59.6% Macro F1) that is close to the state-of-the-art result with 0.013 difference using a simple feature representation. Furthermore, we have investigated the importance of different lexicons in the detection of the classification labels.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ghanem-rosso-rangel:2018:FEVER</bibkey>
  </paper>

  <paper id="5511">
    <title>Zero-shot Relation Classification as Textual Entailment</title>
    <author><first>Abiola</first><last>Obamuyide</last></author>
    <author><first>Andreas</first><last>Vlachos</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>72&#8211;78</pages>
    <url>http://www.aclweb.org/anthology/W18-5511</url>
    <abstract>We consider the task of relation classification, and pose this task as one of textual entailment. We show that this formulation leads to several advantages, including the ability to (i) perform zero-shot relation classification by exploiting relation descriptions, (ii) utilize existing textual entailment models, and (iii) leverage readily available textual entailment datasets, to enhance the performance of relation classification systems. Our experiments show that the proposed approach achieves 20.16% and 61.32% in F1 zero-shot classification performance on two datasets, which further improved to 22.80% and 64.78% respectively with the use of conditional encoding.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>obamuyide-vlachos:2018:FEVER</bibkey>
  </paper>

  <paper id="5512">
    <title>Teaching Syntax by Adversarial Distraction</title>
    <author><first>Juho</first><last>Kim</last></author>
    <author><first>Christopher</first><last>Malon</last></author>
    <author><first>Asim</first><last>Kadav</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>79&#8211;84</pages>
    <url>http://www.aclweb.org/anthology/W18-5512</url>
    <abstract>Existing entailment datasets mainly pose problems which can be answered without attention to grammar or word order. Learning syntax requires comparing examples where different grammar and word order change the desired classification. We introduce several datasets based on synthetic transformations of natural entailment examples in SNLI or FEVER, to teach aspects of grammar and word order. We show that without retraining, popular entailment models are unaware that these syntactic differences change meaning. With retraining, some but not all popular entailment models can learn to compare the syntax properly.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kim-malon-kadav:2018:FEVER</bibkey>
  </paper>

  <paper id="5513">
    <title>Where is Your Evidence: Improving Fact-checking by Justification Modeling</title>
    <author><first>Tariq</first><last>Alhindi</last></author>
    <author><first>Savvas</first><last>Petridis</last></author>
    <author><first>Smaranda</first><last>Muresan</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>85&#8211;90</pages>
    <url>http://www.aclweb.org/anthology/W18-5513</url>
    <abstract>Fact-checking is a journalistic practice that compares a claim made publicly against trusted sources of facts. Wang (2017) introduced a large dataset of validated claims from the PolitiFact.com website (LIAR dataset), enabling the development of machine learning approaches for fact-checking. However, approaches based on this dataset have focused primarily on modeling the claim and speaker-related metadata, without considering the evidence used by humans in labeling the claims. We extend the LIAR dataset by automatically extracting the justification from the fact-checking article used by humans to label a given claim. We show that modeling the extracted justification in conjunction with the claim (and metadata) provides a significant improvement regardless of the machine learning model used (feature-based or deep learning) both in a binary classification task (true, false) and in a six-way classification task (pants on fire, false, mostly false, half true, mostly true, true).</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>alhindi-petridis-muresan:2018:FEVER</bibkey>
  </paper>

  <paper id="5514">
    <title>Affordance Extraction and Inference based on Semantic Role Labeling</title>
    <author><first>Daniel</first><last>Loureiro</last></author>
    <author><first>Al&#237;pio</first><last>Jorge</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>91&#8211;96</pages>
    <url>http://www.aclweb.org/anthology/W18-5514</url>
    <abstract>Common-sense reasoning is becoming increasingly important for the advancement of Natural Language Processing. While word embeddings have been very successful, they cannot explain which aspects of 'coffee' and 'tea' make them similar, or how they could be related to 'shop'. In this paper, we propose an explicit word representation that builds upon the Distributional Hypothesis to represent meaning from semantic roles, and allow inference of relations from their meshing, as supported by the affordance-based Indexical Hypothesis. We find that our model improves the state-of-the-art on unsupervised word similarity tasks while allowing for direct inference of new relations from the same vector space.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>loureiro-jorge:2018:FEVER</bibkey>
  </paper>

  <paper id="5515">
    <title>UCL Machine Reading Group: Four Factor Framework For Fact Finding (HexaF)</title>
    <author><first>Takuma</first><last>Yoneda</last></author>
    <author><first>Jeff</first><last>Mitchell</last></author>
    <author><first>Johannes</first><last>Welbl</last></author>
    <author><first>Pontus</first><last>Stenetorp</last></author>
    <author><first>Sebastian</first><last>Riedel</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>97&#8211;102</pages>
    <url>http://www.aclweb.org/anthology/W18-5515</url>
    <abstract>Our system is a four stage model consisting of document retrieval, sentence retrieval, natural language inference and aggregation. Document retrieval attempts to find the name of a Wikipedia article in the claim, and then ranks each article based on capitalisation, sentence position and token match features. A set of sentences are then retrieved from the top ranked articles, based on token matches with the claim and position in the article. A natural language inference model is then applied to each of these sentences paired with the claim, giving a prediction for each potential evidence. These predictions are then aggregated using a simple MLP, and the sentences are reranked to keep only the evidence consistent with the final prediction.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yoneda-EtAl:2018:FEVER</bibkey>
  </paper>

  <paper id="5516">
    <title>Multi-Sentence Textual Entailment for Claim Verification</title>
    <author><first>Andreas</first><last>Hanselowski</last></author>
    <author><first>Hao</first><last>Zhang</last></author>
    <author><first>Zile</first><last>Li</last></author>
    <author><first>Daniil</first><last>Sorokin</last></author>
    <author><first>Benjamin</first><last>Schiller</last></author>
    <author><first>Claudia</first><last>Schulz</last></author>
    <author><first>Iryna</first><last>Gurevych</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>103&#8211;108</pages>
    <url>http://www.aclweb.org/anthology/W18-5516</url>
    <abstract>The Fact Extraction and VERification</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hanselowski-EtAl:2018:FEVER</bibkey>
  </paper>

  <paper id="5517">
    <title>Team Papelo: Transformer Networks at FEVER</title>
    <author><first>Christopher</first><last>Malon</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>109&#8211;113</pages>
    <url>http://www.aclweb.org/anthology/W18-5517</url>
    <abstract>We develop a system for the FEVER fact extraction and verification challenge that uses a high precision entailment classifier based on transformer networks pretrained with language modeling, to classify a broad set of potential evidence. The precision of the entailment classifier allows us to enhance recall by considering every statement from several articles to decide upon each claim. We include not only the articles best matching the claim text by TFIDF score, but read additional articles whose titles match named entities and capitalized expressions occurring in the claim text. The entailment module evaluates potential evidence one statement at a time, together with the title of the page the evidence came from (providing a hint about possible pronoun antecedents). In preliminary evaluation, the system achieves .5736 FEVER score, .6108 label accuracy, and .6485 evidence F1 on the FEVER shared task test set.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>malon:2018:FEVER</bibkey>
  </paper>

  <paper id="5518">
    <title>Uni-DUE Student Team: Tackling fact checking through decomposable attention neural network</title>
    <author><first>Jan</first><last>Kowollik</last></author>
    <author><first>Ahmet</first><last>Aker</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>114&#8211;118</pages>
    <url>http://www.aclweb.org/anthology/W18-5518</url>
    <abstract>In this paper we present our system for the FEVER Challenge.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kowollik-aker:2018:FEVER</bibkey>
  </paper>

  <paper id="5519">
    <title>SIRIUS-LTG: An Entity Linking Approach to Fact Extraction and Verification</title>
    <author><first>Farhad</first><last>Nooralahzadeh</last></author>
    <author><first>Lilja</first><last>Øvrelid</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>119&#8211;123</pages>
    <url>http://www.aclweb.org/anthology/W18-5519</url>
    <abstract>This article presents the SIRIUS-LTG system for the Fact Extraction and VERification (FEVER) Shared Task. It consists of three components: 1) Wikipedia Page Retrieval: First we extract the entities in the claim, then we find potential Wikipedia URI candidates for each of the entities using a SPARQL query over DBpedia. 2) Sentence selection: We investigate various techniques, i.e. Smooth Inverse Frequency (SIF), Word Mover’s Distance (WMD), Soft-Cosine Similarity, Cosine similarity with uni-gram Term Frequency Inverse Document Frequency (TF-IDF), to rank sentences by their similarity to the claim. 3) Textual Entailment: We compare three models for the task of claim classification. We apply a Decomposable Attention (DA) model (Parikh et al., 2016), a Decomposed Graph Entailment (DGE) model (Khot et al., 2018) and a Gradient-Boosted Decision Trees (TalosTree) model (Sean et al., 2017) for this task. The experiments show that the pipeline with simple Cosine Similarity using TF-IDF in sentence selection along with the DA model as labelling model achieves the best results on the development set (F1 evidence: 32.17, label accuracy: 59.61 and FEVER score: 0.3778). Furthermore, it obtains 30.19, 48.87 and 36.55 in terms of F1 evidence, label accuracy and FEVER score, respectively, on the test set. Our system ranks 15th among 23 participants in the shared task prior to any human evaluation of the evidence.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nooralahzadeh-vrelid:2018:FEVER</bibkey>
  </paper>

  <paper id="5520">
    <title>Integrating Entity Linking and Evidence Ranking for Fact Extraction and Verification</title>
    <author><first>Motoki</first><last>Taniguchi</last></author>
    <author><first>Tomoki</first><last>Taniguchi</last></author>
    <author><first>Takumi</first><last>Takahashi</last></author>
    <author><first>Yasuhide</first><last>Miura</last></author>
    <author><first>Tomoko</first><last>Ohkuma</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>124&#8211;126</pages>
    <url>http://www.aclweb.org/anthology/W18-5520</url>
    <abstract>We describe here our system and results on the FEVER shared task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>taniguchi-EtAl:2018:FEVER</bibkey>
  </paper>

  <paper id="5521">
    <title>Robust Document Retrieval and Individual Evidence Modeling for Fact Extraction and Verification</title>
    <author><first>Tuhin</first><last>Chakrabarty</last></author>
    <author><first>Tariq</first><last>Alhindi</last></author>
    <author><first>Smaranda</first><last>Muresan</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>127&#8211;131</pages>
    <url>http://www.aclweb.org/anthology/W18-5521</url>
    <abstract>This paper presents the ColumbiaNLP submission for the FEVER Workshop Shared Task. Our system is an end-to-end pipeline that extracts factual evidence from Wikipedia and infers a decision about the truthfulness of the claim based on the extracted evidence. Our pipeline achieves significant improvement over the baseline for all the components (Document Retrieval, Sentence Selection and Textual Entailment) both on the development set and the test set. Our team finished 6th out of 24 teams on the leader-board based on the preliminary results with a FEVER score of 49.06 on the blind test set compared to 27.45 of the baseline system.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chakrabarty-alhindi-muresan:2018:FEVER</bibkey>
  </paper>

  <paper id="5522">
    <title>DeFactoNLP: Fact Verification using Entity Recognition, TFIDF Vector Comparison and Decomposable Attention</title>
    <author><first>Aniketh Janardhan</first><last>Reddy</last></author>
    <author><first>Gil</first><last>Rocha</last></author>
    <author><first>Diego</first><last>Esteves</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>132&#8211;137</pages>
    <url>http://www.aclweb.org/anthology/W18-5522</url>
    <abstract>In this paper, we describe DeFactoNLP, the system we designed for the FEVER 2018 Shared Task. The aim of this task was to conceive a system that can not only automatically assess the veracity of a claim but also retrieve evidence supporting this assessment from Wikipedia. In our approach, the Wikipedia documents whose Term Frequency-Inverse Document Frequency (TFIDF) vectors are most similar to the vector of the claim and those documents whose names are similar to those of the named entities (NEs) mentioned in the claim are identified as the documents which might contain evidence. The sentences in these documents are then supplied to a textual entailment recognition module. This module calculates the probability of each sentence supporting the claim, contradicting the claim or not providing any relevant information to assess the veracity of the claim. Various features computed using these probabilities are finally used by a Random Forest classifier to determine the overall truthfulness of the claim. The sentences which support this classification are returned as evidence. Our approach achieved a 0.4277 evidence F1-score, a 0.5136 label accuracy and a 0.3833 FEVER score.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>reddy-rocha-esteves:2018:FEVER</bibkey>
  </paper>

  <paper id="5523">
    <title>An End-to-End Multi-task Learning Model for Fact Checking</title>
    <author><first>Sizhen</first><last>Li</last></author>
    <author><first>Shuai</first><last>Zhao</last></author>
    <author><first>Bo</first><last>Cheng</last></author>
    <author><first>Hao</first><last>Yang</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>138&#8211;144</pages>
    <url>http://www.aclweb.org/anthology/W18-5523</url>
    <abstract>With huge amount of information generated every day on the web, fact checking is an important and challenging task which can help people identify the authenticity of most claims as well as providing evidences selected from knowledge source like Wikipedia. Here we decompose this problem into two parts: an entity linking task (retrieving relative Wikipedia pages) and recognizing textual entailment between the claim and selected pages. In this paper, we present an end-to-end multi-task learning with bi-direction attention (EMBA) model to classify the claim as "supports", "refutes" or "not enough info" with respect to the pages retrieved and detect sentences as evidence at the same time. We conduct experiments on the FEVER (Fact Extraction and VERification) paper test dataset and shared task test dataset, a new public dataset for verification against textual sources. Experimental results show that our method achieves comparable performance compared with the baseline system.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>li-EtAl:2018:FEVER</bibkey>
  </paper>

  <paper id="5524">
    <title>Team GESIS Cologne: An all in all sentence-based approach for FEVER</title>
    <author><first>Wolfgang</first><last>Otto</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>145&#8211;149</pages>
    <url>http://www.aclweb.org/anthology/W18-5524</url>
    <abstract>In this system description of our pipeline to participate at the Fever Shared Task, we describe our sentence-based approach. Throughout all steps of our pipeline, we regarded single sentences as our processing unit. In our IR-</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>otto:2018:FEVER</bibkey>
  </paper>

  <paper id="5525">
    <title>Team SWEEPer: Joint Sentence Extraction and Fact Checking with Pointer Networks</title>
    <author><first>Christopher</first><last>Hidey</last></author>
    <author><first>Mona</first><last>Diab</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>150&#8211;155</pages>
    <url>http://www.aclweb.org/anthology/W18-5525</url>
    <abstract>Our model for fact checking and verification consists of two stages: 1) identifying relevant documents using lexical and syntactic features from the claim and first two sentences in the Wikipedia article and 2) jointly modeling sentence extraction and verification. As the tasks of fact checking and finding evidence are dependent on each other, an ideal model would consider the veracity of the claim when finding evidence and also find only the evidence that supports/refutes the position of the claim. We thus jointly model the second stage by using a pointer network with the claim and evidence sentence represented using the ESIM module. For stage 2, we first train both components using multi-task learning over a larger memory of extracted sentences, then tune parameters to first extract sentences and predict the relation from only the extracted sentences.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hidey-diab:2018:FEVER</bibkey>
  </paper>

  <paper id="5526">
    <title>QED: A fact verification system for the FEVER shared task</title>
    <author><first>Jackson</first><last>Luken</last></author>
    <author><first>Nanjiang</first><last>Jiang</last></author>
    <author><first>Marie-Catherine</first><last>de Marneffe</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>156&#8211;160</pages>
    <url>http://www.aclweb.org/anthology/W18-5526</url>
    <abstract>This paper describes our system submission to the 2018 Fact Extraction and VERification (FEVER) shared task. The system uses a heuristics-based approach for evidence extraction and a modified version of the inference model by Parikh et al. (2016) for classification. Our process is broken down into three modules: potentially relevant documents are gathered based on key phrases in the claim, then any possible evidence sentences inside those documents are extracted, and finally our classifier discards any evidence deemed irrelevant and uses the remaining to classify the claim's veracity. Our system beats the shared task baseline by 12% and is successful at finding correct evidence (evidence retrieval F1 of 62.5% on the development set).</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>luken-jiang-demarneffe:2018:FEVER</bibkey>
  </paper>

  <paper id="5527">
    <title>Team UMBC-FEVER : Claim verification using Semantic Lexical Resources</title>
    <author><first>Ankur</first><last>Padia</last></author>
    <author><first>Francis</first><last>Ferraro</last></author>
    <author><first>Tim</first><last>Finin</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>161&#8211;165</pages>
    <url>http://www.aclweb.org/anthology/W18-5527</url>
    <abstract>We describe our system used in the 2018</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>padia-ferraro-finin:2018:FEVER</bibkey>
  </paper>

  <paper id="5528">
    <title>A mostly unlexicalized model for recognizing textual entailment</title>
    <author><first>Mithun</first><last>Paul</last></author>
    <author><first>Rebecca</first><last>Sharp</last></author>
    <author><first>Mihai</first><last>Surdeanu</last></author>
    <booktitle>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</booktitle>
    <month>November</month>
    <year>2018</year>
    <address>Brussels, Belgium</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>166&#8211;171</pages>
    <url>http://www.aclweb.org/anthology/W18-5528</url>
    <abstract>Many approaches to automatically recognizing entailment relations have employed classifiers over hand engineered lexicalized features, or deep learning models that implicitly capture lexicalization through word embeddings. This reliance on lexicalization may complicate the adaptation of these tools between domains. For example, such a system trained in the news domain may learn that a sentence like &#x201c;Palestinians recognize Texas as part of Mexico&#x201d; tends to be unsupported, but this fact (and its corresponding lexicalized cues) have no value in, say, a scientific domain. To mitigate this dependence on lexicalized information, in this paper we propose a model that reads two sentences, from any given domain, to determine entailment without using lexicalized features. Instead our model relies on features that are either unlexicalized or are domain independent such as proportion of negated verbs, antonyms, or noun overlap. In its current implementation, this model does not perform well on the FEVER dataset, due to two reasons. First, for the information retrieval portion of the task we used the baseline system provided, since this was not the aim of our project. Second, this is work in progress and we still are in the process of identifying more features and gradually increasing the accuracy of our model. In the end, we hope to build a generic end-to-end classifier, which can be used in a domain outside the one in which it was trained, with no or minimal re-training.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>paul-sharp-surdeanu:2018:FEVER</bibkey>
  </paper>

</volume>