W. Victor Yarlott
Improving the Identification of the Discourse Function of News Article Paragraphs
Deya Banisakher | W. Victor Yarlott | Mohammed Aldawsari | Naphtali Rishe | Mark Finlayson
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events
Identifying the discourse structure of documents is an important task in understanding written text. Building on prior work, we demonstrate an improved approach to automatically identifying the discourse function of paragraphs in news articles. We start with the hierarchical theory of news discourse developed by van Dijk (1988), which proposes how paragraphs function within news articles. This discourse information sits at a level intermediate between phrase- or sentence-sized discourse segments and document genre, characterizing how individual paragraphs convey information about the events in the storyline of the article. Specifically, the theory categorizes the relationships between narrated events and (1) the overall storyline (such as Main Events, Background, or Consequences) as well as (2) commentary (such as Verbal Reactions and Evaluations). We trained and tested a linear-chain conditional random field (CRF) with new features to model van Dijk’s labels and compared it against several machine learning models presented in previous work. Our model significantly outperformed all baselines and prior approaches, achieving an average F1 score of 0.71, which represents a 31.5% improvement over the previously best-performing support vector machine model.
A Straightforward Approach to Narratologically Grounded Character Identification
Labiba Jahan | Rahul Mittal | W. Victor Yarlott | Mark Finlayson
Proceedings of the 28th International Conference on Computational Linguistics
One of the most fundamental elements of narrative is character: if we are to understand a narrative, we must be able to identify the characters of that narrative. Therefore, character identification is a critical task in narrative natural language understanding. Most prior work has lacked a narratologically grounded definition of character, instead relying on simplified or implicit definitions that do not capture essential distinctions between characters and other referents in narratives. In prior work we proposed a preliminary definition of character that was based on clear narratological principles: a character is an animate entity that is important to the plot. Here we flesh out this concept, demonstrate that it can be reliably annotated (0.78 Cohen’s κ), and provide annotations of 170 narrative texts, drawn from 3 different corpora, containing 1,347 character co-reference chains and 21,999 non-character chains that include 3,937 animate chains. Furthermore, we show that a supervised classifier using a simple set of easily computable features can effectively identify these characters (overall F1 of 0.90). A detailed error analysis shows that character identification is first and foremost affected by co-reference quality, and further, that the shorter a chain is, the harder it is to identify as a character. We release our code and data for the benefit of other researchers.
Identifying the Discourse Function of News Article Paragraphs
W. Victor Yarlott | Cristina Cornelio | Tian Gao | Mark Finlayson
Proceedings of the Workshop Events and Stories in the News 2018
Discourse structure is a key aspect of all forms of text, providing valuable information both to humans and machines. We applied the hierarchical theory of news discourse developed by van Dijk to examine how paragraphs operate as units of discourse structure within news articles, which we refer to here as document-level discourse. This document-level discourse provides a characterization of the content of each paragraph that describes its relation to the events presented in the article (such as main events, backgrounds, and consequences) as well as to other components of the story (such as commentary and evaluation). The purpose of a news discourse section is of great utility to story understanding because it affects both the importance and temporal order of items introduced in the text; therefore, if we know the news discourse purpose of different sections, we should be able to better rank events by importance and better construct timelines. We test two hypotheses: first, that people can reliably annotate news articles with van Dijk’s theory; second, that we can reliably predict these labels using machine learning. We show that annotators agree with each other to a high degree when applying the theory (F1 > 0.8, Cohen’s kappa > 0.6), demonstrating that it can be both learned and reliably applied by human annotators. Additionally, we demonstrate first steps toward machine learning of the theory, achieving a performance of F1 = 0.54, which is 65% of human performance. Moreover, we have generated a gold-standard, adjudicated corpus of 50 documents for document-level discourse annotation based on the ACE Phase 2 corpus.