Laia Mayol


2020

We take a close look at a recent dataset of TED-talks annotated with the questions they implicitly evoke, TED-Q (Westera et al., 2020). We test to what extent the relation between a discourse and the questions it evokes is merely one of similarity or association, as opposed to deeper semantic/pragmatic interpretation. We do so by turning the TED-Q dataset into a binary classification task, constructing an analogous task from explicit questions we extract from the BookCorpus (Zhu et al., 2015), and fitting a BERT-based classifier alongside models based on different notions of similarity. The BERT-based classifier, achieving close to human performance, outperforms all similarity-based models, suggesting that there is more to identifying true evoked questions than plain similarity.
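
To make the contrast concrete, a similarity-based baseline for the binary task (given a discourse context and a candidate question, decide whether the question is evoked) can be sketched as follows. This is only an illustrative sketch, not one of the models from the paper: the bag-of-words cosine measure, the function names, the threshold value and the example sentences are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_evoked(context, question, threshold=0.2):
    """Hypothetical baseline: call a question 'evoked' if it is similar enough.

    The threshold is an arbitrary illustrative value, not a tuned parameter.
    """
    return cosine_similarity(context, question) >= threshold


# Invented example, not taken from TED-Q or the BookCorpus.
context = "The speaker built a robot that learns to walk."
related = "How does the robot learn to walk?"
unrelated = "What is your favorite color?"

print(is_evoked(context, related))
print(is_evoked(context, unrelated))
```

A baseline of this kind only measures lexical overlap, which is the sort of plain similarity the abstract argues is insufficient for identifying truly evoked questions.
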
We present a new dataset of TED-talks annotated with the questions they evoke and, where available, the answers to these questions. Evoked questions represent a hitherto mostly unexplored type of linguistic data, which promises to open up important new lines of research, especially related to the Question Under Discussion (QUD)-based approach to discourse structure. In this paper we introduce the method and release the first installment of our data to the public. We summarize and explore the current dataset, illustrate its potential by providing new evidence for the relation between predictability and implicitness – capitalizing on the already existing PDTB-style annotations for the texts we use – and outline directions for future research. The dataset should be of interest, at its current scale, to researchers in formal and experimental pragmatics, discourse coherence, information structure, and discourse expectations and processing. Our data-gathering procedure is designed to scale up, relying on crowdsourcing by non-expert annotators, with its utility for Natural Language Processing in mind (e.g., dialogue systems, conversational question answering).

2018

The literature on Romance null-subject languages has often postulated a division of labor between Null and Overt pronouns: Nulls prefer to retrieve an antecedent in subject position, whereas Overts prefer an antecedent in a lower syntactic position (Carminati, 2002). However, recent research on English pronouns (Rohde and Kehler, 2014) has shown that grammatical function alone cannot explain pronoun interpretation. According to this line of work, pronoun interpretation and production are sensitive to different sets of factors and, instead of being mirror images of each other, are related probabilistically in a Bayesian fashion. This paper tests this Bayesian model with Catalan data from two discourse-completion experiments, studying the grammatical and pragmatic factors that affect the interpretation and production of Null and Overt pronouns. Our main result is that both Null and Overt pronouns show asymmetries between interpretation and production: (1) the production of Null pronouns is affected mainly by grammatical factors (they are subject-biased), but their interpretation is also influenced by pragmatic factors (in particular, rhetorical relations), and (2) while Overt pronouns have a strong interpretation bias towards the object, the data indicate that they are not the preferred form to refer to the object.
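
The Bayesian relation between production and interpretation mentioned above is commonly written as P(referent | pronoun) ∝ P(pronoun | referent) · P(referent). The sketch below works through that formula with made-up numbers; the function name and all probabilities are hypothetical and are not the experimental results reported in the paper.

```python
def interpretation_bias(production, prior):
    """Apply Bayes' rule: P(referent | pronoun) is proportional to
    P(pronoun | referent) * P(referent), normalized over the referents."""
    joint = {ref: production[ref] * prior[ref] for ref in prior}
    total = sum(joint.values())
    return {ref: p / total for ref, p in joint.items()}


# Hypothetical production bias: how likely a Null pronoun is chosen
# when the speaker refers to the previous subject vs. the previous object.
production_null = {"subject": 0.8, "object": 0.2}

# Hypothetical next-mention priors under two different rhetorical relations.
prior_neutral = {"subject": 0.5, "object": 0.5}
prior_object_biased = {"subject": 0.2, "object": 0.8}

posterior_neutral = interpretation_bias(production_null, prior_neutral)
posterior_object_biased = interpretation_bias(production_null, prior_object_biased)

print(posterior_neutral)
print(posterior_object_biased)
```

With the production bias held fixed, changing only the prior changes the resulting interpretation bias. This illustrates how, under such a model, a form can be subject-biased in production while its interpretation still varies with pragmatic factors such as rhetorical relations.
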