Massimo Poesio

Also published as: M. Poesio


2024

pdf bib
Using In-context Learning to Automate AI Image Generation for a Gamified Text Labelling Task
Fatima Althani | Chris Madge | Massimo Poesio
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

This paper explores a novel automated method to produce AI-generated images for a gamified text-labelling task. By leveraging the in-context learning capabilities of GPT-4, we automate the optimisation of text-to-image prompts to align with the text being labelled in the part-of-speech tagging task. As an initial evaluation, we compare the optimised prompts to the original sentences based on imageability and concreteness scores. Our results revealed that optimised prompts had significantly higher imageability and concreteness scores. Moreover, to evaluate text-to-image outputs, we generate images using Stable Diffusion XL based on the two prompt types, optimised prompts and the original sentences. Using the automated LAION-Aesthetics predictor model, we assigned aesthetic scores to the generated images. The outputs generated from optimised prompts scored significantly higher in predicted aesthetics than those generated from the original sentences. Our preliminary findings suggest that this methodology yields significantly more aesthetic text-to-image outputs than using the original sentence as a prompt. While the initial results are promising, the text-labelling task and AI-generated images presented in this paper have yet to undergo human evaluation.

pdf bib
Linguistic Acceptability and Usability Enhancement: A Case Study of GWAP Evaluation and Redesign
Wateen Abdullah Aliady | Massimo Poesio
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

Collecting high-quality annotations for Natural Language Processing (NLP) tasks poses challenges. Gamified annotation systems, like Games-with-a-Purpose (GWAP), have become popular tools for data annotation. For GWAPs to be effective, they must be user-friendly and produce high-quality annotations to ensure the collected data’s usefulness. This paper investigates the effectiveness of a gamified approach through two specific studies on an existing GWAP designed for collecting NLP coreference judgments. The first study involved preliminary usability testing using the concurrent think-aloud method to gather open-ended feedback. This feedback was crucial in pinpointing design issues. Following this, we conducted semi-structured interviews with our participants, and the insights collected from these interviews were instrumental in crafting player personas, which informed design improvements aimed at enhancing user experience. The outcomes of our research have been generalized to benefit other GWAP implementations. The second study evaluated the linguistic acceptability and reliability of the data collected through our GWAP. Our findings indicate that our GWAP produced reliable corpora with 91.49% accuracy and 0.787 Cohen’s kappa.
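
As a point of reference for the agreement figures above, Cohen's kappa for two annotators can be computed as below; this is a generic illustrative implementation with invented label lists, not the tooling or data used in the study.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Illustrative implementation: chance agreement is estimated from
    each annotator's own label distribution."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Invented toy annotations: perfect agreement gives kappa = 1,
# chance-level agreement gives kappa = 0.
print(cohens_kappa([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(cohens_kappa([0, 1, 0, 1], [0, 0, 1, 1]))  # 0.0
```

In practice, scikit-learn's `cohen_kappa_score` provides an equivalent, more robust implementation.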

pdf bib
Proceedings of The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Anna Nedoluzhko | Massimo Poesio | Sameer Pradhan | Vincent Ng
Proceedings of The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference

pdf bib
Soft metrics for evaluation with disagreements: an assessment
Giulia Rizzi | Elisa Leonardelli | Massimo Poesio | Alexandra Uma | Maja Pavlovic | Silviu Paun | Paolo Rosso | Elisabetta Fersini
Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024

The move towards preserving judgement disagreements in NLP requires the identification of adequate evaluation metrics. We identify a set of key properties that such metrics should have, and assess the extent to which natural candidates for soft evaluation, such as Cross Entropy, satisfy those properties. We employ a theoretical framework, supported by a visual approach, by practical examples, and by the analysis of a real case scenario. Our results indicate that Cross Entropy can yield fairly paradoxical results in some cases, whereas other measures, such as Manhattan distance and Euclidean distance, exhibit more intuitive behavior, at least in the case of binary classification.
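
One way to see why distance-based measures can behave more intuitively than Cross Entropy as soft evaluation metrics: cross entropy between a target distribution and a prediction is bounded below by the entropy of the target, so it is non-zero even for a perfect match, whereas Manhattan and Euclidean distance are zero exactly when the two distributions coincide. The sketch below illustrates this with invented binary soft labels (not examples from the paper):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) between a target soft label p and a prediction q."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def manhattan(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two hypothetical binary soft labels (invented for illustration).
target = [0.6, 0.4]
pred_a = [0.6, 0.4]   # matches the target exactly
pred_b = [0.9, 0.1]   # over-confident prediction

for name, metric in [("CE", cross_entropy), ("L1", manhattan), ("L2", euclidean)]:
    print(name, metric(target, pred_a), metric(target, pred_b))
```

Note that `cross_entropy(target, pred_a)` is already about 0.67 (the entropy of the target) even though the prediction is perfect, while both distances are 0.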

pdf bib
The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation
Maja Pavlovic | Massimo Poesio
Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024

Recent studies focus on exploring the capability of Large Language Models (LLMs) for data annotation. Our work, firstly, offers a comparative overview of twelve such studies that investigate labelling with LLMs, particularly focusing on classification tasks. Secondly, we present an empirical analysis that examines the degree of alignment between the opinion distributions returned by GPT and those provided by human annotators across four subjective datasets. Our analysis supports the minority of studies that consider diverse perspectives when evaluating data annotation tasks, and highlights the need for further research in this direction.

pdf bib
Analyzing and Enhancing Clarification Strategies for Ambiguous References in Consumer Service Interactions
Changling Li | Yujian Gan | Zhenrong Yang | Youyang Chen | Xinxuan Qiu | Yanni Lin | Matthew Purver | Massimo Poesio
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

When customers present ambiguous references, service staff typically need to clarify the customers’ specific intentions. To advance research in this area, we collected 1,000 real-world consumer dialogues with ambiguous references. This dataset will be used for subsequent studies to identify ambiguous references and generate responses. Our analysis of the dataset revealed common strategies employed by service staff, including directly asking clarification questions (CQ) and listing possible options before asking a clarification question (LCQ). However, we found that merely using CQ often fails to fully satisfy customers. In contrast, using LCQ, as well as recommending specific products after listing possible options, proved more effective in resolving ambiguous references and enhancing customer satisfaction.

pdf bib
Polysemy—Evidence from Linguistics, Behavioral Science, and Contextualized Language Models
Janosch Haber | Massimo Poesio
Computational Linguistics, Volume 50, Issue 1 - March 2024

Polysemy is the type of lexical ambiguity where a word has multiple distinct but related interpretations. In the past decade, it has been the subject of a great many studies across multiple disciplines including linguistics, psychology, neuroscience, and computational linguistics, which have made it increasingly clear that the complexity of polysemy precludes simple, universal answers, especially concerning the representation and processing of polysemous words. But fuelled by the growing availability of large, crowdsourced datasets providing substantial empirical evidence; improved behavioral methodology; and the development of contextualized language models capable of encoding the fine-grained meaning of a word within a given context, the literature on polysemy has recently developed more complex theoretical analyses. In this survey we discuss these recent contributions to the investigation of polysemy against the backdrop of a long legacy of research across multiple decades and disciplines. Our aim is to bring together different perspectives to achieve a more complete picture of the heterogeneity and complexity of the phenomenon of polysemy. Specifically, we highlight evidence supporting a range of hybrid models of the mental processing of polysemes. These hybrid models combine elements from different previous theoretical approaches to explain patterns and idiosyncrasies in the processing of polysemous words that the best-known models so far have failed to account for.
Our literature review finds that (i) traditional analyses of polysemy can be limited in their generalizability by loose definitions and selective materials; (ii) linguistic tests provide useful evidence on individual cases, but fail to capture the full range of factors involved in the processing of polysemous sense extensions; and (iii) recent behavioral (psycho)linguistic studies, large-scale annotation efforts, and investigations leveraging contextualized language models provide accumulating evidence suggesting that polysemous sense similarity covers a wide spectrum between identity of sense and homonymy-like unrelatedness of meaning. We hope that the interdisciplinary account of polysemy provided in this survey inspires further fundamental research on the nature of polysemy and better equips applied research to deal with the complexity surrounding the phenomenon, for example, by enabling the development of benchmarks and testing paradigms for large language models informed by a greater portion of the rich evidence on the phenomenon currently available.

pdf bib
Polysemy through the lens of psycholinguistic variables: a dataset and an evaluation of static and contextualized language models
Andrea Bruera | Farbod Zamani | Massimo Poesio
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Polysemes are words that can have different senses depending on the context of utterance: for instance, ‘newspaper’ can refer to an organization (as in ‘manage the newspaper’) or to an object (as in ‘open the newspaper’). Contrary to a large body of evidence coming from psycholinguistics, polysemy has been traditionally modelled in NLP by assuming that each sense should be given a separate representation in a lexicon (e.g. WordNet). This led to the current situation, where datasets used to evaluate the ability of computational models of semantics miss crucial details about the representation of polysemes, thus limiting the amount of evidence that can be gained from their use. In this paper we propose a framework to approach polysemy as a continuous variation in psycholinguistic properties of a word in context. This approach accommodates different sense interpretations, without postulating clear-cut jumps between senses. First we describe a publicly available English dataset that we collected, where polysemes in context (verb-noun phrases) are annotated for their concreteness and body sensory strength. Then, we evaluate static and contextualized language models in their ability to predict the ratings of each polyseme in context, as well as in their ability to capture the distinction among senses, revealing and characterizing in an interpretable way the models’ flaws.

pdf bib
The ARRAU 3.0 Corpus
Massimo Poesio | Maris Camilleri | Paloma Carretero Garcia | Juntao Yu | Mark-Christoph Müller
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)

The ARRAU corpus is an anaphorically annotated corpus designed to cover a wide variety of aspects of anaphoric reference in a variety of genres, including both written text and spoken language. The objective of this annotation project is to push forward the state of the art in anaphoric annotation, by overcoming the limitations of current annotation practice and the scope of current models of anaphoric interpretation, which in turn may reveal other issues. The resulting corpus is still therefore very much a work in progress almost twenty years after the project started. In this paper, we discuss the issues identified with the coding scheme used for the previous release, ARRAU 2, and through the use of this corpus for three shared tasks; the proposed solutions to these issues; and the resulting corpus, ARRAU 3.

pdf bib
A Fine-grained citation graph for biomedical academic papers: the finding-citation graph
Yuan Liang | Massimo Poesio | Roonak Rezvani
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

Citations typically mention findings as well as papers. To model this richer notion of citation, we introduce a richer form of citation graph, with nodes for both academic papers and their findings: the finding-citation graph (FCG). We also present a new pipeline to construct such a graph, which includes a finding identification module and a citation sentence extraction module. The pipeline first extracts basic information, the abstract, and the structured full text from each paper. The abstract and vital sections, such as the results and discussion, are input to the finding identification module, which identifies multiple findings from a paper, achieving 80% accuracy in the multiple-findings evaluation. The full text is input to the citation sentence extraction module to identify inline citation sentences and citation markers, achieving 97.7% accuracy. The graph is then constructed from the outputs of these two modules. Using this pipeline on Europe PMC, we built a graph with 14.25 million nodes and 76 million edges.

pdf bib
Assessing the Capabilities of Large Language Models in Coreference: An Evaluation
Yujian Gan | Massimo Poesio | Juntao Yu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper offers a nuanced examination of the role Large Language Models (LLMs) play in coreference resolution, aimed at guiding the future direction in the era of LLMs. We carried out both manual and automatic analyses of different LLMs’ abilities, employing different prompts to examine the performance of different LLMs, obtaining a comprehensive view of their strengths and weaknesses. We found that LLMs show exceptional ability in understanding coreference. However, harnessing this ability to achieve state of the art results on traditional datasets and benchmarks isn’t straightforward. Given these findings, we propose that future efforts should: (1) Improve the scope, data, and evaluation methods of traditional coreference research to adapt to the development of LLMs. (2) Enhance the fine-grained language understanding capabilities of LLMs.

pdf bib
Conceptual Pacts for Reference Resolution Using Small, Dynamically Constructed Language Models: A Study in Puzzle Building Dialogues
Julian Hough | Sina Zarrieß | Casey Kennington | David Schlangen | Massimo Poesio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Using Brennan and Clark’s theory of a Conceptual Pact, that when interlocutors agree on a name for an object, they are forming a temporary agreement on how to conceptualize that object, we present an extension to a simple reference resolver which simulates this process over time with different conversation pairs. In a puzzle construction domain, we model pacts with small language models for each referent which update during the interaction. When features from these pact models are incorporated into a simple bag-of-words reference resolver, the accuracy increases compared to using a standard pre-trained model. The model performs equally to a competitor using the same data but with exhaustive re-training after each prediction, while also being more transparent, faster and less resource-intensive. We also experiment with reducing the number of training interactions, and can still achieve reference resolution accuracies of over 80% in testing from observing a single previous interaction, over 20% higher than a pre-trained baseline. While this is a limited domain, we argue the model could be applicable to larger real-world applications in human and human-robot interaction and is an interpretable and transparent model.

pdf bib
Universal Anaphora: The First Three Years
Massimo Poesio | Maciej Ogrodniczuk | Vincent Ng | Sameer Pradhan | Juntao Yu | Nafise Sadat Moosavi | Silviu Paun | Amir Zeldes | Anna Nedoluzhko | Michal Novák | Martin Popel | Zdeněk Žabokrtský | Daniel Zeman
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The aim of the Universal Anaphora initiative is to push forward the state of the art in anaphora and anaphora resolution by expanding the aspects of anaphoric interpretation which are or can be reliably annotated in anaphoric corpora, producing unified standards to annotate and encode these annotations, delivering datasets encoded according to these standards, and developing methods for evaluating models that carry out this type of interpretation. Although several papers on aspects of the initiative have appeared, no overall description of the initiative’s goals, proposals and achievements has been published yet except as an online draft. This paper aims to fill this gap, as well as to discuss its progress so far.

2023

pdf bib
SemEval-2023 Task 11: Learning with Disagreements (LeWiDi)
Elisa Leonardelli | Gavin Abercrombie | Dina Almanea | Valerio Basile | Tommaso Fornaciari | Barbara Plank | Verena Rieser | Alexandra Uma | Massimo Poesio
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

NLP datasets annotated with human judgments are rife with disagreements between the judges. This is especially true for tasks depending on subjective judgments such as sentiment analysis or offensive language detection. Particularly in these latter cases, the NLP community has come to realize that the common approach of ‘reconciling’ these different subjective interpretations risks misrepresenting the evidence. Many NLP researchers have therefore concluded that rather than eliminating disagreements from annotated corpora, we should preserve them; indeed, some argue that corpora should aim to preserve all interpretations produced by annotators. But this approach to corpus creation for NLP has not yet been widely accepted. The objective of the Le-Wi-Di series of shared tasks is to promote this approach to developing NLP models by providing a unified framework for training and evaluating with such datasets. We report on the second such shared task, which differs from the first edition in three crucial respects: (i) it focuses entirely on NLP, instead of both NLP and computer vision tasks as in its first edition; (ii) it focuses on subjective tasks, instead of covering different types of disagreements, as training with aggregated labels for subjective NLP tasks is in effect a misrepresentation of the data; and (iii) for the evaluation, we concentrated on soft approaches. This second edition of Le-Wi-Di attracted a wide array of participants, resulting in 13 shared task submission papers.

pdf bib
Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
Juntao Yu | Silviu Paun | Maris Camilleri | Paloma Garcia | Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Although several datasets annotated for anaphoric reference / coreference exist, even the largest such datasets have limitations in terms of size, range of domains, coverage of anaphoric phenomena, and size of documents included. Yet the approaches proposed to scale up anaphoric annotation haven’t so far resulted in datasets overcoming these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference, due in part to substantial activity by the players and in part to the use of a new resolve-and-aggregate paradigm to ‘complete’ markable annotations through the combination of an anaphoric resolver and an aggregation method for anaphoric reference. The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no comparably sized datasets exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents (>2K in length).

pdf bib
Proceedings of The Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023)
Maciej Ogrodniczuk | Vincent Ng | Sameer Pradhan | Massimo Poesio
Proceedings of The Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023)

pdf bib
Data Augmentation for Fake Reviews Detection
Ming Liu | Massimo Poesio
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

In this research, we studied the relationship between data augmentation and model accuracy for the task of fake review detection. We used data generation methods to augment two different fake review datasets and compared the performance of models trained with the original data and with the augmented data. Our results show that the accuracy of our fake review detection model can be improved by 0.31 percentage points on DeRev Test and by 7.65 percentage points on Amazon Test by using the augmented datasets.

pdf bib
The Universal Anaphora Scorer 2.0
Juntao Yu | Michal Novák | Abdulrahman Aloraini | Nafise Sadat Moosavi | Silviu Paun | Sameer Pradhan | Massimo Poesio
Proceedings of the 15th International Conference on Computational Semantics

The aim of the Universal Anaphora initiative is to push forward the state of the art both in anaphora (coreference) annotation and in the evaluation of models for anaphora resolution. The first release of the Universal Anaphora Scorer (Yu et al., 2022b) supported the scoring not only of identity anaphora as in the Reference Coreference Scorer (Pradhan et al., 2014) but also of split antecedent anaphoric reference, bridging references, and discourse deixis. That scorer was used in the CODI-CRAC 2021/2022 Shared Tasks on Anaphora Resolution in Dialogues (Khosla et al., 2021; Yu et al., 2022a). A modified version of the scorer supporting discontinuous markables and the COREFUD markup format was also used in the CRAC 2022 Shared Task on Multilingual Coreference Resolution (Zabokrtsky et al., 2022). In this paper, we introduce the second release of the scorer, merging the two previous versions, which can score reference with discontinuous markables and zero anaphora resolution.

2022

pdf bib
Hard and Soft Evaluation of NLP models with BOOtSTrap SAmpling - BooStSa
Tommaso Fornaciari | Alexandra Uma | Massimo Poesio | Dirk Hovy
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

The applied nature of Natural Language Processing (NLP) makes it necessary to select the most effective and robust models. Producing slightly higher performance is insufficient; we want to know whether this advantage will carry over to other datasets. Bootstrapped significance tests can indicate that ability. While necessary, computing the significance of models’ performance differences has many levels of complexity, and it can be tedious, especially when the experimental design has many conditions to compare and several runs of experiments. We present BooStSa, a tool that makes it easy to compute significance levels with the BOOtSTrap SAmpling procedure to evaluate models that predict not only standard hard labels but also soft labels (i.e., probability distributions over different classes).
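
The core procedure behind bootstrap significance testing of this kind can be sketched as a generic paired bootstrap; the function below is an illustrative reconstruction with hypothetical score lists, not BooStSa's actual API.

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=5000, seed=0):
    """Generic paired bootstrap test (illustrative, not BooStSa's API).

    scores_a / scores_b hold one score per test item for each system,
    e.g. 1.0 for a correct prediction and 0.0 otherwise.  Returns the
    fraction of resampled test sets on which system B fails to beat
    system A -- a small value suggests B's advantage is robust."""
    rng = random.Random(seed)
    n = len(scores_a)
    failures = 0
    for _ in range(n_boot):
        # Resample test items with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_b[i] - scores_a[i] for i in idx)
        if delta <= 0:
            failures += 1
    return failures / n_boot

# Hypothetical per-item accuracies for two systems on a 100-item test set.
system_a = [1.0] * 60 + [0.0] * 40   # 60% accurate
system_b = [1.0] * 80 + [0.0] * 20   # 80% accurate
p = paired_bootstrap_pvalue(system_a, system_b)
print(f"p = {p:.4f}")
```

BooStSa builds on this basic idea, adding support for soft labels and for experimental designs with many conditions and runs.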

pdf bib
Less Text, More Visuals: Evaluating the Onboarding Phase in a GWAP for NLP
Fatima Althani | Chris Madge | Massimo Poesio
Proceedings of the 9th Workshop on Games and Natural Language Processing within the 13th Language Resources and Evaluation Conference

Games-with-a-purpose find attracting players a challenge. To improve player recruitment, we explored two game design elements that can increase player engagement during the onboarding phase: a narrative and a tutorial. In a qualitative study with 12 players of linguistic and language-learning games, we examined the effect of presentation format on players’ engagement. Our reflexive thematic analysis found that in the onboarding phase of a GWAP for NLP, presenting players with visuals is expected and presenting too much text overwhelms them. Furthermore, players found that the instructions they were presented with lacked linguistic context. Additionally, the tutorial and game interface required refinement, as the feedback was unsupportive and the graphics were not clear.

pdf bib
Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Juntao Yu | Sopan Khosla | Ramesh Manuvinakurike | Lori Levin | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rose
Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

pdf bib
The CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Juntao Yu | Sopan Khosla | Ramesh Manuvinakurike | Lori Levin | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

The CODI-CRAC 2022 Shared Task on Anaphora Resolution in Dialogues is the second edition of an initiative focused on detecting different types of anaphoric relations in conversations of different kinds. Using five conversational datasets, four of which have been newly annotated with a wide range of anaphoric relations: identity, bridging references and discourse deixis, we defined multiple tasks focusing individually on these key relations. The second edition of the shared task maintained the focus on these relations and used the same datasets as in 2021, but new test data were annotated, the 2021 data were checked, and new subtasks were added. In this paper, we discuss the annotation schemes, the datasets, the evaluation scripts used to assess the system performance on these tasks, and provide a brief summary of the participating systems and the results obtained across 230 runs from three teams, with most submissions achieving significantly better results than our baseline methods.

pdf bib
Joint Coreference Resolution for Zeros and non-Zeros in Arabic
Abdulrahman Aloraini | Sameer Pradhan | Massimo Poesio
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Most existing proposals about anaphoric zero pronoun (AZP) resolution regard full mention coreference and AZP resolution as two independent tasks, even though the two tasks are clearly related. The main issues that need tackling to develop a joint model for zero and non-zero mentions are the difference between the two types of arguments (zero pronouns, being null, provide no nominal information) and the lack of annotated datasets of a suitable size in which both types of arguments are annotated for languages other than Chinese and Japanese. In this paper, we introduce two architectures for jointly resolving AZPs and non-AZPs, and evaluate them on Arabic, a language for which, as far as we know, there has been no prior work on joint resolution. Doing this also required creating a new version of the Arabic subset of the standard coreference resolution dataset used for the CoNLL-2012 shared task (Pradhan et al., 2012) in which both zeros and non-zeros are included in a single dataset.

pdf bib
Coreference Annotation of an Arabic Corpus using a Virtual World Game
Wateen Abdullah Aliady | Abdulrahman Aloraini | Christopher Madge | Juntao Yu | Richard Bartle | Massimo Poesio
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Coreference resolution is a key aspect of text comprehension, but the size of the available coreference corpora for Arabic is limited in comparison to the size of the corpora for other languages. In this paper we present a Game-With-A-Purpose called Stroll with a Scroll created to collect from players coreference annotations for Arabic. The key contribution of this work is the embedding of the annotation task in a virtual world setting, as opposed to the puzzle-type games used in previously proposed Games-With-A-Purpose for coreference.

pdf bib
ArMIS - The Arabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements
Dina Almanea | Massimo Poesio
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The use of misogynistic and sexist language has increased in recent years on social media, and is increasing in the Arab world in reaction to reforms attempting to remove restrictions on women’s lives. However, there are few benchmarks for Arabic misogyny and sexism detection, and in those the annotations are in aggregated form even though misogyny and sexism judgments have been found to be highly subjective. In this paper we introduce an Arabic misogyny and sexism dataset (ArMIS) characterized by providing annotations from annotators with different degrees of religious belief, and provide evidence that such differences do result in disagreements. To the best of our knowledge, this is the first dataset to study in detail the effect of beliefs on misogyny and sexism annotation. We also discuss proof-of-concept experiments showing that a dataset in which disagreements have not been reconciled can be used to train state-of-the-art models for misogyny and sexism detection, and consider different ways in which such models could be evaluated.

pdf bib
The Universal Anaphora Scorer
Juntao Yu | Sopan Khosla | Nafise Sadat Moosavi | Silviu Paun | Sameer Pradhan | Massimo Poesio
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The aim of the Universal Anaphora initiative is to push forward the state of the art in anaphora and anaphora resolution by expanding the aspects of anaphoric interpretation which are or can be reliably annotated in anaphoric corpora, producing unified standards to annotate and encode these annotations, delivering datasets encoded according to these standards, and developing methods for evaluating models carrying out this type of interpretation. Such expansion of the scope of anaphora resolution requires a comparable expansion of the scope of the scorers used to evaluate this work. In this paper, we introduce an extended version of the Reference Coreference Scorer (Pradhan et al., 2014) that can be used to evaluate the extended range of anaphoric interpretation included in the current Universal Anaphora proposal. The UA scorer supports the evaluation of identity anaphora resolution and of bridging reference resolution, for which scorers already existed but were not integrated in a single package. It also supports the evaluation of split antecedent anaphora and discourse deixis, for which no tools existed. The proposed approach to the evaluation of split antecedent anaphora is entirely novel; the proposed approach to the evaluation of discourse deixis leverages the encoding of discourse deixis proposed in Universal Anaphora to enable the use for discourse deixis of the same metrics already used for identity anaphora. The scorer was tested in the recent CODI-CRAC 2021 Shared Task on Anaphora Resolution in Dialogues.

pdf bib
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Sameer Pradhan | Anna Nedoluzhko | Vincent Ng | Massimo Poesio
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

2021

pdf bib
SemEval-2021 Task 12: Learning with Disagreements
Alexandra Uma | Tommaso Fornaciari | Anca Dumitrache | Tristan Miller | Jon Chamberlain | Barbara Plank | Edwin Simpson | Massimo Poesio
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Disagreement between coders is ubiquitous in virtually all datasets annotated with human judgements in both natural language processing and computer vision. However, most supervised machine learning methods assume that a single preferred interpretation exists for each item, which is at best an idealization. The aim of the SemEval-2021 shared task on learning with disagreements (Le-Wi-Di) was to provide a unified testing framework for methods for learning from data containing multiple and possibly contradictory annotations covering the best-known datasets containing information about disagreements for interpreting language and classifying images. In this paper we describe the shared task and its results.

pdf bib
Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning
Tommaso Fornaciari | Alexandra Uma | Silviu Paun | Barbara Plank | Dirk Hovy | Massimo Poesio
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Supervised learning assumes that a ground truth label exists. However, the reliability of this ground truth depends on human annotators, who often disagree. Prior work has shown that this disagreement can be helpful in training models. We propose a novel method to incorporate this disagreement as information: in addition to the standard error computation, we use soft-labels (i.e., probability distributions over the annotator labels) as an auxiliary task in a multi-task neural network. We measure the divergence between the predictions and the target soft-labels with several loss-functions and evaluate the models on various NLP tasks. We find that the soft-label prediction auxiliary task reduces the penalty for errors on ambiguous entities, and thereby mitigates overfitting. It significantly improves performance across tasks, beyond the standard approach and prior work.
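The core idea of the paper above, a standard hard-label loss plus a soft-label auxiliary loss over the annotator label distribution, can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the KL formulation, the 0.5 auxiliary-task weight, and the toy vote counts are assumptions.

```python
import numpy as np

def soft_label_losses(logits, hard_label, soft_target):
    """Two-headed multi-task losses: cross-entropy on the hard (gold)
    label, plus a KL divergence between the predicted distribution
    and the soft label derived from annotator votes."""
    # softmax over the logits
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    # main task: cross-entropy against the single hard label
    ce = -float(np.log(probs[hard_label]))
    # auxiliary task: KL(soft_target || probs)
    eps = 1e-12
    kl = float(np.sum(soft_target * np.log((soft_target + eps) / (probs + eps))))
    return ce, kl

# e.g. three annotators cast votes [2, 1, 0] over three classes
votes = np.array([2.0, 1.0, 0.0])
soft = votes / votes.sum()                 # soft label: [2/3, 1/3, 0]
ce, kl = soft_label_losses(np.array([2.0, 1.0, 0.1]),
                           hard_label=0, soft_target=soft)
total = ce + 0.5 * kl                      # weighted multi-task objective
```

In a neural setting both terms would be backpropagated jointly; here they are only computed to show how the soft labels enter the objective.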

pdf bib
Stay Together: A System for Single and Split-antecedent Anaphora Resolution
Juntao Yu | Nafise Sadat Moosavi | Silviu Paun | Massimo Poesio
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The state-of-the-art on basic, single-antecedent anaphora has greatly improved in recent years. Researchers have therefore started to pay more attention to more complex cases of anaphora such as split-antecedent anaphora, as in “Time-Warner is considering a legal challenge to Telecommunications Inc’s plan to buy half of Showtime Networks Inc–a move that could lead to all-out war between the two powerful companies”. Split-antecedent anaphora is rarer and more complex to resolve than single-antecedent anaphora; as a result, it is not annotated in many datasets designed to test coreference, and previous work on resolving this type of anaphora was carried out in unrealistic conditions that assume gold mentions and/or gold split-antecedent anaphors are available. These systems also focus on split-antecedent anaphors only. In this work, we introduce a system that resolves both single and split-antecedent anaphors, and evaluate it in a more realistic setting that uses predicted mentions. We also start addressing the question of how to evaluate single and split-antecedent anaphors together using standard coreference evaluation metrics.

pdf bib
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Sopan Khosla | Ramesh Manuvinakurike | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

pdf bib
The CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Sopan Khosla | Juntao Yu | Ramesh Manuvinakurike | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

In this paper, we provide an overview of the CODI-CRAC 2021 Shared Task: Anaphora Resolution in Dialogue. The shared task focuses on detecting anaphoric relations in different genres of conversations. Using five conversational datasets, four of which have been newly annotated with a wide range of anaphoric relations (identity, bridging references, and discourse deixis), we defined multiple subtasks, each focusing on one of these key relations. We discuss the evaluation scripts used to assess system performance on these subtasks, and provide a brief summary of the participating systems and the results obtained across ?? runs from 5 teams, with most submissions achieving significantly better results than our baseline methods.

pdf bib
BERTective: Language Models and Contextual Information for Deception Detection
Tommaso Fornaciari | Federico Bianchi | Massimo Poesio | Dirk Hovy
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Spotting a lie is challenging but has an enormous potential impact on security as well as private and public safety. Several NLP methods have been proposed to classify texts as truthful or deceptive. In most cases, however, the target texts’ preceding context is not considered. This is a severe limitation, as any communication takes place in context, not in a vacuum, and context can help to detect deception. We study a corpus of Italian dialogues containing deceptive statements and implement deep neural models that incorporate various linguistic contexts. We establish a new state-of-the-art identifying deception and find that not all context is equally useful to the task. Only the texts closest to the target, if from the same speaker (rather than questions by an interlocutor), boost performance. We also find that the semantic information in language models such as BERT contributes to the performance. However, BERT alone does not capture the implicit knowledge of deception cues: its contribution is conditional on the concurrent use of attention to learn cues from BERT’s representations.

pdf bib
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Sameer Pradhan | Massimo Poesio | Yulia Grishina | Vincent Ng
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

pdf bib
Coreference Resolution for the Biomedical Domain: A Survey
Pengcheng Lu | Massimo Poesio
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

Issues with coreference resolution are one of the most frequently mentioned challenges for information extraction from the biomedical literature. Thus, the biomedical genre has long been the second most researched genre for coreference resolution after the news domain, and the subject of a great deal of research for NLP in general. In recent years this interest has grown enormously, leading to the development of a number of substantial datasets, of domain-specific contextual language models, and of several architectures. In this paper we review the state of the art of coreference resolution in the biomedical domain, with particular attention to these most recent developments.

pdf bib
Data Augmentation Methods for Anaphoric Zero Pronouns
Abdulrahman Aloraini | Massimo Poesio
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

In pro-drop languages like Arabic, Chinese, Italian, Japanese, Spanish, and many others, unrealized (null) arguments in certain syntactic positions can refer to a previously introduced entity, and are thus called anaphoric zero pronouns. The existing resources for studying anaphoric zero pronoun interpretation are, however, still limited. In this paper, we use five data augmentation methods to generate and detect anaphoric zero pronouns automatically. We use the augmented data as additional training material for two anaphoric zero pronoun systems for Arabic. Our experimental results show that data augmentation improves the performance of the two systems, surpassing the state-of-the-art results.

pdf bib
Patterns of Polysemy and Homonymy in Contextualised Language Models
Janosch Haber | Massimo Poesio
Findings of the Association for Computational Linguistics: EMNLP 2021

One of the central aspects of contextualised language models is that they should be able to distinguish the meaning of lexically ambiguous words by their contexts. In this paper we investigate the extent to which the contextualised embeddings of word forms that display multiplicity of sense reflect traditional distinctions of polysemy and homonymy. To this end, we introduce an extended, human-annotated dataset of graded word sense similarity and co-predication acceptability, and evaluate how well the similarity of embeddings predicts similarity in meaning. Both types of human judgements indicate that the similarity of polysemic interpretations falls in a continuum between identity of meaning and homonymy. However, we also observe significant differences within the similarity ratings of polysemes, forming consistent patterns for different types of polysemic sense alternation. Our dataset thus appears to capture a substantial part of the complexity of lexical ambiguity, and can provide a realistic test bed for contextualised embeddings. Among the tested models, BERT Large shows the strongest correlation with the collected word sense similarity ratings, but struggles to consistently replicate the observed similarity patterns. When clustering ambiguous word forms based on their embeddings, the model displays high confidence in discerning homonyms and some types of polysemic alternations, but consistently fails for others.

pdf bib
We Need to Consider Disagreement in Evaluation
Valerio Basile | Michael Fell | Tommaso Fornaciari | Dirk Hovy | Silviu Paun | Barbara Plank | Massimo Poesio | Alexandra Uma
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

Evaluation is of paramount importance in data-driven research fields such as Natural Language Processing (NLP) and Computer Vision (CV). Current evaluation practice largely hinges on the existence of a single “ground truth” against which we can meaningfully compare the prediction of a model. However, this comparison is flawed for two reasons. 1) In many cases, more than one answer is correct. 2) Even where there is a single answer, disagreement among annotators is ubiquitous, making it difficult to decide on a gold standard. We argue that the current methods of adjudication, agreement, and evaluation need serious reconsideration. Some researchers now propose to minimize disagreement and to fix datasets. We argue that this is a gross oversimplification, and likely to conceal the underlying complexity. Instead, we suggest that we need to better capture the sources of disagreement to improve today’s evaluation practice. We discuss three sources of disagreement: from the annotator, the data, and the context, and show how this affects even seemingly objective tasks. Datasets with multiple annotations are becoming more common, as are methods to integrate disagreement into modeling. The logical next step is to extend this to evaluation.

2020

pdf bib
Assessing Polyseme Sense Similarity through Co-predication Acceptability and Contextualised Embedding Distance
Janosch Haber | Massimo Poesio
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Co-predication is one of the most frequently used linguistic tests to tell apart shifts in polysemic sense from changes in homonymic meaning. It is increasingly coming under criticism as evidence accumulates that it tends to mis-classify specific cases of polysemic sense alternation as homonymy. In this paper, we collect empirical data to investigate these claims. We assess how co-predication acceptability relates to explicit ratings of polyseme word sense similarity, and how well either measure can be predicted from the distance between target words’ contextualised word embeddings. We find that sense similarity appears to be a major contributor in determining co-predication acceptability, but that co-predication judgements tend to rate especially less similar sense interpretations as equally unacceptable as homonym pairs, effectively mis-classifying these instances. The tested contextualised word embeddings fail to predict word sense similarity consistently, but the similarities between BERT embeddings show a significant correlation with co-predication ratings. We take this finding as evidence that BERT embeddings might be better representations of context than encodings of word meaning.

pdf bib
Anaphoric Zero Pronoun Identification: A Multilingual Approach
Abdulrahman Aloraini | Massimo Poesio
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

Pro-drop languages such as Arabic, Chinese, Italian or Japanese allow morphologically null but referential arguments in certain syntactic positions, called anaphoric zero pronouns. Much NLP work on anaphoric zero pronouns (AZPs) is based on gold mentions, but models for their identification are a fundamental prerequisite for their resolution in real-life applications. Such identification requires complex language understanding and knowledge of real-world entities. Transfer learning models such as BERT have recently been shown to learn surface, syntactic, and semantic information, which can be very useful in recognizing AZPs. We propose a BERT-based multilingual model for AZP identification from predicted zero pronoun positions, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, this is the first neural network model of AZP identification for Arabic, and our approach outperforms the state-of-the-art for Chinese. Experimental results suggest that BERT implicitly encodes information about AZPs through their surrounding context.

pdf bib
Neural Coreference Resolution for Arabic
Abdulrahman Aloraini | Juntao Yu | Massimo Poesio
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

No neural coreference resolver for Arabic exists; in fact, we are not aware of any learning-based coreference resolver for Arabic since Björkelund and Kuhn (2014). In this paper, we introduce a coreference resolution system for Arabic based on Lee et al.’s end-to-end architecture combined with the Arabic version of BERT and an external mention detector. As far as we know, this is the first neural coreference resolution system aimed specifically at Arabic, and it substantially outperforms the existing state of the art on OntoNotes 5.0 with a gain of 15.2 CoNLL F1 points. We also discuss the current limitations of the task for Arabic and possible approaches that can tackle these challenges.

pdf bib
Multitask Learning-Based Neural Bridging Reference Resolution
Juntao Yu | Massimo Poesio
Proceedings of the 28th International Conference on Computational Linguistics

We propose a multi-task learning-based neural model for resolving bridging references that tackles two key challenges. The first challenge is the lack of large corpora annotated with bridging references. To address this, we use multi-task learning to support bridging reference resolution with coreference resolution. We show that substantial improvements of up to 8 p.p. can be achieved on full bridging resolution with this architecture. The second challenge is the different definitions of bridging used in different corpora, meaning that hand-coded systems or systems using special features designed for one corpus do not work well with other corpora. Our neural model only uses a small number of corpus-independent features and can thus be applied to different corpora. Evaluations with very different bridging corpora (ARRAU, ISNOTES, BASHI and SCICORP) suggest that our architecture works equally well on all corpora and achieves the SoTA results on full bridging resolution for all of them, outperforming the best reported results by up to 36.3 p.p.

pdf bib
Free the Plural: Unrestricted Split-Antecedent Anaphora Resolution
Juntao Yu | Nafise Sadat Moosavi | Silviu Paun | Massimo Poesio
Proceedings of the 28th International Conference on Computational Linguistics

Now that the performance of coreference resolvers on the simpler forms of anaphoric reference has greatly improved, more attention is being devoted to more complex aspects of anaphora. One limitation of virtually all coreference resolution models is the focus on single-antecedent anaphors. Plural anaphors with multiple antecedents, so-called split-antecedent anaphors (as in John met Mary. They went to the movies), have not been widely studied, because they are not annotated in ONTONOTES and are relatively infrequent in other corpora. In this paper, we introduce the first model for unrestricted resolution of split-antecedent anaphors. We start with a strong baseline enhanced by BERT embeddings, and show that we can substantially improve its performance by addressing the sparsity issue. To do this, we experiment with auxiliary corpora in which split-antecedent anaphors were annotated by the crowd, and with transfer learning models using element-of bridging references and single-antecedent coreference as auxiliary tasks. Evaluation on the gold-annotated ARRAU corpus shows that our best model, which uses a combination of three auxiliary corpora, achieved F1 scores of 70% and 43.6% when evaluated in a lenient and strict setting, respectively, i.e., gains of 11 and 21 percentage points compared with our baseline.

pdf bib
Aggregation Driven Progression System for GWAPs
Osman Doruk Kicikoglu | Richard Bartle | Jon Chamberlain | Silviu Paun | Massimo Poesio
Workshop on Games and Natural Language Processing

As the uses of Games-With-A-Purpose (GWAPs) broaden, the systems that incorporate them have grown in complexity. The types of annotations required within the NLP paradigm are one such example, where tasks can involve annotations of varying complexity. Assigning more complex tasks to more skilled players through a progression mechanism can achieve higher accuracy in the collected data while acting as a motivating factor that rewards the more skilled players. In this paper, we present the progression technique implemented in Wormingo, an NLP GWAP that currently includes two layers of task complexity. For the experiment, we implemented four different progression scenarios for 192 players and compared the accuracy and engagement achieved with each scenario.

pdf bib
Speaking Outside the Box: Exploring the Benefits of Unconstrained Input in Crowdsourcing and Citizen Science Platforms
Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

Crowdsourcing approaches present a difficult design challenge for developers. There is a trade-off between the efficiency of the task to be done and the reward given to the user for participating, whether it be altruism, social enhancement, entertainment or money. This paper explores how crowdsourcing and citizen science systems collect data and complete tasks, illustrated by a case study from the online language game-with-a-purpose Phrase Detectives. The game was originally developed with a constrained interface to prevent player collusion, but subsequently benefited from post-hoc analysis of over 76k unconstrained inputs from users. Understanding interface design and task deconstruction is critical for enabling users to participate in such systems, and the paper concludes with a discussion of the idea that social networks can be viewed as a form of citizen science platform, with both constrained and unconstrained inputs making for a highly complex dataset.

pdf bib
Neural Mention Detection
Juntao Yu | Bernd Bohnet | Massimo Poesio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Mention detection is an important preprocessing step for annotation and interpretation in applications such as NER and coreference resolution, but few stand-alone neural models able to handle the full range of mentions have been proposed. In this work, we propose and compare three neural network-based approaches to mention detection. The first approach is based on the mention detection part of a state-of-the-art coreference resolution system; the second uses ELMo embeddings together with a bidirectional LSTM and a biaffine classifier; the third approach uses the recently introduced BERT model. Our best model (using a biaffine classifier) achieves gains of up to 1.8 percentage points on mention recall when compared with a strong baseline in a HIGH RECALL coreference annotation setting. The same model achieves improvements of up to 5.3 and 6.2 p.p. when compared with the best-reported mention detection F1 on the CONLL and CRAC coreference data sets respectively in a HIGH F1 annotation setting. We then evaluate our models for coreference resolution by using mentions predicted by our best model in state-of-the-art coreference systems. The enhanced models achieved absolute improvements of up to 1.7 and 0.7 p.p. when compared with our strong baseline systems (a pipeline system and an end-to-end system) respectively. For nested NER, the evaluation of our model on the GENIA corpora shows that our model matches or outperforms state-of-the-art models despite not being specifically designed for this task.

pdf bib
A Cluster Ranking Model for Full Anaphora Resolution
Juntao Yu | Alexandra Uma | Massimo Poesio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Anaphora resolution (coreference) systems designed for the CONLL 2012 dataset typically cannot handle key aspects of the full anaphora resolution task, such as the identification of singletons and of certain types of non-referring expressions (e.g., expletives), as these aspects are not annotated in that corpus. However, the recently released dataset for the CRAC 2018 Shared Task can now be used for that purpose. In this paper, we introduce an architecture to simultaneously identify non-referring expressions (including expletives, predicative NPs, and other types) and build coreference chains, including singletons. Our cluster-ranking system uses an attention mechanism to determine the relative importance of the mentions in the same cluster. Additional classifiers are used to identify singletons and non-referring markables. Our contributions are as follows. First of all, we report the first result on the CRAC data using system mentions; our result is 5.8% better than the shared task baseline system, which used gold mentions. Second, we demonstrate that the availability of singleton clusters and non-referring expressions can lead to substantially improved performance on non-singleton clusters as well. Third, we show that despite our model not being designed specifically for the CONLL data, it achieves a score equivalent to that of the state-of-the-art system by Kantor and Globerson (2019) on that dataset.

pdf bib
Cross-lingual Zero Pronoun Resolution
Abdulrahman Aloraini | Massimo Poesio
Proceedings of the Twelfth Language Resources and Evaluation Conference

In languages like Arabic, Chinese, Italian, Japanese, Korean, Portuguese, Spanish, and many others, predicate arguments in certain syntactic positions are omitted rather than realized as overt pronouns, and are thus called zero or null pronouns. Identifying and resolving such omitted arguments is crucial to machine translation, information extraction and other NLP tasks, but depends heavily on semantic coherence and lexical relationships. We propose a BERT-based cross-lingual model for zero pronoun resolution, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, ours is the first neural model of zero-pronoun resolution for Arabic, and our model also outperforms the state of the art for Chinese. In the paper we also evaluate BERT feature-extraction and fine-tuning approaches on the task, and compare them with our model. We also report on an investigation indicating which BERT layer encodes the most suitable representation for the task.

pdf bib
Named Entity Recognition as Dependency Parsing
Juntao Yu | Bernd Bohnet | Massimo Poesio
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing, concerned with identifying spans of text expressing references to entities. NER research is often focused on flat entities only (flat NER), ignoring the fact that entity references can be nested, as in [Bank of [China]] (Finkel and Manning, 2009). In this paper, we use ideas from graph-based dependency parsing to provide our model a global view on the input via a biaffine model (Dozat and Manning, 2017). The biaffine model scores pairs of start and end tokens in a sentence, which we use to explore all spans, so that the model is able to predict named entities accurately. We show that the model works well for both nested and flat NER through evaluation on 8 corpora, achieving SoTA performance on all of them, with accuracy gains of up to 2.2 percentage points.
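The span-scoring step described in the abstract above can be illustrated with a small NumPy sketch of a biaffine layer: every (start, end) token pair gets a score per label, and spans whose best label beats a designated "no entity" label are predicted as entities. The shapes, random projections, and label indexing here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sentence of 5 tokens, hidden size 4, 3 entity labels plus "no entity"
n_tokens, hidden, n_labels = 5, 4, 4
h = rng.normal(size=(n_tokens, hidden))          # token representations

# separate start/end projections, in the spirit of biaffine parsing
W_start = rng.normal(size=(hidden, hidden))
W_end = rng.normal(size=(hidden, hidden))
U = rng.normal(size=(hidden, n_labels, hidden))  # biaffine tensor

hs = h @ W_start                                 # start representations
he = h @ W_end                                   # end representations

# score every (start, end, label) triple: scores[i, j, l] = hs[i] · U[:, l, :] · he[j]
scores = np.einsum('ia,alb,jb->ijl', hs, U, he)

# a span (i, j) with i <= j is predicted as an entity if its best
# label is not the "no entity" label (index 0 in this sketch)
best = scores.argmax(axis=-1)
spans = [(i, j, best[i, j]) for i in range(n_tokens)
         for j in range(i, n_tokens) if best[i, j] != 0]
```

With random weights the predicted spans are meaningless; the point is only the scoring geometry: one biaffine product yields a full (start, end, label) score tensor in a single step.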

pdf bib
Word Sense Distance in Human Similarity Judgements and Contextualised Word Embeddings
Janosch Haber | Massimo Poesio
Proceedings of the Probability and Meaning Conference (PaM 2020)

Homonymy is often used to showcase one of the advantages of context-sensitive word embedding techniques such as ELMo and BERT. In this paper we want to shift the focus to the related but less exhaustively explored phenomenon of polysemy, where a word expresses various distinct but related senses in different contexts. Specifically, we aim to i) investigate a recent model of polyseme sense clustering proposed by Ortega-Andres & Vicente (2019) through analysing empirical evidence of word sense grouping in human similarity judgements, ii) extend the evaluation of context-sensitive word embedding systems by examining whether they encode differences in word sense similarity and iii) compare the word sense similarities of both methods to assess their correlation and gain some intuition as to how well contextualised word embeddings could be used as surrogate word sense similarity judgements in linguistic experiments.

pdf bib
Polygloss - A conversational agent for language practice
Etiene da Cruz Dalcol | Massimo Poesio
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning

pdf bib
The QMUL/HRBDT contribution to the NADI Arabic Dialect Identification Shared Task
Abdulrahman Aloraini | Massimo Poesio | Ayman Alhelbawy
Proceedings of the Fifth Arabic Natural Language Processing Workshop

We present the Arabic dialect identification system that we used for the country-level subtask of the NADI challenge. Our model consists of three components: BiLSTM-CNN, character-level TF-IDF, and topic modeling features. We represent each tweet using these features and feed them into a deep neural network. We then add an effective heuristic that improves the overall performance. We achieved an F1-Macro score of 20.77% and an accuracy of 34.32% on the test set. The model was also evaluated on the Arabic Online Commentary dataset, achieving results better than the state-of-the-art.
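The character-level TF-IDF component mentioned above can be sketched as follows. The n-gram size and the toy documents are placeholders (real input would be Arabic tweets), and this is not the system's actual feature extractor.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3):
    """Character-level TF-IDF: one sparse dict per document."""
    grams = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in grams for g in c)   # document frequency per n-gram
    N = len(docs)
    vecs = []
    for c in grams:
        total = sum(c.values())
        vecs.append({g: (tf / total) * math.log(N / df[g])
                     for g, tf in c.items()})
    return vecs

# hypothetical romanised snippets standing in for dialectal tweets
docs = ["shlonak shlonak", "kifak ya zalameh", "shlonak ya zalameh"]
vecs = tfidf_vectors(docs)
```

An n-gram occurring in every document gets IDF log(N/N) = 0 and so carries no weight, while dialect-specific character sequences are up-weighted; in the system these weights would be one feature group fed into the neural network alongside the BiLSTM-CNN and topic features.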

2019

pdf bib
Crowdsourcing and Aggregating Nested Markable Annotations
Chris Madge | Juntao Yu | Jon Chamberlain | Udo Kruschwitz | Silviu Paun | Massimo Poesio
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

One of the key steps in language resource creation is the identification of the text segments to be annotated, or markables, which depending on the task may vary from nominal chunks for named entity resolution to (potentially nested) noun phrases in coreference resolution (or mentions) to larger text segments in text segmentation. Markable identification is typically carried out semi-automatically, by running a markable identifier and correcting its output by hand, which is increasingly done via annotators recruited through crowdsourcing and aggregating their responses. In this paper, we present a method for identifying markables for coreference annotation that combines high-performance automatic markable detectors with checking via a Game-With-A-Purpose (GWAP) and aggregation using a Bayesian annotation model. The method was evaluated both on news data and on data from a variety of other genres, and results in an improvement in F1 on mention boundaries of over seven percentage points when compared with a state-of-the-art, domain-independent automatic mention detector, and of almost three points over an in-domain mention detector. One of the key contributions of our proposal is its applicability to the case in which markables are nested, as is the case with coreference markables; but the GWAP and several of the proposed markable detectors are task- and language-independent and are thus applicable to a variety of other annotation scenarios.

pdf bib
Using Automatically Extracted Minimum Spans to Disentangle Coreference Evaluation from Boundary Detection
Nafise Sadat Moosavi | Leo Born | Massimo Poesio | Michael Strube
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The common practice in coreference resolution is to identify and evaluate the maximum span of mentions. The use of maximum spans tangles coreference evaluation with the challenges of mention boundary detection like prepositional phrase attachment. To address this problem, minimum spans are manually annotated in smaller corpora. However, this additional annotation is costly and therefore, this solution does not scale to large corpora. In this paper, we propose the MINA algorithm for automatically extracting minimum spans to benefit from minimum span evaluation in all corpora. We show that the extracted minimum spans by MINA are consistent with those that are manually annotated by experts. Our experiments show that using minimum spans is in particular important in cross-dataset coreference evaluation, in which detected mention boundaries are noisier due to domain shift. We have integrated MINA into https://github.com/ns-moosavi/coval for reporting standard coreference scores based on both maximum and automatically detected minimum spans.

pdf bib
A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation
Massimo Poesio | Jon Chamberlain | Silviu Paun | Juntao Yu | Alexandra Uma | Udo Kruschwitz
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a corpus of anaphoric information (coreference) crowdsourced through a game-with-a-purpose. The corpus, containing annotations for about 108,000 markables, is one of the largest corpora for coreference for English, and one of the largest crowdsourced NLP corpora, but its main feature is the large number of judgments per markable: 20 on average, and over 2.2M in total. This characteristic makes the corpus a unique resource for the study of disagreements on anaphoric interpretation. A second distinctive feature is its rich annotation scheme, covering singletons, expletives, and split-antecedent plurals. Finally, the corpus also comes with labels inferred using a recently proposed probabilistic model of annotation for coreference. The labels are of high quality and make it possible to successfully train a state of the art coreference resolver, including training on singletons and non-referring expressions. The annotation model can also result in more than one label, or no label, being proposed for a markable, thus serving as a baseline method for automatically identifying ambiguous markables. A preliminary analysis of the results is presented.

2018

pdf bib
Comparing Bayesian Models of Annotation
Silviu Paun | Bob Carpenter | Jon Chamberlain | Dirk Hovy | Udo Kruschwitz | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 6

The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item difficulties and error patterns. Traditionally, majority voting was used for (1), and coefficients of agreement for (2) and (3). Lately, model-based analyses of corpus annotations have proven better at all three tasks. But there has been relatively little work comparing them on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.
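The contrast between majority voting and model-based aggregation can be illustrated with a deliberately simplified sketch: estimate each annotator's accuracy against an initial majority vote, then re-vote with accuracy weights. This is a toy, single-step approximation of the model-based idea, not any of the six Bayesian models compared in the paper.

```python
from collections import Counter, defaultdict

def majority_vote(annotations):
    """annotations: {item: {annotator: label}} -> {item: label}"""
    return {item: Counter(labs.values()).most_common(1)[0][0]
            for item, labs in annotations.items()}

def weighted_vote(annotations):
    """One illustrative re-estimation step: score each annotator by
    agreement with the majority vote, then re-vote weighting each
    label by its annotator's estimated accuracy."""
    majority = majority_vote(annotations)
    hits, totals = defaultdict(int), defaultdict(int)
    for item, labs in annotations.items():
        for ann, label in labs.items():
            totals[ann] += 1
            hits[ann] += (label == majority[item])
    acc = {ann: hits[ann] / totals[ann] for ann in totals}
    result = {}
    for item, labs in annotations.items():
        scores = defaultdict(float)
        for ann, label in labs.items():
            scores[label] += acc[ann]
        result[item] = max(scores, key=scores.get)
    return result

# annotator "c" answers "x" regardless of the item (a spammy annotator);
# on i4 the plain majority is a 1-1 tie, but the weighted vote sides
# with the annotator whose estimated accuracy is higher
data = {
    "i1": {"a": "y", "b": "y", "c": "x"},
    "i2": {"a": "z", "b": "z", "c": "x"},
    "i3": {"a": "y", "b": "y", "c": "x"},
    "i4": {"a": "y", "c": "x"},
}
labels = weighted_vote(data)   # i4 resolves to "y"
```

Full model-based approaches additionally infer per-label biases and item difficulties, and iterate these estimates jointly rather than in a single pass.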

pdf bib
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference
Massimo Poesio | Vincent Ng | Maciej Ogrodniczuk
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

pdf bib
Anaphora Resolution with the ARRAU Corpus
Massimo Poesio | Yulia Grishina | Varada Kolhatkar | Nafise Moosavi | Ina Roesiger | Adam Roussel | Fabian Simonjetz | Alexandra Uma | Olga Uryupina | Juntao Yu | Heike Zinsmeister
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

The ARRAU corpus is an anaphorically annotated corpus of English providing rich linguistic information about anaphora resolution. The most distinctive feature of the corpus is the annotation of a wide range of anaphoric relations, including bridging references and discourse deixis in addition to identity (coreference). Other distinctive features include treating all NPs as markables, including non-referring NPs, and the annotation of a variety of morphosyntactic and semantic mention and entity attributes, including the genericity status of the entities referred to by markables. The corpus, however, has not been extensively used for anaphora resolution research so far. In this paper, we discuss three datasets extracted from the ARRAU corpus to support the three subtasks of the CRAC 2018 Shared Task: identity anaphora resolution over ARRAU-style markables, bridging reference resolution, and discourse deixis. We also describe the evaluation scripts assessing system performance on those datasets, and report preliminary results on these three tasks that may serve as a baseline for subsequent research on these phenomena.

pdf bib
A Probabilistic Annotation Model for Crowdsourcing Coreference
Silviu Paun | Jon Chamberlain | Udo Kruschwitz | Juntao Yu | Massimo Poesio
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The availability of large-scale annotated corpora for coreference is essential to the development of the field. However, creating resources at the required scale via expert annotation would be too expensive. Crowdsourcing has been proposed as an alternative, but this approach has not been widely used for coreference. This paper addresses one crucial hurdle toward making this possible by introducing a new model of annotation for aggregating crowdsourced anaphoric annotations. The model is evaluated along three dimensions: the accuracy of the inferred mention pairs, the quality of the post-hoc constructed silver chains, and the viability of using the silver chains as an alternative to the expert-annotated chains in training a state-of-the-art coreference system. The results suggest that our model can extract from crowdsourced annotations coreference chains of comparable quality to those obtained with expert annotation.

2017

pdf bib
Visually Grounded and Textual Semantic Models Differentially Decode Brain Activity Associated with Concrete and Abstract Nouns
Andrew J. Anderson | Douwe Kiela | Stephen Clark | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 5

Important advances have recently been made using computational semantic models to decode brain activity patterns associated with concepts; however, this work has almost exclusively focused on concrete nouns. How well these models extend to decoding abstract nouns is largely unknown. We address this question by applying state-of-the-art computational models to decode functional Magnetic Resonance Imaging (fMRI) activity patterns, elicited by participants reading and imagining a diverse set of both concrete and abstract nouns. One of the models we use is linguistic, exploiting the recent word2vec skipgram approach trained on Wikipedia. The second is visually grounded, using deep convolutional neural networks trained on Google Images. Dual coding theory considers concrete concepts to be encoded in the brain both linguistically and visually, and abstract concepts only linguistically. Splitting the fMRI data according to human concreteness ratings, we indeed observe that both models significantly decode the most concrete nouns; however, accuracy is significantly greater using the text-based models for the most abstract nouns. More generally this confirms that current computational models are sufficiently advanced to assist in investigating the representational structure of abstract concepts in the brain.

pdf bib
Incongruent Headlines: Yet Another Way to Mislead Your Readers
Sophie Chesney | Maria Liakata | Massimo Poesio | Matthew Purver
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

This paper discusses the problem of incongruent headlines: those which do not accurately represent the information contained in the article with which they occur. We emphasise that this phenomenon should be considered separately from recognised problematic headline types such as clickbait and sensationalism, arguing that existing natural language processing (NLP) methods applied to these related concepts are not appropriate for the automatic detection of headline incongruence, as an analysis beyond stylistic traits is necessary. We therefore suggest a number of alternative methodologies that may be appropriate to the task at hand as a foundation for future work in this area. In addition, we provide an analysis of existing data sets which are related to this work, and motivate the need for a novel data set in this domain.

2016

pdf bib
Coreference Resolution for the Basque Language with BART
Ander Soraluze | Olatz Arregi | Xabier Arregi | Arantza Díaz de Ilarraza | Mijail Kabadjov | Massimo Poesio
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

pdf bib
Predicting Brexit: Classifying Agreement is Better than Sentiment and Pollsters
Fabio Celli | Evgeny Stepanov | Massimo Poesio | Giuseppe Riccardi
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

On June 23rd 2016, the UK held the referendum which ratified its exit from the EU. While most of the traditional pollsters failed to forecast the final vote, there were online systems that hit the result with high accuracy using opinion mining techniques and big data. Starting one month before, we collected and monitored millions of posts about the referendum from social media conversations, and exploited Natural Language Processing techniques to predict the referendum outcome. In this paper we discuss the methods used by traditional pollsters and compare them to predictions based on different opinion mining techniques. We find that opinion mining based on agreement/disagreement classification works better than opinion mining based on polarity classification in forecasting the referendum outcome.

pdf bib
The OnForumS corpus from the Shared Task on Online Forum Summarisation at MultiLing 2015
Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio | Josef Steinberger | Jorge Valderrama | Hugo Zaragoza
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present the OnForumS corpus developed for the shared task of the same name on Online Forum Summarisation (OnForumS at MultiLing’15). The corpus consists of a set of news articles with associated readers’ comments from The Guardian (English) and La Repubblica (Italian). It comes with four levels of annotation: argument structure, comment-article linking, sentiment and coreference. The first three were produced through crowdsourcing, whereas the last was produced by an experienced annotator using a mature annotation scheme. Given its annotation breadth, we believe the corpus will prove a useful resource in stimulating and furthering research in the areas of Argumentation Mining, Summarisation, Sentiment, Coreference and the interlinks therein.

pdf bib
Phrase Detectives Corpus 1.0: Crowdsourced Anaphoric Coreference
Jon Chamberlain | Massimo Poesio | Udo Kruschwitz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Natural Language Engineering tasks require large and complex annotated datasets to build more advanced models of language. Corpora are typically annotated by several experts to create a gold standard; however, there are now compelling reasons to use a non-expert crowd to annotate text, driven by cost, speed and scalability. Phrase Detectives Corpus 1.0 is an anaphorically-annotated corpus of encyclopedic and narrative text that contains a gold standard created by multiple experts, as well as a set of annotations created by a large non-expert crowd. Analysis shows very good inter-expert agreement (kappa=.88-.93) but a more variable baseline crowd agreement (kappa=.52-.96). Encyclopedic texts show less agreement (and by implication are harder to annotate) than narrative texts. The release of this corpus is intended to encourage research into the use of crowds for text annotation and the development of more advanced, probabilistic language models, in particular for anaphoric coreference.

pdf bib
ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions
Olga Uryupina | Ron Artstein | Antonella Bristot | Federica Cavicchio | Kepa Rodriguez | Massimo Poesio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a second release of the ARRAU dataset: a multi-domain corpus with thorough linguistically motivated annotation of anaphora and related phenomena. Building upon the first release almost a decade ago, a considerable effort has been invested in improving the data both quantitatively and qualitatively. Thus, we have doubled the corpus size, expanded the selection of covered phenomena to include referentiality and genericity, and designed and implemented a methodology for enforcing the consistency of the manual annotation. We believe that the new release of ARRAU provides valuable material for ongoing research on complex cases of coreference as well as for a variety of related tasks. The corpus is publicly available through LDC.

2015

pdf bib
MultiLing 2015: Multilingual Summarization of Single and Multi-Documents, On-line Fora, and Call-center Conversations
George Giannakopoulos | Jeff Kubina | John Conroy | Josef Steinberger | Benoit Favre | Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
Combining Minimally-supervised Methods for Arabic Named Entity Recognition
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 3

Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semi-supervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised methods tend to have very high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. This complementarity suggests that better results may be obtained by combining the two types of minimally supervised methods. In this paper we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We trained a semi-supervised NER classifier and another one using distant learning techniques, and then combined them using a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best base classifiers.

2014

pdf bib
AraNLP: a Java-based Library for the Processing of Arabic Text.
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a free, Java-based library named “AraNLP” that covers various Arabic text preprocessing tools. Although a good number of tools for processing Arabic text already exist, integration and compatibility problems continually occur. AraNLP is an attempt to gather most of the vital Arabic text preprocessing tools into one library that can be accessed easily, by integrating or accurately adapting existing tools and by developing new ones when required. The library includes a sentence detector, tokenizer, light stemmer, root stemmer, part-of-speech tagger (POS tagger), word segmenter, normalizer, and a punctuation and diacritic remover.

pdf bib
Identifying fake Amazon reviews as learning from crowds
Tommaso Fornaciari | Massimo Poesio
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

pdf bib
Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts
Andrew J. Anderson | Elia Bruni | Ulisse Bordignon | Massimo Poesio | Marco Baroni
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
A Semi-supervised Learning Approach to Arabic Named Entity Recognition
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

pdf bib
Adapting a State-of-the-art Anaphora Resolution System for Resource-poor Language
Utpal Sikdar | Asif Ekbal | Sriparna Saha | Olga Uryupina | Massimo Poesio
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf bib
On the Use of Homogenous Sets of Subjects in Deceptive Language Analysis
Tommaso Fornaciari | Massimo Poesio
Proceedings of the Workshop on Computational Approaches to Deception Detection

pdf bib
Annotating Archaeological Texts: An Example of Domain-Specific Annotation in the Humanities
Francesca Bonin | Fabio Cavulli | Aronne Noriller | Massimo Poesio | Egon W. Stemle
Proceedings of the Sixth Linguistic Annotation Workshop

pdf bib
BART goes multilingual: The UniTN / Essex submission to the CoNLL-2012 Shared Task
Olga Uryupina | Alessandro Moschitti | Massimo Poesio
Joint Conference on EMNLP and CoNLL - Shared Task

pdf bib
On discriminating fMRI representations of abstract WordNet taxonomic categories
Andrew Anderson | Tao Yuan | Brian Murphy | Massimo Poesio
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

pdf bib
Relational Structures and Models for Coreference Resolution
Truc-Vien T. Nguyen | Massimo Poesio
Proceedings of COLING 2012: Posters

pdf bib
DeCour: a corpus of DEceptive statements in Italian COURts
Tommaso Fornaciari | Massimo Poesio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In criminal proceedings, it is sometimes not easy to evaluate the sincerity of oral testimonies. DECOUR - DEception in COURt corpus - has been built with the aim of training models suitable to discriminate, from a stylometric point of view, between sincere and deceptive statements. DECOUR is a collection of hearings held in four Italian Courts, in which the speakers lie in front of the judge. These hearings become the object of a specific criminal proceeding for calumny or false testimony, in which the deceptiveness of the statements of the defendant is ascertained. Thanks to the final Court judgment, which points out which lies are told, each utterance of the corpus has been annotated as true, uncertain or false, according to its degree of truthfulness. Since the judgment of deceptiveness follows a judicial inquiry, the annotation has been realized with a greater degree of confidence than ever before. Moreover, in Italy this is the first corpus of deceptive texts that does not rely on ‘mock’ lies created in laboratory conditions, but has instead been collected in a natural environment.

pdf bib
Domain-specific vs. Uniform Modeling for Coreference Resolution
Olga Uryupina | Massimo Poesio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Several corpora annotated for coreference have been made available in the past decade. These resources differ with respect to their size and underlying structure: the number of domains and their similarity. Our study compares domain-specific models, learned from small heterogeneous subsets of the investigated corpora, against uniform models that utilize all the available data. We show that for knowledge-poor baseline systems, domain-specific and uniform modeling yield the same results. Systems relying on large amounts of linguistic knowledge, however, exhibit differences in their performance: with all the designed features in use, domain-specific models suffer from over-fitting, whereas with pre-selected feature sets they tend to outperform union models.

2011

pdf bib
A Cross-Lingual ILP Solution to Zero Anaphora Resolution
Ryu Iida | Massimo Poesio
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Structure-Preserving Pipelines for Digital Libraries
Massimo Poesio | Eduard Barbu | Egon Stemle | Christian Girardi
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

pdf bib
Multi-metric optimization for coreference: The UniTN / IITP / Essex submission to the 2011 CONLL Shared Task
Olga Uryupina | Sriparna Saha | Asif Ekbal | Massimo Poesio
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

pdf bib
Single and multi-objective optimization for feature selection in anaphora resolution
Sriparna Saha | Asif Ekbal | Olga Uryupina | Massimo Poesio
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf bib
SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
Marta Recasens | Lluís Màrquez | Emili Sapena | M. Antònia Martí | Mariona Taulé | Véronique Hoste | Massimo Poesio | Yannick Versley
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
BART: A Multilingual Anaphora Resolution System
Samuel Broscheit | Massimo Poesio | Simone Paolo Ponzetto | Kepa Joseba Rodriguez | Lorenza Romano | Olga Uryupina | Yannick Versley | Roberto Zanoli
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Detecting Semantic Category in Simultaneous EEG/MEG Recordings
Brian Murphy | Massimo Poesio
Proceedings of the NAACL HLT 2010 First Workshop on Computational Neurolinguistics

pdf bib
Proceedings of the Fourth Linguistic Annotation Workshop
Nianwen Xue | Massimo Poesio
Proceedings of the Fourth Linguistic Annotation Workshop

pdf bib
Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus
Kepa Joseba Rodríguez | Francesca Delogu | Yannick Versley | Egon W. Stemle | Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Live Memories corpus is an Italian corpus annotated for anaphoric relations. This annotation effort aims to contribute to two significant issues for CL research: the lack of annotated anaphoric resources for Italian and the increasing interest in the social Web. The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users' comments. It is planned to add a set of articles from local newspapers. The corpus includes manually annotated information about morphosyntactic agreement, anaphoricity, and the semantic class of the NPs. The anaphoric annotation includes discourse deixis and bridging relations, and marks cases of ambiguity with the annotation of alternative interpretations. For the annotation of the anaphoric links the corpus takes into account phenomena specific to the Italian language, such as incorporated clitics and phonetically non-realized pronouns. Reliability studies for the annotation of the mentioned phenomena, and for the annotation of anaphoric links in general, offer satisfactory results. The Wikipedia and blogs dataset will be distributed under a Creative Commons Attribution licence.

pdf bib
BabyExp: Constructing a Huge Multimodal Resource to Acquire Commonsense Knowledge Like Children Do
Massimo Poesio | Marco Baroni | Oswald Lanz | Alessandro Lenci | Alexandros Potamianos | Hinrich Schütze | Sabine Schulte im Walde | Luca Surian
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

There is by now widespread agreement that the most realistic way to construct the large-scale commonsense knowledge repositories required by natural language and artificial intelligence applications is by letting machines learn such knowledge from large quantities of data, like humans do. A lot of attention has consequently been paid to the development of increasingly sophisticated machine learning algorithms for knowledge extraction. However, the nature of the input that humans are exposed to while learning commonsense knowledge has received much less attention. The BabyExp project is collecting very dense audio and video recordings of the first 3 years of life of a baby. The corpus constructed in this way will be transcribed with automated techniques and made available to the research community. Moreover, techniques to extract commonsense conceptual knowledge incrementally from these multimodal data are also being explored within the project. The current paper describes BabyExp in general, and presents pilot studies on the feasibility of the automated audio and video transcriptions.

pdf bib
Extending BART to Provide a Coreference Resolution System for German
Samuel Broscheit | Simone Paolo Ponzetto | Yannick Versley | Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a flexible toolkit-based approach to automatic coreference resolution on German text. We start with our previous work aimed at reimplementing the system from Soon et al. (2001) for English, and extend it to duplicate a version of the state-of-the-art proposal from Klenner and Ailloud (2009). Evaluation performed on a benchmarking dataset, namely the TueBa-D/Z corpus (Hinrichs et al., 2005b), shows that machine learning based coreference resolution can be robustly performed in a language other than English.

pdf bib
Creating a Coreference Resolution System for Italian
Massimo Poesio | Olga Uryupina | Yannick Versley
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper summarizes our work on creating a full-scale coreference resolution (CR) system for Italian, using BART, an open-source modular CR toolkit initially developed for English corpora. We discuss our experiments on language-specific issues of the task. As our evaluation experiments show, a language-agnostic system (designed primarily for English) can achieve a performance level in the high forties (MUC F-score) when re-trained and tested on a new language, at least on gold mention boundaries. Compared to this level, we can improve our F-score by around 10% by introducing a small number of language-specific changes. This shows that, with a modular coreference resolution platform such as BART, one can straightforwardly develop a family of robust and reliable systems for various languages. We hope that our experiments will encourage researchers working on coreference in other languages to create their own full-scale coreference resolution systems; as we have mentioned above, at the moment such modules exist only for very few languages other than English.

2009

pdf bib
Evaluating Centering for Information Ordering Using Corpora
Nikiforos Karamanis | Chris Mellish | Massimo Poesio | Jon Oberlander
Computational Linguistics, Volume 35, Number 1, March 2009

pdf bib
Obituaries: Janet Hitzeman
Massimo Poesio | David Day | Inderjeet Mani
Computational Linguistics, Volume 35, Number 4, December 2009

pdf bib
Unsupervised Knowledge Extraction for Taxonomies of Concepts from Wikipedia
Eduard Barbu | Massimo Poesio
Proceedings of the International Conference RANLP-2009

pdf bib
EEG responds to conceptual stimuli and corpus semantics
Brian Murphy | Marco Baroni | Massimo Poesio
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Constructing an Anaphorically Annotated Corpus with Non-Experts: Assessing the Quality of Collaborative Annotations
Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

pdf bib
Play your way to an annotated corpus: Games with a purpose and anaphoric annotation
Massimo Poesio
Proceedings of the Eight International Conference on Computational Semantics

pdf bib
Interactive Gesture in Dialogue: a PTT Model
Hannes Rieser | Massimo Poesio
Proceedings of the SIGDIAL 2009 Conference

pdf bib
State-of-the-art NLP Approaches to Coreference Resolution: Theory and Practical Recipes
Simone Paolo Ponzetto | Massimo Poesio
Tutorial Abstracts of ACL-IJCNLP 2009

2008

pdf bib
Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts
Jon Chamberlain | Massimo Poesio | Udo Kruschwitz
Semantics in Text Processing. STEP 2008 Conference Proceedings

pdf bib
BART: A Modular Toolkit for Coreference Resolution
Yannick Versley | Simone Paolo Ponzetto | Massimo Poesio | Vladimir Eidelman | Alan Jern | Jason Smith | Xiaofeng Yang | Alessandro Moschitti
Proceedings of the ACL-08: HLT Demo Session

pdf bib
A Corpus for Cross-Document Co-reference
David Day | Janet Hitzeman | Michael Wick | Keith Crouch | Massimo Poesio
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes a newly created text corpus of news articles that has been annotated for cross-document co-reference. Being able to robustly resolve references to entities across document boundaries will provide a useful capability for a variety of tasks, ranging from practical information retrieval applications to challenging research in information extraction and natural language understanding. This annotated corpus is intended to encourage the development of systems that can more accurately address this problem. A manual annotation tool was developed that allowed the complete corpus to be searched for likely co-referring entity mentions. This corpus of 257K words links mentions of co-referent people, locations and organizations (subject to some additional constraints). Each of the documents had already been annotated for within-document co-reference by the LDC as part of the ACE series of evaluations. The annotation process was bootstrapped with a string-matching-based linking procedure, and we report on some initial experimentation with the data. The cross-document linking information will be made publicly available.

pdf bib
Anaphoric Annotation in the ARRAU Corpus
Massimo Poesio | Ron Artstein
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

ARRAU is a new corpus annotated for anaphoric relations, with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. The corpus contains texts from different genres: task-oriented dialogues from the Trains-91 and Trains-93 corpora, narratives from the English Pear Stories corpus, newspaper articles from the Wall Street Journal portion of the Penn Treebank, and mixed text from the GNOME corpus.

pdf bib
BART: A modular toolkit for coreference resolution
Yannick Versley | Simone Ponzetto | Massimo Poesio | Vladimir Eidelman | Alan Jern | Jason Smith | Xiaofeng Yang | Alessandro Moschitti
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Developing a full coreference system able to run all the way from raw text to semantic interpretation is a considerable engineering effort. Accordingly, there is very limited availability of off-the-shelf tools for researchers whose interests are not primarily in coreference, or others who want to concentrate on a specific aspect of the problem. We present BART, a highly modular toolkit for developing coreference applications. In the Johns Hopkins workshop on using lexical and encyclopedic knowledge for entity disambiguation, the toolkit was used to extend a reimplementation of Soon et al.’s proposal with a variety of additional syntactic and knowledge-based features, and to experiment with alternative resolution processes, preprocessing tools, and classifiers. BART has been released as open source software and is available from http://www.sfs.uni-tuebingen.de/~versley/BART

pdf bib
ANAWIKI: Creating Anaphorically Annotated Resources through Web Cooperation
Massimo Poesio | Udo Kruschwitz | Jon Chamberlain
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora, but creating such corpora by hand annotation is very expensive and time consuming; in practice, it is unfeasible to think of annotating more than one million words. However, the success of Wikipedia and other projects shows that another approach might be possible: take advantage of the willingness of Web users to contribute to collaborative resource creation. AnaWiki is a recently started project that will develop tools to allow and encourage large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (in the first instance, of a corpus annotated with information about anaphora).

pdf bib
Coreference Systems Based on Kernels Methods
Yannick Versley | Alessandro Moschitti | Massimo Poesio | Xiaofeng Yang
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Survey Article: Inter-Coder Agreement for Computational Linguistics
Ron Artstein | Massimo Poesio
Computational Linguistics, Volume 34, Number 4, December 2008

2007

pdf bib
Discovering contradicting protein-protein interactions in text
Olivia Sanchez | Massimo Poesio
Biological, translational, and clinical language processing

pdf bib
Standoff Coordination for Multi-Tool Annotation in a Dialogue Corpus
Kepa Joseba Rodríguez | Stefanie Dipper | Michael Götze | Massimo Poesio | Giuseppe Riccardi | Christian Raymond | Joanna Rabiega-Wiśniewska
Proceedings of the Linguistic Annotation Workshop

2006

pdf bib
An Anaphora Resolution-Based Anonymization Module
M. Poesio | M. A. Kabadjov | P. Goux | U. Kruschwitz | E. Bishop | L. Corti
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Growing privacy and security concerns mean there is an increasing need for data to be anonymized before being publicly released. We present a module for anonymizing references implemented as part of the SQUAD tools for specifying and testing non-proprietary means of storing and marking up data using universal (XML) standards and technologies. The tool is implemented on top of the GUITAR anaphoric resolver.

2005

pdf bib
Merging PropBank, NomBank, TimeBank, Penn Discourse Treebank and Coreference
James Pustejovsky | Adam Meyers | Martha Palmer | Massimo Poesio
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

pdf bib
The Reliability of Anaphoric Annotation, Reconsidered: Taking Ambiguity into Account
Massimo Poesio | Ron Artstein
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

pdf bib
Identifying Concept Attributes Using a Classifier
Massimo Poesio | Abdulrahman Almuhareb
Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition

pdf bib
Improving LSA-based Summarization with Anaphora Resolution
Josef Steinberger | Mijail Kabadjov | Massimo Poesio | Olivia Sanchez-Graillet
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

pdf bib
Learning to Resolve Bridging References
Massimo Poesio | Rahul Mehta | Axel Maroudas | Janet Hitzeman
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf bib
Evaluating Centering-Based Metrics of Coherence
Nikiforos Karamanis | Massimo Poesio | Chris Mellish | Jon Oberlander
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf bib
Discourse Annotation and Semantic Annotation in the GNOME corpus
Massimo Poesio
Proceedings of the Workshop on Discourse Annotation

pdf bib
Discourse-New Detectors for Definite Description Resolution: A Survey and a Preliminary Proposal
Massimo Poesio | Olga Uryupina | Renata Vieira | Mijail Alexandrov-Kabadjov | Rodrigo Goulart
Proceedings of the Conference on Reference Resolution and Its Applications

pdf bib
The MATE/GNOME Proposals for Anaphoric Annotation, Revisited
Massimo Poesio
Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004

pdf bib
Attribute-Based and Value-Based Clustering: An Evaluation
Abdulrahman Almuhareb | Massimo Poesio
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

pdf bib
Identifying Broken Plurals in Unvowelised Arabic Text
Abduelbaset Goweder | Massimo Poesio | Anne De Roeck | Jeff Reynolds
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

pdf bib
Acquiring Bayesian Networks from Text
Olivia Sanchez-Graillet | Massimo Poesio
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
A General-Purpose, Off-the-shelf Anaphora Resolution Module: Implementation and Preliminary Evaluation
Massimo Poesio | Mijail A. Kabadjov
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Centering: A Parametric Theory and Its Instantiations
Massimo Poesio | Rosemary Stevenson | Barbara Di Eugenio | Janet Hitzeman
Computational Linguistics, Volume 30, Number 3, September 2004

2003

pdf bib
Associative Descriptions and Salience: A Preliminary Investigation
Massimo Poesio
Proceedings of the 2003 EACL Workshop on The Computational Treatment of Anaphora

2002

pdf bib
Acquiring Lexical Knowledge for Anaphora Resolution
Massimo Poesio | Tomonori Ishikawa | Sabine Schulte im Walde | Renata Vieira
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

pdf bib
Corpus-based NP Modifier Generation
Hua Cheng | Massimo Poesio | Renate Henschel | Chris Mellish
Second Meeting of the North American Chapter of the Association for Computational Linguistics

2000

pdf bib
Semantic Annotation for Generation: Issues in Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms
Massimo Poesio
Proceedings of the COLING-2000 Workshop on Semantic Annotation and Intelligent Content

pdf bib
Modelling Grounding and Discourse Obligations Using Update Rules
Colin Matheson | Massimo Poesio | David Traum
1st Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms: Issues and Preliminary Results
Massimo Poesio
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

pdf bib
An Empirically-based System for Processing Definite Descriptions
Renata Vieira | Massimo Poesio
Computational Linguistics, Volume 26, Number 4, December 2000

pdf bib
Specifying the Parameters of Centering Theory: a Corpus-Based Evaluation using Text from Application-Oriented Domains
M. Poesio | H. Cheng | R. Henschel | J. Hitzeman | R. Kibble | R. Stevenson
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Pronominalization revisited
Renate Henschel | Hua Cheng | Massimo Poesio
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
Corpus-based Development and Evaluation of a System for Processing Definite Descriptions
Renata Vieira | Massimo Poesio
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

1999

pdf bib
The MATE meta-scheme for coreference in dialogues in multiple languages
M. Poesio | F. Bruneseaux | L. Romary
Towards Standards and Tools for Discourse Tagging

1998

pdf bib
A Corpus-based Investigation of Definite Description Use
Massimo Poesio | Renata Vieira
Computational Linguistics, Volume 24, Number 2, June 1998

pdf bib
Long Distance Pronominalisation and Global Focus
Janet Hitzeman | Massimo Poesio
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
Long Distance Pronominalisation and Global Focus
Janet Hitzeman | Massimo Poesio
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

1997

pdf bib
Resolving bridging references in unrestricted text
Massimo Poesio | Renata Vieira | Simone Teufel
Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts

1996

pdf bib
Book Reviews: Logic and Lexicon
Massimo Poesio
Computational Linguistics, Volume 22, Number 1, March 1996

1993

pdf bib
Temporal Centering
Megumi Kameyama | Rebecca Passonneau | Massimo Poesio
31st Annual Meeting of the Association for Computational Linguistics

pdf bib
Assigning a Semantic Scope to Operators
Massimo Poesio
31st Annual Meeting of the Association for Computational Linguistics
