Massimo Poesio - ACL Anthology

Massimo Poesio

Also published as: M. Poesio

2025

Beyond Citations: Integrating Finding-Based Relations for Improved Biomedical Article Representations
Yuan Liang | Massimo Poesio | Roonak Rezvani
Proceedings of the 24th Workshop on Biomedical Language Processing

High-quality scientific article embeddings are essential for tasks like document retrieval, citation recommendation, and classification. Traditional citation-based approaches assume citations reflect semantic similarity—an assumption that introduces bias and noise. Recent models like SciNCL and SPECTER2 have attempted to refine citation-based representations but still struggle with noisy citation edges and fail to fully leverage textual information. To address these limitations, we propose a hybrid approach that combines Finding-Citation Graphs (FCG) with contrastive learning. Our method improves triplet selection by filtering out less important citations and incorporating finding similarity relations, leading to better semantic relationship capture. Evaluated on the SciRepEval benchmark, our approach consistently outperforms citation-only baselines, showing the value of text-based semantic structures. While we do not surpass state-of-the-art models in most tasks, our results reveal the limitations of purely citation-based embeddings and suggest paths for improvement through enhanced semantic integration and domain-specific adaptations.

Can LLMs Detect Ambiguous Plural Reference? An Analysis of Split-Antecedent and Mereological Reference
Dang Thi Thao Anh | Rick Nouwen | Massimo Poesio
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Our goal is to study how LLMs represent and interpret plural reference in ambiguous and unambiguous contexts. We ask the following research questions: (1) Do LLMs exhibit human-like preferences in representing plural reference? and (2) Are LLMs able to detect ambiguity in plural anaphoric expressions and identify possible referents? To address these questions, we design a set of experiments, examining pronoun production using next-token prediction tasks, pronoun interpretation, and ambiguity detection using different prompting strategies. We then assess how comparable LLMs are to humans in formulating and interpreting plural reference. We find that LLMs are sometimes aware of possible referents of ambiguous pronouns. However, they do not always follow human reference when choosing between interpretations, especially when the possible interpretation is not explicitly mentioned. In addition, they struggle to identify ambiguity without direct instruction. Our findings also reveal inconsistencies in the results across different types of experiments.

MLLMs Construction Company: Investigating Multimodal LLMs’ Communicative Skills in a Collaborative Building Task
Marika Sarzotti | Giovanni Duca | Chris Madge | Raffaella Bernardi | Massimo Poesio
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Michal Novak | Massimo Poesio | Sameer Pradhan | Vincent Ng
Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference

Referential ambiguity and clarification requests: comparing human and LLM behaviour
Chris Madge | Matthew Purver | Massimo Poesio
Proceedings of the Eighth Workshop on Computational Models of Reference, Anaphora and Coreference

In this work we examine LLMs’ ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus — one for reference and ambiguity in reference, and one for SDRT including clarifications — into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs’ ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.

Low-Hallucination and Efficient Coreference Resolution with LLMs
Yujian Gan | Yuan Liang | Jinxia Xie | Yanni Lin | Juntao Yu | Massimo Poesio
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) have shown promising results in coreference resolution, especially after fine-tuning. However, recent generative approaches face a critical issue: hallucinations—where the model generates content not present in the original input. These hallucinations make evaluation difficult and decrease overall performance. To address this issue, we analyze the underlying causes of hallucinations and propose a low-hallucination and efficient solution. Specifically, we introduce Efficient Constrained Decoding for Coreference Resolution, which maintains strong robustness while significantly improving computational efficiency. On the English OntoNotes development set, our approach achieved slightly better performance than previous state-of-the-art methods, while requiring substantially fewer parameters.

Comparing Eye-gaze and Transformer Attention Mechanisms in Reading Tasks
Maria Mouratidi | Massimo Poesio
Proceedings of the First International Workshop on Gaze Data and Natural Language Processing

As transformers become increasingly prevalent in NLP research, evaluating their cognitive alignment with human language processing has become essential for validating them as models of human language. This study compares eye-gaze patterns in human reading with transformer attention using different attention representations (raw attention, attention flow, gradient-based saliency). We employ both statistical correlation analysis and predictive modeling using PCA-reduced representations of eye-tracking features across two reading tasks. The findings reveal lower correlations and predictive capacity for the decoder model compared to the encoder model, with implications for the gap between behavioral performance and cognitive plausibility of different transformer designs.

Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation
Hadi Mohammadi | Tina Shahedi | Pablo Mosteiro | Massimo Poesio | Ayoub Bagheri | Anastasia Giachanou
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (~8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a potentially more reliable path towards fairness than persona simulation.

Not Just Who or What: Modeling the Interaction of Linguistic and Annotator Variation in Hateful Word Interpretation
Sanne Hoeken | Özge Alacam | Dong Nguyen | Massimo Poesio | Sina Zarrieß
Proceedings of the 16th International Conference on Computational Semantics

Interpreting whether a word is hateful in context is inherently subjective. While growing research in NLP recognizes the importance of annotation variation and moves beyond treating it as noise, most work focuses primarily on annotator-related factors, often overlooking the role of linguistic context and its interaction with individual interpretation. In this paper, we investigate the factors driving variation in hateful word meaning interpretation by extending the HateWiC dataset with linguistic and annotator-level features. Our empirical analysis shows that variation in annotations is not solely a function of who is interpreting or what is being interpreted, but of the interaction between the two. We evaluate how well models replicate the patterns of human variation. We find that incorporating annotator information can improve alignment with human disagreement but still underestimates it. Our findings further demonstrate that capturing interpretation variation requires modeling the interplay between annotators and linguistic content and that neither surface-level agreement nor predictive accuracy alone is sufficient for truly reflecting human variation.

Extracting Behaviors from German Clinical Interviews in Support of Autism Spectrum Diagnosis
Margareta A. Kulcsar | Ian Paul Grant | Massimo Poesio
Proceedings of the 16th International Conference on Computational Semantics

Accurate identification of behaviors is essential for diagnosing developmental disorders such as Autism Spectrum Disorder (ASD). We frame the extraction of behaviors from text as a specialized form of event extraction grounded in the TimeML framework and evaluate two approaches: a pipeline model and an end-to-end model that directly extracts behavior spans from raw text. We introduce two novel datasets: a new clinical annotation of an existing Reddit corpus of parent-authored posts in English and a clinically annotated corpus of German ASD diagnostic interviews. On the English dataset, the end-to-end BERT model achieved an F1 score of 73.4% in behavior classification, outperforming the pipeline models (F1: 66.8% and 53.65%). On the German clinical dataset, the end-to-end model reached an even higher F1 score of 80.1%, again outperforming the pipeline (F1: 78.7%) and approaching the gold-annotated upper bound (F1: 92.9%). These results demonstrate that behavior classification benefits from direct extraction, and that our method generalizes across domains and languages.

Hypernetworks for Perspectivist Adaptation
Daniil Ignatev | Denis Paperno | Massimo Poesio
Proceedings of the 4th Workshop on Perspectivist Approaches to NLP

The task of perspective-aware classification introduces a bottleneck in terms of parametric efficiency that did not get enough recognition in existing studies. In this article, we aim to address this issue by applying an existing architecture, the hypernetwork+adapters combination, to perspectivist classification. Ultimately, we arrive at a solution that can compete with specialized models in adopting user perspectives on hate speech and toxicity detection, while also making use of considerably fewer parameters. Our solution is architecture-agnostic and can be applied to a wide range of base models out of the box.

LeWiDi-2025 at NLPerspectives: Third Edition of the Learning with Disagreements Shared Task
Elisa Leonardelli | Silvia Casola | Siyao Peng | Giulia Rizzi | Valerio Basile | Elisabetta Fersini | Diego Frassinelli | Hyewon Jang | Maja Pavlovic | Barbara Plank | Massimo Poesio
Proceedings of the 4th Workshop on Perspectivist Approaches to NLP

Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated as per their ability to recognize such variation. The LeWiDi series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LeWiDi benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy, and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods for modeling variation. Together, these contributions strengthen LeWiDi as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.

Exploring the Usage of Knowledge Graphs in Identifying Human and LLM-Generated Fake Reviews
Ming Liu | Massimo Poesio
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

The emergence of large language models has led to an explosion of machine-generated fake reviews. Although distinguishing between human and LLM-generated fake reviews is an area of active research, progress is still needed. One aspect which makes current LLM-generated fake reviews easier to recognize is that LLMs, in particular the smaller ones, lack domain-related knowledge. The objective of this work is to investigate whether large language models can produce more realistic artificial reviews when supplemented with knowledge graph information, thus resulting in a more challenging training dataset for human and LLM-generated fake review detectors. We propose a method for generating fake reviews by providing knowledge graph information to a Llama model, and use it to generate a large number of fake reviews, which were used to fine-tune a state-of-the-art human and LLM-generated fake review detection system. Our results show that when knowledge graph information is provided as part of the input, the accuracy of the model is improved by 0.24%. When the knowledge graph is used as an embedding layer and combined with the existing input embedding layer, the accuracy of the detection model is improved by 1.279%.

Annotator disagreement in RST annotation schemes
Daniil Ignatev | Denis Paperno | Massimo Poesio
Proceedings of the Society for Computation in Linguistics 2025

Improving LLMs’ Learning of Coreference Resolution
Yujian Gan | Yuan Liang | Yanni Lin | Juntao Yu | Massimo Poesio
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Coreference Resolution (CR) is crucial for many NLP tasks, but existing LLMs struggle with hallucination and under-performance. In this paper, we investigate the limitations of existing LLM-based approaches to CR, specifically the Question-Answering (QA) Template and Document Template methods, and propose two novel techniques: Reversed Training with Joint Inference and Iterative Document Generation. Our experiments show that Reversed Training improves the QA Template method, while Iterative Document Generation eliminates hallucinations in the generated source text and boosts coreference resolution. Integrating these methods and techniques offers an effective and robust solution to LLM-based coreference resolution.

2024

A Fine-grained citation graph for biomedical academic papers: the finding-citation graph
Yuan Liang | Massimo Poesio | Roonak Rezvani
Proceedings of the 23rd Workshop on Biomedical Natural Language Processing

Citations typically mention findings as well as papers. To model this richer notion of citation, we introduce a richer form of citation graph with nodes for both academic papers and their findings: the finding-citation graph (FCG). We also present a new pipeline to construct such a graph, which includes a finding identification module and a citation sentence extraction module. From each paper, the pipeline first extracts basic information, the abstract, and the structured full text. The abstract and vital sections, such as the results and discussion, are input into the finding identification module. This module identifies multiple findings from a paper, achieving 80% accuracy in multiple findings evaluation. The full text is input into the citation sentence extraction module to identify inline citation sentences and citation markers, achieving 97.7% accuracy. Then, the graph is constructed using the outputs of these two modules. We used Europe PMC to build such a graph with the pipeline, resulting in a graph with 14.25 million nodes and 76 million edges.

Polysemy—Evidence from Linguistics, Behavioral Science, and Contextualized Language Models
Janosch Haber | Massimo Poesio
Computational Linguistics, Volume 50, Issue 1 - March 2024

Polysemy is the type of lexical ambiguity where a word has multiple distinct but related interpretations. In the past decade, it has been the subject of a great many studies across multiple disciplines including linguistics, psychology, neuroscience, and computational linguistics, which have made it increasingly clear that the complexity of polysemy precludes simple, universal answers, especially concerning the representation and processing of polysemous words. But fuelled by the growing availability of large, crowdsourced datasets providing substantial empirical evidence; improved behavioral methodology; and the development of contextualized language models capable of encoding the fine-grained meaning of a word within a given context, the literature on polysemy recently has developed more complex theoretical analyses. In this survey we discuss these recent contributions to the investigation of polysemy against the backdrop of a long legacy of research across multiple decades and disciplines. Our aim is to bring together different perspectives to achieve a more complete picture of the heterogeneity and complexity of the phenomenon of polysemy. Specifically, we highlight evidence supporting a range of hybrid models of the mental processing of polysemes. These hybrid models combine elements from different previous theoretical approaches to explain patterns and idiosyncrasies in the processing of polysemous words that the best known models so far have failed to account for.
Our literature review finds that (i) traditional analyses of polysemy can be limited in their generalizability by loose definitions and selective materials; (ii) linguistic tests provide useful evidence on individual cases, but fail to capture the full range of factors involved in the processing of polysemous sense extensions; and (iii) recent behavioral (psycho)linguistic studies, large-scale annotation efforts, and investigations leveraging contextualized language models provide accumulating evidence suggesting that polysemous sense similarity covers a wide spectrum between identity of sense and homonymy-like unrelatedness of meaning. We hope that the interdisciplinary account of polysemy provided in this survey inspires further fundamental research on the nature of polysemy and better equips applied research to deal with the complexity surrounding the phenomenon, for example, by enabling the development of benchmarks and testing paradigms for large language models informed by a greater portion of the rich evidence on the phenomenon currently available.

The ARRAU 3.0 Corpus
Massimo Poesio | Maris Camilleri | Paloma Carretero Garcia | Juntao Yu | Mark-Christoph Müller
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)

The ARRAU corpus is an anaphorically annotated corpus designed to cover a wide variety of aspects of anaphoric reference in a variety of genres, including both written text and spoken language. The objective of this annotation project is to push forward the state of the art in anaphoric annotation, by overcoming the limitations of current annotation practice and the scope of current models of anaphoric interpretation, which in turn may reveal other issues. The resulting corpus is still therefore very much a work in progress almost twenty years after the project started. In this paper, we discuss the issues identified with the coding scheme used for the previous release, ARRAU 2, and through the use of this corpus for three shared tasks; the proposed solutions to these issues; and the resulting corpus, ARRAU 3.

Proceedings of the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Anna Nedoluzhko | Massimo Poesio | Sameer Pradhan | Vincent Ng
Proceedings of the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference

Using In-context Learning to Automate AI Image Generation for a Gamified Text Labelling Task
Fatima Althani | Chris Madge | Massimo Poesio
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

This paper explores a novel automated method to produce AI-generated images for a text-labelling gamified task. By leveraging the in-context learning capabilities of GPT-4, we automate the optimisation of text-to-image prompts to align with the text being labelled in the part-of-speech tagging task. As an initial evaluation, we compare the optimised prompts to the original sentences based on imageability and concreteness scores. Our results revealed that optimised prompts had significantly higher imageability and concreteness scores. Moreover, to evaluate text-to-image outputs, we generate images using Stable Diffusion XL based on the two prompt types, optimised prompts and the original sentences. Using the automated LAION-Aesthetic predictor model, we assigned aesthetic scores for the generated images. This resulted in the outputs using optimised prompts scoring significantly higher in predicted aesthetics than those using original sentences as prompts. Our preliminary findings suggest that this methodology provides significantly more aesthetic text-to-image outputs than using the original sentence as a prompt. While the initial results are promising, the text labelling task and AI-generated images presented in this paper have yet to undergo human evaluation.

Linguistic Acceptability and Usability Enhancement: A Case Study of GWAP Evaluation and Redesign
Wateen Abdullah Aliady | Massimo Poesio
Proceedings of the 10th Workshop on Games and Natural Language Processing @ LREC-COLING 2024

Collecting high-quality annotations for Natural Language Processing (NLP) tasks poses challenges. Gamified annotation systems, like Games-with-a-Purpose (GWAP), have become popular tools for data annotation. For GWAPs to be effective, they must be user-friendly and produce high-quality annotations to ensure the collected data’s usefulness. This paper investigates the effectiveness of a gamified approach through two specific studies on an existing GWAP designed for collecting NLP coreference judgments. The first study involved preliminary usability testing using the concurrent think-aloud method to gather open-ended feedback. This feedback was crucial in pinpointing design issues. Following this, we conducted semi-structured interviews with our participants, and the insights collected from these interviews were instrumental in crafting player personas, which informed design improvements aimed at enhancing user experience. The outcomes of our research have been generalized to benefit other GWAP implementations. The second study evaluated the linguistic acceptability and reliability of the data collected through our GWAP. Our findings indicate that our GWAP produced reliable corpora with 91.49% accuracy and 0.787 Cohen’s kappa.
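The two reliability figures quoted in the abstract above (accuracy and Cohen's kappa) can be illustrated with a minimal sketch. The abstract does not say how the authors computed them, so the function names and the toy "ok"/"bad" judgments below are hypothetical; the formulas are the standard definitions of accuracy and of Cohen's kappa for two label sequences.

```python
from collections import Counter


def accuracy(gold, collected):
    """Fraction of items where the collected label matches the gold label."""
    return sum(g == c for g, c in zip(gold, collected)) / len(gold)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement is exactly 1,
    i.e. when both sequences use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both sources pick the same label
    # if each drew labels independently from its own label distribution.
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: expert judgments vs. judgments collected in a game.
gold = ["ok", "ok", "bad", "ok", "bad", "ok"]
game = ["ok", "ok", "bad", "bad", "bad", "ok"]

print(f"accuracy = {accuracy(gold, game):.3f}")
print(f"kappa    = {cohen_kappa(gold, game):.3f}")
```

Kappa is lower than raw accuracy on the same data because it discounts the agreement two sources would reach by chance given their label frequencies.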

Assessing the Capabilities of Large Language Models in Coreference: An Evaluation
Yujian Gan | Massimo Poesio | Juntao Yu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper offers a nuanced examination of the role Large Language Models (LLMs) play in coreference resolution, aimed at guiding the future direction in the era of LLMs. We carried out both manual and automatic analyses of different LLMs’ abilities, employing different prompts to examine the performance of different LLMs, obtaining a comprehensive view of their strengths and weaknesses. We found that LLMs show exceptional ability in understanding coreference. However, harnessing this ability to achieve state-of-the-art results on traditional datasets and benchmarks isn’t straightforward. Given these findings, we propose that future efforts should: (1) Improve the scope, data, and evaluation methods of traditional coreference research to adapt to the development of LLMs. (2) Enhance the fine-grained language understanding capabilities of LLMs.

Conceptual Pacts for Reference Resolution Using Small, Dynamically Constructed Language Models: A Study in Puzzle Building Dialogues
Julian Hough | Sina Zarrieß | Casey Kennington | David Schlangen | Massimo Poesio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Using Brennan and Clark’s theory of a Conceptual Pact, that when interlocutors agree on a name for an object, they are forming a temporary agreement on how to conceptualize that object, we present an extension to a simple reference resolver which simulates this process over time with different conversation pairs. In a puzzle construction domain, we model pacts with small language models for each referent which update during the interaction. When features from these pact models are incorporated into a simple bag-of-words reference resolver, the accuracy increases compared to using a standard pre-trained model. The model performs equally to a competitor using the same data but with exhaustive re-training after each prediction, while also being more transparent, faster and less resource-intensive. We also experiment with reducing the number of training interactions, and can still achieve reference resolution accuracies of over 80% in testing from observing a single previous interaction, over 20% higher than a pre-trained baseline. While this is a limited domain, we argue the model could be applicable to larger real-world applications in human and human-robot interaction and is an interpretable and transparent model.

Universal Anaphora: The First Three Years
Massimo Poesio | Maciej Ogrodniczuk | Vincent Ng | Sameer Pradhan | Juntao Yu | Nafise Sadat Moosavi | Silviu Paun | Amir Zeldes | Anna Nedoluzhko | Michal Novák | Martin Popel | Zdeněk Žabokrtský | Daniel Zeman
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The aim of the Universal Anaphora initiative is to push forward the state of the art in anaphora and anaphora resolution by expanding the aspects of anaphoric interpretation which are or can be reliably annotated in anaphoric corpora, producing unified standards to annotate and encode these annotations, delivering datasets encoded according to these standards, and developing methods for evaluating models that carry out this type of interpretation. Although several papers on aspects of the initiative have appeared, no overall description of the initiative’s goals, proposals and achievements has been published yet except as an online draft. This paper aims to fill this gap, as well as to discuss its progress so far.

Soft metrics for evaluation with disagreements: an assessment
Giulia Rizzi | Elisa Leonardelli | Massimo Poesio | Alexandra Uma | Maja Pavlovic | Silviu Paun | Paolo Rosso | Elisabetta Fersini
Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024

The move towards preserving judgement disagreements in NLP requires the identification of adequate evaluation metrics. We identify a set of key properties that such metrics should have, and assess the extent to which natural candidates for soft evaluation such as Cross Entropy satisfy such properties. We employ a theoretical framework, supported by a visual approach, by practical examples, and by the analysis of a real case scenario. Our results indicate that Cross Entropy can result in fairly paradoxical results in some cases, whereas other measures such as Manhattan distance and Euclidean distance exhibit a more intuitive behavior, at least for the case of binary classification.
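As a rough illustration of the three measures named in the abstract above, the sketch below compares a predicted distribution with a human soft label in the binary case. It is not the paper's evaluation code: the function names and the example distributions are assumptions made for illustration, and only the standard definitions of cross entropy, Manhattan distance, and Euclidean distance are used.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy H(target, predicted) in nats; eps avoids log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def manhattan(target, predicted):
    """Sum of absolute differences between the two distributions."""
    return sum(abs(t - p) for t, p in zip(target, predicted))


def euclidean(target, predicted):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(target, predicted)))


# Hypothetical soft label: 60% of annotators chose class 0, 40% class 1.
human_soft_label = [0.6, 0.4]
predictions = {
    "identical": [0.6, 0.4],      # model reproduces the human distribution
    "overconfident": [1.0, 0.0],  # model puts all mass on the majority class
}

for name, predicted in predictions.items():
    print(
        f"{name:>13}: "
        f"CE={cross_entropy(human_soft_label, predicted):.3f} "
        f"L1={manhattan(human_soft_label, predicted):.3f} "
        f"L2={euclidean(human_soft_label, predicted):.3f}"
    )
```

One property visible here is that the two distances are zero for a perfect prediction, while cross entropy bottoms out at the entropy of the target distribution rather than at zero, so its raw value depends on how much the annotators disagreed.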

The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation
Maja Pavlovic | Massimo Poesio
Proceedings of the 3rd Workshop on Perspectivist Approaches to NLP (NLPerspectives) @ LREC-COLING 2024

Recent studies focus on exploring the capability of Large Language Models (LLMs) for data annotation. Our work, firstly, offers a comparative overview of twelve such studies that investigate labelling with LLMs, particularly focusing on classification tasks. Secondly, we present an empirical analysis that examines the degree of alignment between the opinion distributions returned by GPT and those provided by human annotators across four subjective datasets. Our analysis supports a minority of studies that are considering diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.

Analyzing and Enhancing Clarification Strategies for Ambiguous References in Consumer Service Interactions
Changling Li | Yujian Gan | Zhenrong Yang | Youyang Chen | Xinxuan Qiu | Yanni Lin | Matthew Purver | Massimo Poesio
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

When customers present ambiguous references, service staff typically need to clarify the customers’ specific intentions. To advance research in this area, we collected 1,000 real-world consumer dialogues with ambiguous references. This dataset will be used for subsequent studies to identify ambiguous references and generate responses. Our analysis of the dataset revealed common strategies employed by service staff, including directly asking clarification questions (CQ) and listing possible options before asking a clarification question (LCQ). However, we found that merely using CQ often fails to fully satisfy customers. In contrast, using LCQ, as well as recommending specific products after listing possible options, proved more effective in resolving ambiguous references and enhancing customer satisfaction.

Polysemy through the lens of psycholinguistic variables: a dataset and an evaluation of static and contextualized language models
Andrea Bruera | Farbod Zamani | Massimo Poesio
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Polysemes are words that can have different senses depending on the context of utterance: for instance, ‘newspaper’ can refer to an organization (as in ‘manage the newspaper’) or to an object (as in ‘open the newspaper’). Contrary to a large body of evidence coming from psycholinguistics, polysemy has been traditionally modelled in NLP by assuming that each sense should be given a separate representation in a lexicon (e.g. WordNet). This led to the current situation, where datasets used to evaluate the ability of computational models of semantics miss crucial details about the representation of polysemes, thus limiting the amount of evidence that can be gained from their use. In this paper we propose a framework to approach polysemy as a continuous variation in psycholinguistic properties of a word in context. This approach accommodates different sense interpretations, without postulating clear-cut jumps between senses. First we describe a publicly available English dataset that we collected, where polysemes in context (verb-noun phrases) are annotated for their concreteness and body sensory strength. Then, we evaluate static and contextualized language models in their ability to predict the ratings of each polyseme in context, as well as in their ability to capture the distinction among senses, revealing and characterizing in an interpretable way the models’ flaws.

2023

Proceedings of the Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023)
Maciej Ogrodniczuk | Vincent Ng | Sameer Pradhan | Massimo Poesio
Proceedings of the Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023)

Aggregating Crowdsourced and Automatic Judgments to Scale Up a Corpus of Anaphoric Reference for Fiction and Wikipedia Texts
Juntao Yu | Silviu Paun | Maris Camilleri | Paloma Garcia | Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Although several datasets annotated for anaphoric reference / coreference exist, even the largest such datasets have limitations in terms of size, range of domains, coverage of anaphoric phenomena, and size of documents included. Yet, the approaches proposed to scale up anaphoric annotation haven’t so far resulted in datasets overcoming these limitations. In this paper, we introduce a new release of a corpus for anaphoric reference labelled via a game-with-a-purpose. This new release is comparable in size to the largest existing corpora for anaphoric reference due in part to substantial activity by the players, in part thanks to the use of a new resolve-and-aggregate paradigm to ‘complete’ markable annotations through the combination of an anaphoric resolver and an aggregation method for anaphoric reference. The proposed method could be adopted to greatly speed up annotation time in other projects involving games-with-a-purpose. In addition, the corpus covers genres for which no comparable size datasets exist (Fiction and Wikipedia); it covers singletons and non-referring expressions; and it includes a substantial number of long documents ( 2K in length).

The Universal Anaphora Scorer 2.0
Juntao Yu | Michal Novák | Abdulrahman Aloraini | Nafise Sadat Moosavi | Silviu Paun | Sameer Pradhan | Massimo Poesio
Proceedings of the 15th International Conference on Computational Semantics

The aim of the Universal Anaphora initiative is to push forward the state of the art both in anaphora (coreference) annotation and in the evaluation of models for anaphora resolution. The first release of the Universal Anaphora Scorer (Yu et al., 2022b) supported the scoring not only of identity anaphora as in the Reference Coreference Scorer (Pradhan et al., 2014) but also of split antecedent anaphoric reference, bridging references, and discourse deixis. That scorer was used in the CODI-CRAC 2021/2022 Shared Tasks on Anaphora Resolution in Dialogues (Khosla et al., 2021; Yu et al., 2022a). A modified version of the scorer supporting discontinuous markables and the COREFUD markup format was also used in the CRAC 2022 Shared Task on Multilingual Coreference Resolution (Zabokrtsky et al., 2022). In this paper, we introduce the second release of the scorer, merging the two previous versions, which can score reference with discontinuous markables and zero anaphora resolution.

Data Augmentation for Fake Reviews Detection
Ming Liu | Massimo Poesio
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

In this research, we studied the relationship between data augmentation and model accuracy for the task of fake review detection. We used data generation methods to augment two different fake review datasets and compared the performance of models trained with the original data and with the augmented data. Our results show that the accuracy of our fake review detection model can be improved by 0.31 percentage points on DeRev Test and by 7.65 percentage points on Amazon Test by using the augmented datasets.

SemEval-2023 Task 11: Learning with Disagreements (LeWiDi)
Elisa Leonardelli | Gavin Abercrombie | Dina Almanea | Valerio Basile | Tommaso Fornaciari | Barbara Plank | Verena Rieser | Alexandra Uma | Massimo Poesio
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

NLP datasets annotated with human judgments are rife with disagreements between the judges. This is especially true for tasks depending on subjective judgments such as sentiment analysis or offensive language detection. Particularly in these latter cases, the NLP community has come to realize that the common approach of ‘reconciling’ these different subjective interpretations risks misrepresenting the evidence. Many NLP researchers have therefore concluded that rather than eliminating disagreements from annotated corpora, we should preserve them; indeed, some argue that corpora should aim to preserve all interpretations produced by annotators. But this approach to corpus creation for NLP has not yet been widely accepted. The objective of the Le-Wi-Di series of shared tasks is to promote this approach to developing NLP models by providing a unified framework for training and evaluating with such datasets. We report on the second such shared task, which differs from the first edition in three crucial respects: (i) it focuses entirely on NLP, instead of both NLP and computer vision tasks in its first edition; (ii) it focuses on subjective tasks, instead of covering different types of disagreements as training with aggregated labels for subjective NLP tasks is in effect a misrepresentation of the data; and (iii) for the evaluation, we concentrated on soft approaches to evaluation. This second edition of Le-Wi-Di attracted a wide array of participants resulting in 13 shared task submission papers.

2022

Hard and Soft Evaluation of NLP models with BOOtSTrap SAmpling - BooStSa
Tommaso Fornaciari | Alexandra Uma | Massimo Poesio | Dirk Hovy
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Natural Language Processing (NLP)’s applied nature makes it necessary to select the most effective and robust models. Producing slightly higher performance is insufficient; we want to know whether this advantage will carry over to other data sets. Bootstrapped significance tests can indicate that ability. So while necessary, computing the significance of models’ performance differences has many levels of complexity. It can be tedious, especially when the experimental design has many conditions to compare and several runs of experiments. We present BooStSa, a tool that makes it easy to compute significance levels with the BOOtSTrap SAmpling procedure to evaluate models that predict not only standard hard labels but soft-labels (i.e., probability distributions over different classes) as well.
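
As an illustration of the general idea of a bootstrapped significance test, the following is a minimal sketch of one common paired-bootstrap variant. It is not BooStSa's interface: the function names `accuracy` and `paired_bootstrap`, the parameters, and the decision rule (counting resamples in which system B fails to beat system A) are assumptions made for this example.

```python
import random


def accuracy(gold, pred):
    """Fraction of items where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def paired_bootstrap(gold, pred_a, pred_b, metric=accuracy,
                     n_samples=1000, seed=0):
    """Estimate how often system B fails to beat system A under resampling.

    Hypothetical helper: returns (observed difference, estimated p-value).
    `metric` must be a higher-is-better function of (gold, predictions).
    """
    rng = random.Random(seed)
    n = len(gold)
    observed = metric(gold, pred_b) - metric(gold, pred_a)
    not_better = 0
    for _ in range(n_samples):
        # Resample test items with replacement, keeping the A/B pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if metric(g, b) - metric(g, a) <= 0:
            not_better += 1
    return observed, not_better / n_samples
```

A small estimated p-value suggests that B's advantage is unlikely to be an artefact of the particular test sample; other formulations of the test use a different decision rule.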

Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Juntao Yu | Sopan Khosla | Ramesh Manuvinakurike | Lori Levin | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

The CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Juntao Yu | Sopan Khosla | Ramesh Manuvinakurike | Lori Levin | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2022 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

The CODI-CRAC 2022 Shared Task on Anaphora Resolution in Dialogues is the second edition of an initiative focused on detecting different types of anaphoric relations in conversations of different kinds. Using five conversational datasets, four of which have been newly annotated with a wide range of anaphoric relations: identity, bridging references and discourse deixis, we defined multiple tasks focusing individually on these key relations. The second edition of the shared task maintained the focus on these relations and used the same datasets as in 2021, but new test data were annotated, the 2021 data were checked, and new subtasks were added. In this paper, we discuss the annotation schemes, the datasets, the evaluation scripts used to assess the system performance on these tasks, and provide a brief summary of the participating systems and the results obtained across 230 runs from three teams, with most submissions achieving significantly better results than our baseline methods.

Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Sameer Pradhan | Anna Nedoluzhko | Vincent Ng | Massimo Poesio
Proceedings of the Fifth Workshop on Computational Models of Reference, Anaphora and Coreference

Less Text, More Visuals: Evaluating the Onboarding Phase in a GWAP for NLP
Fatima Althani | Chris Madge | Massimo Poesio
Proceedings of the 9th Workshop on Games and Natural Language Processing within the 13th Language Resources and Evaluation Conference

Games-with-a-purpose find attracting players a challenge. To improve player recruitment, we explored two game design elements that can increase player engagement during the onboarding phase: a narrative and a tutorial. In a qualitative study with 12 players of linguistic and language learning games, we examined the effect of presentation format on players’ engagement. Our reflexive thematic analysis found that in the onboarding phase of a GWAP for NLP, presenting players with visuals is expected and presenting too much text overwhelms them. Furthermore, players found that the instructions they were presented with lacked linguistic context. Additionally, the tutorial and game interface required refinement as the feedback was unsupportive and the graphics were not clear.

ArMIS - The Arabic Misogyny and Sexism Corpus with Annotator Subjective Disagreements
Dina Almanea | Massimo Poesio
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The use of misogynistic and sexist language has increased in recent years in social media, and is increasing in the Arabic world in reaction to reforms attempting to remove restrictions on women’s lives. However, there are few benchmarks for Arabic misogyny and sexism detection, and in those the annotations are in aggregated form even though misogyny and sexism judgments are found to be highly subjective. In this paper we introduce an Arabic misogyny and sexism dataset (ArMIS) characterized by providing annotations from annotators with different degrees of religious belief, and provide evidence that such differences do result in disagreements. To the best of our knowledge, this is the first dataset to study in detail the effect of beliefs on misogyny and sexism annotation. We also discuss proof-of-concept experiments showing that a dataset in which disagreements have not been reconciled can be used to train state-of-the-art models for misogyny and sexism detection; and consider different ways in which such models could be evaluated.

The Universal Anaphora Scorer
Juntao Yu | Sopan Khosla | Nafise Sadat Moosavi | Silviu Paun | Sameer Pradhan | Massimo Poesio
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The aim of the Universal Anaphora initiative is to push forward the state of the art in anaphora and anaphora resolution by expanding the aspects of anaphoric interpretation which are or can be reliably annotated in anaphoric corpora, producing unified standards to annotate and encode these annotations, delivering datasets encoded according to these standards, and developing methods for evaluating models carrying out this type of interpretation. Such expansion of the scope of anaphora resolution requires a comparable expansion of the scope of the scorers used to evaluate this work. In this paper, we introduce an extended version of the Reference Coreference Scorer (Pradhan et al., 2014) that can be used to evaluate the extended range of anaphoric interpretation included in the current Universal Anaphora proposal. The UA scorer supports the evaluation of identity anaphora resolution and of bridging reference resolution, for which scorers already existed but were not integrated in a single package. It also supports the evaluation of split antecedent anaphora and discourse deixis, for which no tools existed. The proposed approach to the evaluation of split antecedent anaphora is entirely novel; the proposed approach to the evaluation of discourse deixis leverages the encoding of discourse deixis proposed in Universal Anaphora to enable the use for discourse deixis of the same metrics already used for identity anaphora. The scorer was tested in the recent CODI-CRAC 2021 Shared Task on Anaphora Resolution in Dialogues.

Joint Coreference Resolution for Zeros and non-Zeros in Arabic
Abdulrahman Aloraini | Sameer Pradhan | Massimo Poesio
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Most existing proposals about anaphoric zero pronoun (AZP) resolution regard full mention coreference and AZP resolution as two independent tasks, even though the two tasks are clearly related. The main issues that need tackling to develop a joint model for zero and non-zero mentions are the difference between the two types of arguments (zero pronouns, being null, provide no nominal information) and the lack of annotated datasets of a suitable size in which both types of arguments are annotated for languages other than Chinese and Japanese. In this paper, we introduce two architectures for jointly resolving AZPs and non-AZPs, and evaluate them on Arabic, a language for which, as far as we know, there has been no prior work on joint resolution. Doing this also required creating a new version of the Arabic subset of the standard coreference resolution dataset used for the CoNLL-2012 shared task (Pradhan et al., 2012) in which both zeros and non-zeros are included in a single dataset.

Coreference Annotation of an Arabic Corpus using a Virtual World Game
Wateen Abdullah Aliady | Abdulrahman Aloraini | Christopher Madge | Juntao Yu | Richard Bartle | Massimo Poesio
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Coreference resolution is a key aspect of text comprehension, but the size of the available coreference corpora for Arabic is limited in comparison to the size of the corpora for other languages. In this paper we present a Game-With-A-Purpose called Stroll with a Scroll created to collect from players coreference annotations for Arabic. The key contribution of this work is the embedding of the annotation task in a virtual world setting, as opposed to the puzzle-type games used in previously proposed Games-With-A-Purpose for coreference.

2021

We Need to Consider Disagreement in Evaluation
Valerio Basile | Michael Fell | Tommaso Fornaciari | Dirk Hovy | Silviu Paun | Barbara Plank | Massimo Poesio | Alexandra Uma
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

Evaluation is of paramount importance in data-driven research fields such as Natural Language Processing (NLP) and Computer Vision (CV). Current evaluation practice largely hinges on the existence of a single “ground truth” against which we can meaningfully compare the prediction of a model. However, this comparison is flawed for two reasons. 1) In many cases, more than one answer is correct. 2) Even where there is a single answer, disagreement among annotators is ubiquitous, making it difficult to decide on a gold standard. We argue that the current methods of adjudication, agreement, and evaluation need serious reconsideration. Some researchers now propose to minimize disagreement and to fix datasets. We argue that this is a gross oversimplification, and likely to conceal the underlying complexity. Instead, we suggest that we need to better capture the sources of disagreement to improve today’s evaluation practice. We discuss three sources of disagreement: from the annotator, the data, and the context, and show how this affects even seemingly objective tasks. Datasets with multiple annotations are becoming more common, as are methods to integrate disagreement into modeling. The logical next step is to extend this to evaluation.

Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Sopan Khosla | Ramesh Manuvinakurike | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

The CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue
Sopan Khosla | Juntao Yu | Ramesh Manuvinakurike | Vincent Ng | Massimo Poesio | Michael Strube | Carolyn Rosé
Proceedings of the CODI-CRAC 2021 Shared Task on Anaphora, Bridging, and Discourse Deixis in Dialogue

In this paper, we provide an overview of the CODI-CRAC 2021 Shared-Task: Anaphora Resolution in Dialogue. The shared task focuses on detecting anaphoric relations in different genres of conversations. Using five conversational datasets, four of which have been newly annotated with a wide range of anaphoric relations: identity, bridging references and discourse deixis, we defined multiple subtasks focusing individually on these key relations. We discuss the evaluation scripts used to assess the system performance on these subtasks, and provide a brief summary of the participating systems and the results obtained across ?? runs from 5 teams, with most submissions achieving significantly better results than our baseline methods.

Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Sameer Pradhan | Massimo Poesio | Yulia Grishina | Vincent Ng
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

Coreference Resolution for the Biomedical Domain: A Survey
Pengcheng Lu | Massimo Poesio
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

Issues with coreference resolution are one of the most frequently mentioned challenges for information extraction from the biomedical literature. Thus, the biomedical genre has long been the second most researched genre for coreference resolution after the news domain, and the subject of a great deal of research for NLP in general. In recent years this interest has grown enormously, leading to the development of a number of substantial datasets, of domain-specific contextual language models, and of several architectures. In this paper we review the state of the art of coreference in the biomedical domain, with particular attention to these most recent developments.

Data Augmentation Methods for Anaphoric Zero Pronouns
Abdulrahman Aloraini | Massimo Poesio
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

In pro-drop languages like Arabic, Chinese, Italian, Japanese, Spanish, and many others, unrealized (null) arguments in certain syntactic positions can refer to a previously introduced entity, and are thus called anaphoric zero pronouns. The existing resources for studying anaphoric zero pronoun interpretation are however still limited. In this paper, we use five data augmentation methods to generate and detect anaphoric zero pronouns automatically. We use the augmented data as additional training materials for two anaphoric zero pronoun systems for Arabic. Our experimental results show that data augmentation improves the performance of the two systems, surpassing the state-of-the-art results.

BERTective: Language Models and Contextual Information for Deception Detection
Tommaso Fornaciari | Federico Bianchi | Massimo Poesio | Dirk Hovy
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Spotting a lie is challenging but has an enormous potential impact on security as well as private and public safety. Several NLP methods have been proposed to classify texts as truthful or deceptive. In most cases, however, the target texts’ preceding context is not considered. This is a severe limitation, as any communication takes place in context, not in a vacuum, and context can help to detect deception. We study a corpus of Italian dialogues containing deceptive statements and implement deep neural models that incorporate various linguistic contexts. We establish a new state of the art in identifying deception and find that not all context is equally useful to the task. Only the texts closest to the target, if from the same speaker (rather than questions by an interlocutor), boost performance. We also find that the semantic information in language models such as BERT contributes to the performance. However, BERT alone does not capture the implicit knowledge of deception cues: its contribution is conditional on the concurrent use of attention to learn cues from BERT’s representations.

Patterns of Polysemy and Homonymy in Contextualised Language Models
Janosch Haber | Massimo Poesio
Findings of the Association for Computational Linguistics: EMNLP 2021

One of the central aspects of contextualised language models is that they should be able to distinguish the meaning of lexically ambiguous words by their contexts. In this paper we investigate the extent to which the contextualised embeddings of word forms that display multiplicity of sense reflect traditional distinctions of polysemy and homonymy. To this end, we introduce an extended, human-annotated dataset of graded word sense similarity and co-predication acceptability, and evaluate how well the similarity of embeddings predicts similarity in meaning. Both types of human judgements indicate that the similarity of polysemic interpretations falls in a continuum between identity of meaning and homonymy. However, we also observe significant differences within the similarity ratings of polysemes, forming consistent patterns for different types of polysemic sense alternation. Our dataset thus appears to capture a substantial part of the complexity of lexical ambiguity, and can provide a realistic test bed for contextualised embeddings. Among the tested models, BERT Large shows the strongest correlation with the collected word sense similarity ratings, but struggles to consistently replicate the observed similarity patterns. When clustering ambiguous word forms based on their embeddings, the model displays high confidence in discerning homonyms and some types of polysemic alternations, but consistently fails for others.

Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning
Tommaso Fornaciari | Alexandra Uma | Silviu Paun | Barbara Plank | Dirk Hovy | Massimo Poesio
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Supervised learning assumes that a ground truth label exists. However, the reliability of this ground truth depends on human annotators, who often disagree. Prior work has shown that this disagreement can be helpful in training models. We propose a novel method to incorporate this disagreement as information: in addition to the standard error computation, we use soft-labels (i.e., probability distributions over the annotator labels) as an auxiliary task in a multi-task neural network. We measure the divergence between the predictions and the target soft-labels with several loss-functions and evaluate the models on various NLP tasks. We find that the soft-label prediction auxiliary task reduces the penalty for errors on ambiguous entities, and thereby mitigates overfitting. It significantly improves performance across tasks, beyond the standard approach and prior work.
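
To make the idea of a soft-label auxiliary objective concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the helper names (`soft_label`, `kl_divergence`, `multitask_loss`), the choice of KL divergence in this direction, and the `aux_weight` parameter are hypothetical, and the two prediction vectors stand in for the outputs of a main and an auxiliary head of a multi-task network.

```python
import math
from collections import Counter


def soft_label(votes, classes):
    """Turn a list of annotator votes into a probability distribution."""
    counts = Counter(votes)
    return [counts[c] / len(votes) for c in classes]


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred); terms with zero target mass contribute nothing."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, pred) if t > 0)


def multitask_loss(hard_index, soft_target, pred_main, pred_aux,
                   aux_weight=1.0):
    """Hard-label cross-entropy plus a weighted soft-label divergence."""
    hard_loss = -math.log(max(pred_main[hard_index], 1e-12))
    return hard_loss + aux_weight * kl_divergence(soft_target, pred_aux)


# Four annotators, three of whom chose "offensive".
classes = ["offensive", "not_offensive"]
target = soft_label(
    ["offensive", "offensive", "not_offensive", "offensive"], classes)
loss = multitask_loss(0, target, pred_main=[0.9, 0.1], pred_aux=[0.6, 0.4])
```

In this sketch, an auxiliary prediction that matches the annotator distribution incurs no auxiliary penalty, so an item on which annotators are split is penalised less for a hedged prediction than it would be under the hard label alone.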

Stay Together: A System for Single and Split-antecedent Anaphora Resolution
Juntao Yu | Nafise Sadat Moosavi | Silviu Paun | Massimo Poesio
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The state-of-the-art on basic, single-antecedent anaphora has greatly improved in recent years. Researchers have therefore started to pay more attention to more complex cases of anaphora such as split-antecedent anaphora, as in “Time-Warner is considering a legal challenge to Telecommunications Inc’s plan to buy half of Showtime Networks Inc–a move that could lead to all-out war between the two powerful companies”. Split-antecedent anaphora is rarer and more complex to resolve than single-antecedent anaphora; as a result, it is not annotated in many datasets designed to test coreference, and previous work on resolving this type of anaphora was carried out in unrealistic conditions that assume gold mentions and/or gold split-antecedent anaphors are available. These systems also focus on split-antecedent anaphors only. In this work, we introduce a system that resolves both single and split-antecedent anaphors, and evaluate it in a more realistic setting that uses predicted mentions. We also start addressing the question of how to evaluate single and split-antecedent anaphors together using standard coreference evaluation metrics.

SemEval-2021 Task 12: Learning with Disagreements
Alexandra Uma | Tommaso Fornaciari | Anca Dumitrache | Tristan Miller | Jon Chamberlain | Barbara Plank | Edwin Simpson | Massimo Poesio
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

Disagreement between coders is ubiquitous in virtually all datasets annotated with human judgements in both natural language processing and computer vision. However, most supervised machine learning methods assume that a single preferred interpretation exists for each item, which is at best an idealization. The aim of the SemEval-2021 shared task on learning with disagreements (Le-Wi-Di) was to provide a unified testing framework for methods for learning from data containing multiple and possibly contradictory annotations covering the best-known datasets containing information about disagreements for interpreting language and classifying images. In this paper we describe the shared task and its results.

2020

Named Entity Recognition as Dependency Parsing
Juntao Yu | Bernd Bohnet | Massimo Poesio
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing, concerned with identifying spans of text expressing references to entities. NER research is often focused on flat entities only (flat NER), ignoring the fact that entity references can be nested, as in [Bank of [China]] (Finkel and Manning, 2009). In this paper, we use ideas from graph-based dependency parsing to provide our model a global view on the input via a biaffine model (Dozat and Manning, 2017). The biaffine model scores pairs of start and end tokens in a sentence which we use to explore all spans, so that the model is able to predict named entities accurately. We show that the model works well for both nested and flat NER through evaluation on 8 corpora and achieving SoTA performance on all of them, with accuracy gains of up to 2.2 percentage points.
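
The span-scoring step described above can be sketched as follows. This is a simplified illustration rather than the paper's model: it uses plain Python lists, a single score per span instead of one score per entity category, and hand-set toy parameters (`U`, `w`, `b`), all of which are assumptions made for this example. In a trained model the start and end representations would come from an encoder and the parameters would be learned.

```python
def biaffine_score(start_vec, end_vec, U, w, b):
    """Score one (start, end) pair: start^T U end + w . [start; end] + b."""
    bilinear = sum(start_vec[p] * U[p][q] * end_vec[q]
                   for p in range(len(start_vec))
                   for q in range(len(end_vec)))
    linear = sum(wi * xi for wi, xi in zip(w, start_vec + end_vec))
    return bilinear + linear + b


def score_all_spans(start_reprs, end_reprs, U, w, b):
    """Score every span (i, j) with i <= j, covering nested spans too."""
    n = len(start_reprs)
    return {(i, j): biaffine_score(start_reprs[i], end_reprs[j], U, w, b)
            for i in range(n) for j in range(i, n)}


# Toy 3-token sentence with 2-dimensional start/end representations.
reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]   # identity: the bilinear term is a dot product
w = [0.0, 0.0, 0.0, 0.0]
scores = score_all_spans(reprs, reprs, U, w, b=0.0)
```

Because every start/end pair receives its own score, overlapping spans such as (0, 2) and (2, 2) are scored independently, which is what allows nested entities to be considered at all.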

Speaking Outside the Box: Exploring the Benefits of Unconstrained Input in Crowdsourcing and Citizen Science Platforms
Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

Crowdsourcing approaches provide a difficult design challenge for developers. There is a trade-off between the efficiency of the task to be done and the reward given to the user for participating, whether it be altruism, social enhancement, entertainment or money. This paper explores how crowdsourcing and citizen science systems collect data and complete tasks, illustrated by a case study from the online language game-with-a-purpose Phrase Detectives. The game was originally developed to be a constrained interface to prevent player collusion, but subsequently benefited from post hoc analysis of over 76k unconstrained inputs from users. Understanding the interface design and task deconstruction are critical for enabling users to participate in such systems and the paper concludes with a discussion of the idea that social networks can be viewed as a form of citizen science platform with both constrained and unconstrained inputs making for a highly complex dataset.

Multitask Learning-Based Neural Bridging Reference Resolution
Juntao Yu | Massimo Poesio
Proceedings of the 28th International Conference on Computational Linguistics

We propose a multi-task learning-based neural model for resolving bridging references, tackling two key challenges. The first challenge is the lack of large corpora annotated with bridging references. To address this, we use multi-task learning to help bridging reference resolution with coreference resolution. We show that substantial improvements of up to 8 p.p. can be achieved on full bridging resolution with this architecture. The second challenge is the different definitions of bridging used in different corpora, meaning that hand-coded systems or systems using special features designed for one corpus do not work well with other corpora. Our neural model only uses a small number of corpus-independent features, and thus can be applied to different corpora. Evaluations with very different bridging corpora (ARRAU, ISNOTES, BASHI and SCICORP) suggest that our architecture works equally well on all corpora, and achieves the SoTA results on full bridging resolution for all corpora, outperforming the best reported results by up to 36.3 p.p.

Free the Plural: Unrestricted Split-Antecedent Anaphora Resolution
Juntao Yu | Nafise Sadat Moosavi | Silviu Paun | Massimo Poesio
Proceedings of the 28th International Conference on Computational Linguistics

Now that the performance of coreference resolvers on the simpler forms of anaphoric reference has greatly improved, more attention is devoted to more complex aspects of anaphora. One limitation of virtually all coreference resolution models is the focus on single-antecedent anaphors. Plural anaphors with multiple antecedents, so-called split-antecedent anaphors (as in John met Mary. They went to the movies), have not been widely studied, because they are not annotated in ONTONOTES and are relatively infrequent in other corpora. In this paper, we introduce the first model for unrestricted resolution of split-antecedent anaphors. We start with a strong baseline enhanced by BERT embeddings, and show that we can substantially improve its performance by addressing the sparsity issue. To do this, we experiment with auxiliary corpora where split-antecedent anaphors were annotated by the crowd, and with transfer learning models using element-of bridging references and single-antecedent coreference as auxiliary tasks. Evaluation on the gold annotated ARRAU corpus shows that our best model, which uses a combination of three auxiliary corpora, achieved F1 scores of 70% and 43.6% when evaluated in a lenient and strict setting, respectively, i.e., gains of 11 and 21 percentage points when compared with our baseline.

Anaphoric Zero Pronoun Identification: A Multilingual Approach
Abdulrahman Aloraini | Massimo Poesio
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

Pro-drop languages such as Arabic, Chinese, Italian or Japanese allow morphologically null but referential arguments in certain syntactic positions, called anaphoric zero-pronouns. Much NLP work on anaphoric zero-pronouns (AZP) is based on gold mentions, but models for their identification are a fundamental prerequisite for their resolution in real-life applications. Such identification requires complex language understanding and knowledge of real-world entities. Transfer learning models, such as BERT, have recently been shown to learn surface, syntactic, and semantic information, which can be very useful in recognizing AZPs. We propose a BERT-based multilingual model for AZP identification from predicted zero pronoun positions, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, this is the first neural network model of AZP identification for Arabic; and our approach outperforms the state-of-the-art for Chinese. Experiment results suggest that BERT implicitly encodes information about AZPs through their surrounding context.

Neural Coreference Resolution for Arabic
Abdulrahman Aloraini | Juntao Yu | Massimo Poesio
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

No neural coreference resolver for Arabic exists; in fact, we are not aware of any learning-based coreference resolver for Arabic since (Björkelund and Kuhn, 2014). In this paper, we introduce a coreference resolution system for Arabic based on Lee et al.’s end-to-end architecture combined with the Arabic version of BERT and an external mention detector. As far as we know, this is the first neural coreference resolution system aimed specifically at Arabic, and it substantially outperforms the existing state-of-the-art on OntoNotes 5.0 with a gain of 15.2 points CoNLL F1. We also discuss the current limitations of the task for Arabic and possible approaches that can tackle these challenges.

Aggregation Driven Progression System for GWAPs
Osman Doruk Kicikoglu | Richard Bartle | Jon Chamberlain | Silviu Paun | Massimo Poesio
Workshop on Games and Natural Language Processing

As the uses of Games-With-A-Purpose (GWAPs) broaden, the systems that incorporate them have expanded in complexity. The types of annotations required within the NLP paradigm are one such example, where tasks can involve annotations of varying complexity. Assigning more complex tasks to more skilled players through a progression mechanism can achieve higher accuracy in the collected data while acting as a motivating factor that rewards the more skilled players. In this paper, we present the progression technique implemented in Wormingo, an NLP GWAP that currently includes two layers of task complexity. For the experiment, we have implemented four different progression scenarios on 192 players and compared the accuracy and engagement achieved with each scenario.

Neural Mention Detection
Juntao Yu | Bernd Bohnet | Massimo Poesio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Mention detection is an important preprocessing step for annotation and interpretation in applications such as NER and coreference resolution, but few stand-alone neural models have been proposed that are able to handle the full range of mentions. In this work, we propose and compare three neural network-based approaches to mention detection. The first approach is based on the mention detection part of a state-of-the-art coreference resolution system; the second uses ELMO embeddings together with a bidirectional LSTM and a biaffine classifier; the third approach uses the recently introduced BERT model. Our best model (using a biaffine classifier) achieves gains of up to 1.8 percentage points on mention recall when compared with a strong baseline in a HIGH RECALL coreference annotation setting. The same model achieves improvements of up to 5.3 and 6.2 p.p. when compared with the best-reported mention detection F1 on the CONLL and CRAC coreference data sets respectively in a HIGH F1 annotation setting. We then evaluate our models for coreference resolution by using mentions predicted by our best model in state-of-the-art coreference systems. The enhanced model achieved absolute improvements of up to 1.7 and 0.7 p.p. when compared with our strong baseline systems (pipeline system and end-to-end system) respectively. For nested NER, the evaluation of our model on the GENIA corpora shows that our model matches or outperforms state-of-the-art models despite not being specifically designed for this task.

A Cluster Ranking Model for Full Anaphora Resolution
Juntao Yu | Alexandra Uma | Massimo Poesio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Anaphora resolution (coreference) systems designed for the CONLL 2012 dataset typically cannot handle key aspects of the full anaphora resolution task such as the identification of singletons and of certain types of non-referring expressions (e.g., expletives), as these aspects are not annotated in that corpus. However, the recently released dataset for the CRAC 2018 Shared Task can now be used for that purpose. In this paper, we introduce an architecture to simultaneously identify non-referring expressions (including expletives, predicatives, and other types) and build coreference chains, including singletons. Our cluster-ranking system uses an attention mechanism to determine the relative importance of the mentions in the same cluster. Additional classifiers are used to identify singletons and non-referring markables. Our contributions are as follows. First of all, we report the first result on the CRAC data using system mentions; our result is 5.8% better than the shared task baseline system, which used gold mentions. Second, we demonstrate that the availability of singleton clusters and non-referring expressions can lead to substantially improved performance on non-singleton clusters as well. Third, we show that despite our model not being designed specifically for the CONLL data, it achieves a score equivalent to that of the state-of-the-art system by Kantor and Globerson (2019) on that dataset.

Cross-lingual Zero Pronoun Resolution
Abdulrahman Aloraini | Massimo Poesio
Proceedings of the Twelfth Language Resources and Evaluation Conference

In languages like Arabic, Chinese, Italian, Japanese, Korean, Portuguese, Spanish, and many others, predicate arguments in certain syntactic positions can be left unrealized instead of being realized as overt pronouns, and are thus called zero- or null-pronouns. Identifying and resolving such omitted arguments is crucial to machine translation, information extraction and other NLP tasks, but depends heavily on semantic coherence and lexical relationships. We propose a BERT-based cross-lingual model for zero pronoun resolution, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, ours is the first neural model of zero-pronoun resolution for Arabic; and our model also outperforms the state-of-the-art for Chinese. In the paper we also evaluate BERT feature extraction and fine-tune models on the task, and compare them with our model. We also report on an investigation of BERT layers indicating which layer encodes the most suitable representation for the task.

Polygloss - A conversational agent for language practice
Etiene da Cruz Dalcol | Massimo Poesio
Proceedings of the 9th Workshop on NLP for Computer Assisted Language Learning

Word Sense Distance in Human Similarity Judgements and Contextualised Word Embeddings
Janosch Haber | Massimo Poesio
Proceedings of the Probability and Meaning Conference (PaM 2020)

Homonymy is often used to showcase one of the advantages of context-sensitive word embedding techniques such as ELMo and BERT. In this paper we want to shift the focus to the related but less exhaustively explored phenomenon of polysemy, where a word expresses various distinct but related senses in different contexts. Specifically, we aim to i) investigate a recent model of polyseme sense clustering proposed by Ortega-Andres & Vicente (2019) through analysing empirical evidence of word sense grouping in human similarity judgements, ii) extend the evaluation of context-sensitive word embedding systems by examining whether they encode differences in word sense similarity and iii) compare the word sense similarities of both methods to assess their correlation and gain some intuition as to how well contextualised word embeddings could be used as surrogate word sense similarity judgements in linguistic experiments.

Assessing Polyseme Sense Similarity through Co-predication Acceptability and Contextualised Embedding Distance
Janosch Haber | Massimo Poesio
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Co-predication is one of the most frequently used linguistic tests to tell apart shifts in polysemic sense from changes in homonymic meaning. It is increasingly coming under criticism as evidence is accumulating that it tends to mis-classify specific cases of polysemic sense alteration as homonymy. In this paper, we collect empirical data to investigate these accusations. We assess how co-predication acceptability relates to explicit ratings of polyseme word sense similarity, and how well either measure can be predicted through the distance between target words’ contextualised word embeddings. We find that sense similarity appears to be a major contributor in determining co-predication acceptability, but that co-predication judgements tend to rate especially less similar sense interpretations equally as unacceptable as homonym pairs, effectively mis-classifying these instances. The tested contextualised word embeddings fail to predict word sense similarity consistently, but the similarities between BERT embeddings show a significant correlation with co-predication ratings. We take this finding as evidence that BERT embeddings might be better representations of context than encodings of word meaning.

The QMUL/HRBDT contribution to the NADI Arabic Dialect Identification Shared Task
Abdulrahman Aloraini | Massimo Poesio | Ayman Alhelbawy
Proceedings of the Fifth Arabic Natural Language Processing Workshop

We present the Arabic dialect identification system that we used for the country-level subtask of the NADI challenge. Our model consists of three components: BiLSTM-CNN, character-level TF-IDF, and topic modeling features. We represent each tweet using these features and feed them into a deep neural network. We then add an effective heuristic that improves the overall performance. We achieved an F1-Macro score of 20.77% and an accuracy of 34.32% on the test set. The model was also evaluated on the Arabic Online Commentary dataset, achieving results better than the state-of-the-art.

2019

A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation
Massimo Poesio | Jon Chamberlain | Silviu Paun | Juntao Yu | Alexandra Uma | Udo Kruschwitz
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a corpus of anaphoric information (coreference) crowdsourced through a game-with-a-purpose. The corpus, containing annotations for about 108,000 markables, is one of the largest corpora for coreference for English, and one of the largest crowdsourced NLP corpora, but its main feature is the large number of judgments per markable: 20 on average, and over 2.2M in total. This characteristic makes the corpus a unique resource for the study of disagreements on anaphoric interpretation. A second distinctive feature is its rich annotation scheme, covering singletons, expletives, and split-antecedent plurals. Finally, the corpus also comes with labels inferred using a recently proposed probabilistic model of annotation for coreference. The labels are of high quality and make it possible to successfully train a state of the art coreference resolver, including training on singletons and non-referring expressions. The annotation model can also result in more than one label, or no label, being proposed for a markable, thus serving as a baseline method for automatically identifying ambiguous markables. A preliminary analysis of the results is presented.

Crowdsourcing and Aggregating Nested Markable Annotations
Chris Madge | Juntao Yu | Jon Chamberlain | Udo Kruschwitz | Silviu Paun | Massimo Poesio
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

One of the key steps in language resource creation is the identification of the text segments to be annotated, or markables, which depending on the task may vary from nominal chunks for named entity resolution to (potentially nested) noun phrases in coreference resolution (or mentions) to larger text segments in text segmentation. Markable identification is typically carried out semi-automatically, by running a markable identifier and correcting its output by hand–which is increasingly done via annotators recruited through crowdsourcing and aggregating their responses. In this paper, we present a method for identifying markables for coreference annotation that combines high-performance automatic markable detectors with checking with a Game-With-A-Purpose (GWAP) and aggregation using a Bayesian annotation model. The method was evaluated both on news data and data from a variety of other genres and results in an improvement on F1 of mention boundaries of over seven percentage points when compared with a state-of-the-art, domain-independent automatic mention detector, and almost three points over an in-domain mention detector. One of the key contributions of our proposal is its applicability to the case in which markables are nested, as is the case with coreference markables; but the GWAP and several of the proposed markable detectors are task and language-independent and are thus applicable to a variety of other annotation scenarios.

Using Automatically Extracted Minimum Spans to Disentangle Coreference Evaluation from Boundary Detection
Nafise Sadat Moosavi | Leo Born | Massimo Poesio | Michael Strube
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The common practice in coreference resolution is to identify and evaluate the maximum span of mentions. The use of maximum spans tangles coreference evaluation with the challenges of mention boundary detection like prepositional phrase attachment. To address this problem, minimum spans are manually annotated in smaller corpora. However, this additional annotation is costly, and therefore this solution does not scale to large corpora. In this paper, we propose the MINA algorithm for automatically extracting minimum spans to benefit from minimum span evaluation in all corpora. We show that the minimum spans extracted by MINA are consistent with those that are manually annotated by experts. Our experiments show that using minimum spans is particularly important in cross-dataset coreference evaluation, in which detected mention boundaries are noisier due to domain shift. We have integrated MINA into https://github.com/ns-moosavi/coval for reporting standard coreference scores based on both maximum and automatically detected minimum spans.

2018

A Probabilistic Annotation Model for Crowdsourcing Coreference
Silviu Paun | Jon Chamberlain | Udo Kruschwitz | Juntao Yu | Massimo Poesio
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The availability of large scale annotated corpora for coreference is essential to the development of the field. However, creating resources at the required scale via expert annotation would be too expensive. Crowdsourcing has been proposed as an alternative; but this approach has not been widely used for coreference. This paper addresses one crucial hurdle on the way to make this possible, by introducing a new model of annotation for aggregating crowdsourced anaphoric annotations. The model is evaluated along three dimensions: the accuracy of the inferred mention pairs, the quality of the post-hoc constructed silver chains, and the viability of using the silver chains as an alternative to the expert-annotated chains in training a state of the art coreference system. The results suggest that our model can extract from crowdsourced annotations coreference chains of comparable quality to those obtained with expert annotation.

Comparing Bayesian Models of Annotation
Silviu Paun | Bob Carpenter | Jon Chamberlain | Dirk Hovy | Udo Kruschwitz | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 6

The analysis of crowdsourced annotations in natural language processing is concerned with identifying (1) gold standard labels, (2) annotator accuracies and biases, and (3) item difficulties and error patterns. Traditionally, majority voting was used for 1, and coefficients of agreement for 2 and 3. Lately, model-based analyses of corpus annotations have proven better at all three tasks. But there has been relatively little work comparing them on the same datasets. This paper aims to fill this gap by analyzing six models of annotation, covering different approaches to annotator ability, item difficulty, and parameter pooling (tying) across annotators and items. We evaluate these models along four aspects: comparison to gold labels, predictive accuracy for new annotations, annotator characterization, and item difficulty, using four datasets with varying degrees of noise in the form of random (spammy) annotators. We conclude with guidelines for model selection, application, and implementation.

Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference
Massimo Poesio | Vincent Ng | Maciej Ogrodniczuk
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

Anaphora Resolution with the ARRAU Corpus
Massimo Poesio | Yulia Grishina | Varada Kolhatkar | Nafise Moosavi | Ina Roesiger | Adam Roussel | Fabian Simonjetz | Alexandra Uma | Olga Uryupina | Juntao Yu | Heike Zinsmeister
Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference

The ARRAU corpus is an anaphorically annotated corpus of English providing rich linguistic information about anaphora resolution. The most distinctive feature of the corpus is the annotation of a wide range of anaphoric relations, including bridging references and discourse deixis in addition to identity (coreference). Other distinctive features include treating all NPs as markables, including non-referring NPs; and the annotation of a variety of morphosyntactic and semantic mention and entity attributes, including the genericity status of the entities referred to by markables. The corpus however has not been extensively used for anaphora resolution research so far. In this paper, we discuss three datasets extracted from the ARRAU corpus to support the three subtasks of the CRAC 2018 Shared Task–identity anaphora resolution over ARRAU-style markables, bridging references resolution, and discourse deixis; the evaluation scripts assessing system performance on those datasets; and preliminary results on these three tasks that may serve as baseline for subsequent research in these phenomena.

2017

Visually Grounded and Textual Semantic Models Differentially Decode Brain Activity Associated with Concrete and Abstract Nouns
Andrew J. Anderson | Douwe Kiela | Stephen Clark | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 5

Important advances have recently been made using computational semantic models to decode brain activity patterns associated with concepts; however, this work has almost exclusively focused on concrete nouns. How well these models extend to decoding abstract nouns is largely unknown. We address this question by applying state-of-the-art computational models to decode functional Magnetic Resonance Imaging (fMRI) activity patterns, elicited by participants reading and imagining a diverse set of both concrete and abstract nouns. One of the models we use is linguistic, exploiting the recent word2vec skipgram approach trained on Wikipedia. The second is visually grounded, using deep convolutional neural networks trained on Google Images. Dual coding theory considers concrete concepts to be encoded in the brain both linguistically and visually, and abstract concepts only linguistically. Splitting the fMRI data according to human concreteness ratings, we indeed observe that both models significantly decode the most concrete nouns; however, accuracy is significantly greater using the text-based models for the most abstract nouns. More generally this confirms that current computational models are sufficiently advanced to assist in investigating the representational structure of abstract concepts in the brain.

Incongruent Headlines: Yet Another Way to Mislead Your Readers
Sophie Chesney | Maria Liakata | Massimo Poesio | Matthew Purver
Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism

This paper discusses the problem of incongruent headlines: those which do not accurately represent the information contained in the article with which they occur. We emphasise that this phenomenon should be considered separately from recognised problematic headline types such as clickbait and sensationalism, arguing that existing natural language processing (NLP) methods applied to these related concepts are not appropriate for the automatic detection of headline incongruence, as an analysis beyond stylistic traits is necessary. We therefore suggest a number of alternative methodologies that may be appropriate to the task at hand as a foundation for future work in this area. In addition, we provide an analysis of existing data sets which are related to this work, and motivate the need for a novel data set in this domain.

2016

The OnForumS corpus from the Shared Task on Online Forum Summarisation at MultiLing 2015
Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio | Josef Steinberger | Jorge Valderrama | Hugo Zaragoza
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present the OnForumS corpus developed for the shared task of the same name on Online Forum Summarisation (OnForumS at MultiLing’15). The corpus consists of a set of news articles with associated readers’ comments from The Guardian (English) and La Repubblica (Italian). It comes with four levels of annotation: argument structure, comment-article linking, sentiment and coreference. The first three were produced through crowdsourcing, whereas the last was produced by an experienced annotator using a mature annotation scheme. Given its annotation breadth, we believe the corpus will prove a useful resource in stimulating and furthering research in the areas of Argumentation Mining, Summarisation, Sentiment, Coreference and the interlinks therein.

Towards a Corpus of Violence Acts in Arabic Social Media
Ayman Alhelbawy | Udo Kruschwitz | Massimo Poesio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper we present a new corpus of Arabic tweets that mention some form of violent event, developed to support the automatic identification of Human Rights Abuse. The dataset was manually labelled for seven classes of violence using crowdsourcing.

Phrase Detectives Corpus 1.0 Crowdsourced Anaphoric Coreference.
Jon Chamberlain | Massimo Poesio | Udo Kruschwitz
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Natural Language Engineering tasks require large and complex annotated datasets to build more advanced models of language. Corpora are typically annotated by several experts to create a gold standard; however, there are now compelling reasons to use a non-expert crowd to annotate text, driven by cost, speed and scalability. Phrase Detectives Corpus 1.0 is an anaphorically-annotated corpus of encyclopedic and narrative text that contains a gold standard created by multiple experts, as well as a set of annotations created by a large non-expert crowd. Analysis shows very good inter-expert agreement (kappa=.88-.93) but a more variable baseline crowd agreement (kappa=.52-.96). Encyclopedic texts show less agreement (and by implication are harder to annotate) than narrative texts. The release of this corpus is intended to encourage research into the use of crowds for text annotation and the development of more advanced, probabilistic language models, in particular for anaphoric coreference.

ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions
Olga Uryupina | Ron Artstein | Antonella Bristot | Federica Cavicchio | Kepa Rodriguez | Massimo Poesio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a second release of the ARRAU dataset: a multi-domain corpus with thorough linguistically motivated annotation of anaphora and related phenomena. Building upon the first release almost a decade ago, a considerable effort has been invested in improving the data both quantitatively and qualitatively. Thus, we have doubled the corpus size, expanded the selection of covered phenomena to include referentiality and genericity, and designed and implemented a methodology for enforcing the consistency of the manual annotation. We believe that the new release of ARRAU provides valuable material for ongoing research in complex cases of coreference as well as for a variety of related tasks. The corpus is publicly available through LDC.

Coreference Resolution for the Basque Language with BART
Ander Soraluze | Olatz Arregi | Xabier Arregi | Arantza Díaz de Ilarraza | Mijail Kabadjov | Massimo Poesio
Proceedings of the Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2016)

Predicting Brexit: Classifying Agreement is Better than Sentiment and Pollsters
Fabio Celli | Evgeny Stepanov | Massimo Poesio | Giuseppe Riccardi
Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media (PEOPLES)

On June 23rd 2016, the UK held the referendum which ratified the exit from the EU. While most of the traditional pollsters failed to forecast the final vote, there were online systems that hit the result with high accuracy using opinion mining techniques and big data. Starting one month before, we collected and monitored millions of posts about the referendum from social media conversations, and exploited Natural Language Processing techniques to predict the referendum outcome. In this paper we discuss the methods used by traditional pollsters and compare them to the predictions based on different opinion mining techniques. We find that opinion mining based on agreement/disagreement classification works better than opinion mining based on polarity classification in the forecast of the referendum outcome.

2015

Combining Minimally-supervised Methods for Arabic Named Entity Recognition
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Transactions of the Association for Computational Linguistics, Volume 3

Supervised methods can achieve high performance on NLP tasks, such as Named Entity Recognition (NER), but new annotations are required for every new domain and/or genre change. This has motivated research in minimally supervised methods such as semi-supervised learning and distant learning, but neither technique has yet achieved performance levels comparable to those of supervised methods. Semi-supervised methods tend to have very high precision but comparatively low recall, whereas distant learning tends to achieve higher recall but lower precision. This complementarity suggests that better results may be obtained by combining the two types of minimally supervised methods. In this paper we present a novel approach to Arabic NER using a combination of semi-supervised and distant learning techniques. We trained a semi-supervised NER classifier and another one using distant learning techniques, and then combined them using a variety of classifier combination schemes, including the Bayesian Classifier Combination (BCC) procedure recently proposed for sentiment analysis. According to our results, the BCC model leads to an increase in performance of 8 percentage points over the best base classifiers.

MultiLing 2015: Multilingual Summarization of Single and Multi-Documents, On-line Fora, and Call-center Conversations
George Giannakopoulos | Jeff Kubina | John Conroy | Josef Steinberger | Benoit Favre | Mijail Kabadjov | Udo Kruschwitz | Massimo Poesio
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2014

Identifying fake Amazon reviews as learning from crowds
Tommaso Fornaciari | Massimo Poesio
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

Automatic Creation of Arabic Named Entity Annotated Corpus Using Wikipedia
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics

AraNLP: a Java-based Library for the Processing of Arabic Text.
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a free, Java-based library named “AraNLP” that covers various Arabic text preprocessing tools. Although a good number of tools for processing Arabic text already exist, integration and compatibility problems continually occur. AraNLP is an attempt to gather most of the vital Arabic text preprocessing tools into one library that can be accessed easily by integrating or accurately adapting existing tools and by developing new ones when required. The library includes a sentence detector, tokenizer, light stemmer, root stemmer, part-of-speech tagger (POS-tagger), word segmenter, normalizer, and a punctuation and diacritic remover.

2013

Of Words, Eyes and Brains: Correlating Image-Based Distributional Semantic Models with Neural Representations of Concepts
Andrew J. Anderson | Elia Bruni | Ulisse Bordignon | Massimo Poesio | Marco Baroni
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Adapting a State-of-the-art Anaphora Resolution System for Resource-poor Language
Utpal Sikdar | Asif Ekbal | Sriparna Saha | Olga Uryupina | Massimo Poesio
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Semi-supervised Learning Approach to Arabic Named Entity Recognition
Maha Althobaiti | Udo Kruschwitz | Massimo Poesio
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

Relational Structures and Models for Coreference Resolution
Truc-Vien T. Nguyen | Massimo Poesio
Proceedings of COLING 2012: Posters

DeCour: a corpus of DEceptive statements in Italian COURts
Tommaso Fornaciari | Massimo Poesio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In criminal proceedings, sometimes it is not easy to evaluate the sincerity of oral testimonies. DECOUR - DEception in COURt corpus - has been built with the aim of training models suitable to discriminate, from a stylometric point of view, between sincere and deceptive statements. DECOUR is a collection of hearings held in four Italian Courts, in which the speakers lie in front of the judge. These hearings become the object of a specific criminal proceeding for calumny or false testimony, in which the deceptiveness of the statements of the defendant is ascertained. Thanks to the final Court judgment, which points out which lies are told, each utterance of the corpus has been annotated as true, uncertain or false, according to its degree of truthfulness. Since the judgment of deceptiveness follows a judicial inquiry, the annotation has been realized with a greater degree of confidence than ever before. Moreover, in Italy this is the first corpus of deceptive texts not relying on 'mock' lies created in laboratory conditions, but which has been collected in a natural environment.

Domain-specific vs. Uniform Modeling for Coreference Resolution
Olga Uryupina | Massimo Poesio
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Several corpora annotated for coreference have been made available in the past decade. These resources differ with respect to their size and the underlying structure: the number of domains and their similarity. Our study compares domain-specific models, learned from small heterogeneous subsets of the investigated corpora, against uniform models that utilize all the available data. We show that for knowledge-poor baseline systems, domain-specific and uniform modeling yield the same results. Systems relying on large amounts of linguistic knowledge, however, exhibit differences in their performance: with all the designed features in use, domain-specific models suffer from over-fitting, whereas with pre-selected feature sets they tend to outperform uniform models.

On the Use of Homogenous Sets of Subjects in Deceptive Language Analysis
Tommaso Fornaciari | Massimo Poesio
Proceedings of the Workshop on Computational Approaches to Deception Detection

Annotating Archaeological Texts: An Example of Domain-Specific Annotation in the Humanities
Francesca Bonin | Fabio Cavulli | Aronne Noriller | Massimo Poesio | Egon W. Stemle
Proceedings of the Sixth Linguistic Annotation Workshop

BART goes multilingual: The UniTN / Essex submission to the CoNLL-2012 Shared Task
Olga Uryupina | Alessandro Moschitti | Massimo Poesio
Joint Conference on EMNLP and CoNLL - Shared Task

On discriminating fMRI representations of abstract WordNet taxonomic categories
Andrew Anderson | Tao Yuan | Brian Murphy | Massimo Poesio
Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon

2011

Single and multi-objective optimization for feature selection in anaphora resolution
Sriparna Saha | Asif Ekbal | Olga Uryupina | Massimo Poesio
Proceedings of 5th International Joint Conference on Natural Language Processing

A Cross-Lingual ILP Solution to Zero Anaphora Resolution
Ryu Iida | Massimo Poesio
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Structure-Preserving Pipelines for Digital Libraries
Massimo Poesio | Eduard Barbu | Egon Stemle | Christian Girardi
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Multi-metric optimization for coreference: The UniTN / IITP / Essex submission to the 2011 CONLL Shared Task
Olga Uryupina | Sriparna Saha | Asif Ekbal | Massimo Poesio
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

2010

Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus
Kepa Joseba Rodríguez | Francesca Delogu | Yannick Versley | Egon W. Stemle | Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Live Memories corpus is an Italian corpus annotated for anaphoric relations. This annotation effort aims to contribute to two significant issues for CL research: the lack of annotated anaphoric resources for Italian and the increasing interest in the social Web. The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users' comments. It is planned to add a set of articles from local newspapers. The corpus includes manually annotated information about morphosyntactic agreement, anaphoricity, and semantic class of the NPs. The anaphoric annotation includes discourse deixis and bridging relations, and marks cases of ambiguity with the annotation of alternative interpretations. For the annotation of the anaphoric links, the corpus takes into account specific phenomena of the Italian language such as incorporated clitics and phonetically non-realized pronouns. Reliability studies for the annotation of the mentioned phenomena and for annotation of anaphoric links in general offer satisfactory results. The Wikipedia and blogs dataset will be distributed under a Creative Commons Attribution licence.

BabyExp: Constructing a Huge Multimodal Resource to Acquire Commonsense Knowledge Like Children Do
Massimo Poesio | Marco Baroni | Oswald Lanz | Alessandro Lenci | Alexandros Potamianos | Hinrich Schütze | Sabine Schulte im Walde | Luca Surian
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

There is by now widespread agreement that the most realistic way to construct the large-scale commonsense knowledge repositories required by natural language and artificial intelligence applications is by letting machines learn such knowledge from large quantities of data, like humans do. A lot of attention has consequently been paid to the development of increasingly sophisticated machine learning algorithms for knowledge extraction. However, the nature of the input that humans are exposed to while learning commonsense knowledge has received much less attention. The BabyExp project is collecting very dense audio and video recordings of the first 3 years of life of a baby. The corpus constructed in this way will be transcribed with automated techniques and made available to the research community. Moreover, techniques to extract commonsense conceptual knowledge incrementally from these multimodal data are also being explored within the project. The current paper describes BabyExp in general, and presents pilot studies on the feasibility of the automated audio and video transcriptions.

Extending BART to Provide a Coreference Resolution System for German
Samuel Broscheit | Simone Paolo Ponzetto | Yannick Versley | Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a flexible toolkit-based approach to automatic coreference resolution on German text. We start with our previous work aimed at reimplementing the system from Soon et al. (2001) for English, and extend it to duplicate a version of the state-of-the-art proposal from Klenner and Ailloud (2009). Evaluation performed on a benchmarking dataset, namely the TueBa-D/Z corpus (Hinrichs et al., 2005b), shows that machine learning based coreference resolution can be robustly performed in a language other than English.

Creating a Coreference Resolution System for Italian
Massimo Poesio | Olga Uryupina | Yannick Versley
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper summarizes our work on creating a full-scale coreference resolution (CR) system for Italian, using BART, an open-source modular CR toolkit initially developed for English corpora. We discuss our experiments on language-specific issues of the task. As our evaluation experiments show, a language-agnostic system (designed primarily for English) can achieve a performance level in the high forties (MUC F-score) when re-trained and tested on a new language, at least on gold mention boundaries. Compared to this level, we can improve our F-score by around 10% by introducing a small number of language-specific changes. This shows that, with a modular coreference resolution platform such as BART, one can straightforwardly develop a family of robust and reliable systems for various languages. We hope that our experiments will encourage researchers working on coreference in other languages to create their own full-scale coreference resolution systems; as we have mentioned above, at the moment such modules exist only for very few languages other than English.

SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
Marta Recasens | Lluís Màrquez | Emili Sapena | M. Antònia Martí | Mariona Taulé | Véronique Hoste | Massimo Poesio | Yannick Versley
Proceedings of the 5th International Workshop on Semantic Evaluation

BART: A Multilingual Anaphora Resolution System
Samuel Broscheit | Massimo Poesio | Simone Paolo Ponzetto | Kepa Joseba Rodriguez | Lorenza Romano | Olga Uryupina | Yannick Versley | Roberto Zanoli
Proceedings of the 5th International Workshop on Semantic Evaluation

Detecting Semantic Category in Simultaneous EEG/MEG Recordings
Brian Murphy | Massimo Poesio
Proceedings of the NAACL HLT 2010 First Workshop on Computational Neurolinguistics

Proceedings of the Fourth Linguistic Annotation Workshop
Nianwen Xue | Massimo Poesio
Proceedings of the Fourth Linguistic Annotation Workshop

2009

EEG responds to conceptual stimuli and corpus semantics
Brian Murphy | Marco Baroni | Massimo Poesio
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Evaluating Centering for Information Ordering Using Corpora
Nikiforos Karamanis | Chris Mellish | Massimo Poesio | Jon Oberlander
Computational Linguistics, Volume 35, Number 1, March 2009

Obituaries: Janet Hitzeman
Massimo Poesio | David Day | Inderjeet Mani
Computational Linguistics, Volume 35, Number 4, December 2009

State-of-the-art NLP Approaches to Coreference Resolution: Theory and Practical Recipes
Simone Paolo Ponzetto | Massimo Poesio
Tutorial Abstracts of ACL-IJCNLP 2009

Unsupervised Knowledge Extraction for Taxonomies of Concepts from Wikipedia
Eduard Barbu | Massimo Poesio
Proceedings of the International Conference RANLP-2009

Constructing an Anaphorically Annotated Corpus with Non-Experts: Assessing the Quality of Collaborative Annotations
Jon Chamberlain | Udo Kruschwitz | Massimo Poesio
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)

Play your way to an annotated corpus: Games with a purpose and anaphoric annotation
Massimo Poesio
Proceedings of the Eighth International Conference on Computational Semantics

Interactive Gesture in Dialogue: a PTT Model
Hannes Rieser | Massimo Poesio
Proceedings of the SIGDIAL 2009 Conference

2008

Coreference Systems Based on Kernels Methods
Yannick Versley | Alessandro Moschitti | Massimo Poesio | Xiaofeng Yang
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Survey Article: Inter-Coder Agreement for Computational Linguistics
Ron Artstein | Massimo Poesio
Computational Linguistics, Volume 34, Number 4, December 2008

A Corpus for Cross-Document Co-reference
David Day | Janet Hitzeman | Michael Wick | Keith Crouch | Massimo Poesio
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes a newly created text corpus of news articles that has been annotated for cross-document co-reference. Being able to robustly resolve references to entities across document boundaries will provide a useful capability for a variety of tasks, ranging from practical information retrieval applications to challenging research in information extraction and natural language understanding. This annotated corpus is intended to encourage the development of systems that can more accurately address this problem. A manual annotation tool was developed that allowed the complete corpus to be searched for likely co-referring entity mentions. This corpus of 257K words links mentions of co-referent people, locations and organizations (subject to some additional constraints). Each of the documents had already been annotated for within-document co-reference by the LDC as part of the ACE series of evaluations. The annotation process was bootstrapped with a string-matching-based linking procedure, and we report on some initial experimentation with the data. The cross-document linking information will be made publicly available.

Anaphoric Annotation in the ARRAU Corpus
Massimo Poesio | Ron Artstein
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Arrau is a new corpus annotated for anaphoric relations, with information about agreement and explicit representation of multiple antecedents for ambiguous anaphoric expressions and discourse antecedents for expressions which refer to abstract entities such as events, actions and plans. The corpus contains texts from different genres: task-oriented dialogues from the Trains-91 and Trains-93 corpus, narratives from the English Pear Stories corpus, newspaper articles from the Wall Street Journal portion of the Penn Treebank, and mixed text from the Gnome corpus.

BART: A modular toolkit for coreference resolution
Yannick Versley | Simone Ponzetto | Massimo Poesio | Vladimir Eidelman | Alan Jern | Jason Smith | Xiaofeng Yang | Alessandro Moschitti
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Developing a full coreference system able to run all the way from raw text to semantic interpretation is a considerable engineering effort. Accordingly, there is very limited availability of off-the-shelf tools for researchers whose interests are not primarily in coreference or others who want to concentrate on a specific aspect of the problem. We present BART, a highly modular toolkit for developing coreference applications. In the Johns Hopkins workshop on using lexical and encyclopedic knowledge for entity disambiguation, the toolkit was used to extend a reimplementation of Soon et al.'s proposal with a variety of additional syntactic and knowledge-based features, and to experiment with alternative resolution processes, preprocessing tools, and classifiers. BART has been released as open source software and is available from http://www.sfs.uni-tuebingen.de/~versley/BART

ANAWIKI: Creating Anaphorically Annotated Resources through Web Cooperation
Massimo Poesio | Udo Kruschwitz | Jon Chamberlain
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The ability to make progress in Computational Linguistics depends on the availability of large annotated corpora, but creating such corpora by hand annotation is very expensive and time-consuming; in practice, it is unfeasible to think of annotating more than one million words. However, the success of Wikipedia and other projects shows that another approach might be possible: taking advantage of the willingness of Web users to contribute to collaborative resource creation. AnaWiki is a recently started project that will develop tools to allow and encourage large numbers of volunteers over the Web to collaborate in the creation of semantically annotated corpora (in the first instance, of a corpus annotated with information about anaphora).

BART: A Modular Toolkit for Coreference Resolution
Yannick Versley | Simone Paolo Ponzetto | Massimo Poesio | Vladimir Eidelman | Alan Jern | Jason Smith | Xiaofeng Yang | Alessandro Moschitti
Proceedings of the ACL-08: HLT Demo Session

Addressing the Resource Bottleneck to Create Large-Scale Annotated Texts
Jon Chamberlain | Massimo Poesio | Udo Kruschwitz
Semantics in Text Processing. STEP 2008 Conference Proceedings

2007

Discovering contradicting protein-protein interactions in text
Olivia Sanchez | Massimo Poesio
Biological, translational, and clinical language processing

Standoff Coordination for Multi-Tool Annotation in a Dialogue Corpus
Kepa Joseba Rodríguez | Stefanie Dipper | Michael Götze | Massimo Poesio | Giuseppe Riccardi | Christian Raymond | Joanna Rabiega-Wiśniewska
Proceedings of the Linguistic Annotation Workshop

2006

An Anaphora Resolution-Based Anonymization Module
M. Poesio | M. A. Kabadjov | P. Goux | U. Kruschwitz | E. Bishop | L. Corti
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Growing privacy and security concerns mean there is an increasing need for data to be anonymized before being publicly released. We present a module for anonymizing references implemented as part of the SQUAD tools for specifying and testing non-proprietary means of storing and marking up data using universal (XML) standards and technologies. The tool is implemented on top of the GUITAR anaphoric resolver.

2005

Improving LSA-based Summarization with Anaphora Resolution
Josef Steinberger | Mijail Kabadjov | Massimo Poesio | Olivia Sanchez-Graillet
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Merging PropBank, NomBank, TimeBank, Penn Discourse Treebank and Coreference
James Pustejovsky | Adam Meyers | Martha Palmer | Massimo Poesio
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

The Reliability of Anaphoric Annotation, Reconsidered: Taking Ambiguity into Account
Massimo Poesio | Ron Artstein
Proceedings of the Workshop on Frontiers in Corpus Annotations II: Pie in the Sky

Identifying Concept Attributes Using a Classifier
Massimo Poesio | Abdulrahman Almuhareb
Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition

2004

Centering: A Parametric Theory and Its Instantiations
Massimo Poesio | Rosemary Stevenson | Barbara Di Eugenio | Janet Hitzeman
Computational Linguistics, Volume 30, Number 3, September 2004

Acquiring Bayesian Networks from Text
Olivia Sanchez-Graillet | Massimo Poesio
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

A General-Purpose, Off-the-shelf Anaphora Resolution Module: Implementation and Preliminary Evaluation
Massimo Poesio | Mijail A. Kabadjov
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Learning to Resolve Bridging References
Massimo Poesio | Rahul Mehta | Axel Maroudas | Janet Hitzeman
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

Evaluating Centering-Based Metrics of Coherence
Nikiforos Karamanis | Massimo Poesio | Chris Mellish | Jon Oberlander
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

Discourse Annotation and Semantic Annotation in the GNOME corpus
Massimo Poesio
Proceedings of the Workshop on Discourse Annotation

Discourse-New Detectors for Definite Description Resolution: A Survey and a Preliminary Proposal
Massimo Poesio | Olga Uryupina | Renata Vieira | Mijail Alexandrov-Kabadjov | Rodrigo Goulart
Proceedings of the Conference on Reference Resolution and Its Applications

The MATE/GNOME Proposals for Anaphoric Annotation, Revisited
Massimo Poesio
Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004

Attribute-Based and Value-Based Clustering: An Evaluation
Abdulrahman Almuhareb | Massimo Poesio
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

Identifying Broken Plurals in Unvowelised Arabic Text
Abduelbaset Goweder | Massimo Poesio | Anne De Roeck | Jeff Reynolds
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

Associative Descriptions and Salience: A Preliminary Investigation
Massimo Poesio
Proceedings of the 2003 EACL Workshop on The Computational Treatment of Anaphora

2002

Acquiring Lexical Knowledge for Anaphora Resolution
Massimo Poesio | Tomonori Ishikawa | Sabine Schulte im Walde | Renata Vieira
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2001

Corpus-based NP Modifier Generation
Hua Cheng | Massimo Poesio | Renate Henschel | Chris Mellish
Second Meeting of the North American Chapter of the Association for Computational Linguistics

2000

Modelling Grounding and Discourse Obligations Using Update Rules
Colin Matheson | Massimo Poesio | David Traum
1st Meeting of the North American Chapter of the Association for Computational Linguistics

Pronominalization revisited
Renate Henschel | Hua Cheng | Massimo Poesio
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

Corpus-based Development and Evaluation of a System for Processing Definite Descriptions
Renata Vieira | Massimo Poesio
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

An Empirically-based System for Processing Definite Descriptions
Renata Vieira | Massimo Poesio
Computational Linguistics, Volume 26, Number 4, December 2000

Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms: Issues and Preliminary Results
Massimo Poesio
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Specifying the Parameters of Centering Theory: a Corpus-Based Evaluation using Text from Application-Oriented Domains
M. Poesio | H. Cheng | R. Henschel | J. Hitzeman | R. Kibble | R. Stevenson
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

Semantic Annotation for Generation: Issues in Annotating a Corpus to Develop and Evaluate Discourse Entity Realization Algorithms
Massimo Poesio
Proceedings of the COLING-2000 Workshop on Semantic Annotation and Intelligent Content

1999

The MATE meta-scheme for coreference in dialogues in multiple languages
M. Poesio | F. Bruneseaux | L. Romary
Towards Standards and Tools for Discourse Tagging

1998

Long Distance Pronominalisation and Global Focus
Janet Hitzeman | Massimo Poesio
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

A Corpus-based Investigation of Definite Description Use
Massimo Poesio | Renata Vieira
Computational Linguistics, Volume 24, Number 2, June 1998

Long Distance Pronominalisation and Global Focus
Janet Hitzeman | Massimo Poesio
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

1997

Resolving bridging references in unrestricted text
Massimo Poesio | Renata Vieira | Simone Teufel
Operational Factors in Practical, Robust Anaphora Resolution for Unrestricted Texts

1996

Book Reviews: Logic and Lexicon
Massimo Poesio
Computational Linguistics, Volume 22, Number 1, March 1996

1993

Temporal Centering
Megumi Kameyama | Rebecca Passonneau | Massimo Poesio
31st Annual Meeting of the Association for Computational Linguistics

Assigning a Semantic Scope to Operators
Massimo Poesio
31st Annual Meeting of the Association for Computational Linguistics

Co-authors

Olga Uryupina 10

Tommaso Fornaciari 9

Sameer Pradhan 9

Alexandra Uma 9

Abdulrahman Aloraini 8

Yannick Versley 8

Mijail Kabadjov 7

Nafise Sadat Moosavi 7

Maciej Ogrodniczuk 7

Janet Hitzeman 6

Renata Vieira 6

Barbara Plank 5

Simone Paolo Ponzetto 5

Michael Strube 5

Maha Althobaiti 4

Janosch Haber 4

Ramesh Manuvinakurike 4

Alessandro Moschitti 4

Andrew J. Anderson 3

Valerio Basile 3

Renate Henschel 3

Elisa Leonardelli 3

Chris Mellish 3

Anna Nedoluzhko 3

Michal Novák 3

Maja Pavlovic 3

Matthew Purver 3

Kepa Joseba Rodriguez 3

Sriparna Saha 3

Olivia Sanchez-Graillet 3

Hinrich Schütze 3

Josef Steinberger 3

Xiaofeng Yang 3

Ayman Alhelbawy 2

Wateen Abdullah Aliady 2

Abdulrahman Almuhareb 2

Fatima Althani 2

Richard Bartle 2

Samuel Broscheit 2

Maris Camilleri 2

Vladimir Eidelman 2

Elisabetta Fersini 2

Yulia Grishina 2

Daniil Ignatev 2

Nikiforos Karamanis 2

Jon Oberlander 2

Denis Paperno 2

Roonak Rezvani 2

Giuseppe Riccardi 2

Sabine Schulte im Walde 2

Rosemary Stevenson 2

Sina Zarrieß 2

Gavin Abercrombie 1

Dang Thi Thao Anh 1

Xabier Arregi 1

Ayoub Bagheri 1

Raffaella Bernardi 1

Federico Bianchi 1

Elizabeth Bishop 1

Francesca Bonin 1

Ulisse Bordignon 1

Antonella Bristot 1

Andrea Bruera 1

F. Bruneseaux 1

Bob Carpenter 1

Paloma Carretero Garcia 1

Silvia Casola 1

Federica Cavicchio 1

Fabio Cavulli 1

Sophie Chesney 1

Stephen Clark 1

Anne De Roeck 1

Francesca Delogu 1

Barbara Di Eugenio 1

Stefanie Dipper 1

Giovanni Duca 1

Anca Dumitrache 1

Arantza Díaz de Ilarraza 1

Diego Frassinelli 1

Paloma Garcia 1

Anastasia Giachanou 1

George Giannakopoulos 1

Christian Girardi 1

Rodrigo Goulart 1

Abduelbaset Goweder 1

Ian Paul Grant 1

Michael Götze 1

Veronique Hoste 1

Tomonori Ishikawa 1

Megumi Kameyama 1

Casey Kennington 1

Rodger Kibble 1

Osman Doruk Kicikoglu 1

Varada Kolhatkar 1

Margareta A. Kulcsar 1

Alessandro Lenci 1

Maria Liakata 1

Christopher Madge 1

Inderjeet Mani 1

Axel Maroudas 1

M. Antònia Martí 1

Colin Matheson 1

Tristan Miller 1

Hadi Mohammadi 1

Pablo Mosteiro 1

Maria Mouratidi 1

Lluís Màrquez 1

Mark-Christoph Müller 1

Truc-Vien T. Nguyen 1

Aronne Noriller 1

Martha Palmer 1

Rebecca J. Passonneau 1

Alexandros Potamianos 1

James Pustejovsky 1

Joanna Rabiega-Wiśniewska 1

Christian Raymond 1

Marta Recasens 1

Jeff Reynolds 1

Verena Rieser 1

Hannes Rieser 1

Kepa Rodriguez 1

Lorenza Romano 1

Laurent Romary 1

Marika Sarzotti 1

David Schlangen 1

Utpal Kumar Sikdar 1

Fabian Simonjetz 1

Edwin Simpson 1

Ander Soraluze 1

Evgeny Stepanov 1

Mariona Taulé 1

Simone Teufel 1

Jorge Valderrama 1

Zhenrong Yang 1

Farbod Zamani 1

Roberto Zanoli 1

Hugo Zaragoza 1

Heike Zinsmeister 1

Etiene da Cruz Dalcol 1

Zdeněk Žabokrtský 1

Venues

NLPerspectives 4