Raquel Fernández

Also published as: Raquel Fernandez


2024

pdf bib
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation
Jirui Qi | Gabriele Sarti | Raquel Fernández | Arianna Bisazza
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Ensuring the verifiability of model answers is a fundamental challenge for retrieval-augmented generation (RAG) in the question answering (QA) domain. Recently, self-citation prompting was proposed to make large language models (LLMs) generate citations to supporting documents along with their answers. However, self-citing LLMs often struggle to match the required format, refer to non-existent sources, and fail to faithfully reflect LLMs’ context usage throughout the generation. In this work, we present MIRAGE – Model Internals-based RAG Explanations – a plug-and-play approach using model internals for faithful answer attribution in RAG applications. MIRAGE detects context-sensitive answer tokens and pairs them with retrieved documents contributing to their prediction via saliency methods. We evaluate our proposed approach on a multilingual extractive QA dataset, finding high agreement with human answer attribution. On open-ended QA, MIRAGE achieves citation quality and efficiency comparable to self-citation while also allowing for a finer-grained control of attribution parameters. Our qualitative evaluation highlights the faithfulness of MIRAGE’s attributions and underscores the promising application of model internals for RAG answer attribution. Code and data released at https://github.com/Betswish/MIRAGE.
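
As a rough illustration of the attribution idea described above (not the released MIRAGE implementation), the Python sketch below assumes per-token saliency scores have already been computed and simply assigns each context-sensitive answer token to the retrieved document that contributes most to it; the function name, threshold, and toy data are all hypothetical.

```python
import numpy as np

def attribute_answer_tokens(saliency, doc_spans, top_k=1, sensitivity_threshold=0.5):
    """Toy answer attribution from precomputed saliency scores.

    saliency: array of shape (n_answer_tokens, n_context_tokens); each row holds
        the saliency of one generated answer token w.r.t. every context token
        (computed elsewhere, e.g. with a gradient-based saliency method).
    doc_spans: (start, end) positions of each retrieved document in the context.
    Returns, per answer token, the indices of the attributed documents, or an
    empty list if the token is not considered context-sensitive.
    """
    attributions = []
    for token_saliency in saliency:
        # Aggregate saliency mass per retrieved document.
        per_doc = np.array([token_saliency[start:end].sum() for start, end in doc_spans])
        # Crude context-sensitivity check: the token must draw enough total
        # saliency from the retrieved context at all.
        if per_doc.sum() < sensitivity_threshold:
            attributions.append([])
            continue
        attributions.append([int(i) for i in np.argsort(per_doc)[::-1][:top_k]])
    return attributions

# Toy example: 3 answer tokens, 6 context tokens split over 2 retrieved documents.
sal = np.array([[0.1, 0.1, 0.0, 0.6, 0.7, 0.2],   # mostly driven by document 1
                [0.0, 0.0, 0.1, 0.0, 0.1, 0.0],   # barely context-sensitive
                [0.5, 0.4, 0.3, 0.0, 0.1, 0.0]])  # mostly driven by document 0
print(attribute_answer_tokens(sal, doc_spans=[(0, 3), (3, 6)]))  # [[1], [], [0]]
```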

pdf bib
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
Aditya Kaushik Surikuchi | Raquel Fernández | Sandro Pezzelle
Findings of the Association for Computational Linguistics: EMNLP 2024

Visual storytelling consists in generating a natural language story given a temporally ordered sequence of images. This task is not only challenging for models, but also very difficult to evaluate with automatic metrics since there is no consensus about what makes a story ‘good’. In this paper, we introduce a novel method that measures story quality in terms of human likeness regarding three key aspects highlighted in previous work: visual grounding, coherence, and repetitiveness. We then use this method to evaluate the stories generated by several models, showing that the foundation model LLaVA obtains the best result, but only slightly so compared to TAPM, a 50-times smaller visual storytelling model. Upgrading the visual and language components of TAPM results in a model that yields competitive performance with a relatively low number of parameters. Finally, we carry out a human evaluation study, whose results suggest that a ‘good’ story may require more than a human-like level of visual grounding, coherence, and repetition.

pdf bib
Asking the Right Question at the Right Time: Human and Model Uncertainty Guidance to Ask Clarification Questions
Alberto Testoni | Raquel Fernández
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Clarification questions are an essential dialogue tool to signal misunderstanding, ambiguities, and under-specification in language use. While humans are able to resolve uncertainty by asking questions from childhood onwards, modern dialogue systems struggle to generate effective questions. To make progress in this direction, in this work we take a collaborative dialogue task as a testbed and study how model uncertainty relates to human uncertainty—an as yet under-explored problem. We show that model uncertainty does not mirror human clarification-seeking behavior, which suggests that using human clarification questions as supervision for deciding when to ask may not be the most effective way to resolve model uncertainty. To address this issue, we propose an approach to generating clarification questions based on model uncertainty estimation, compare it to several alternatives, and show that it leads to significant improvements in terms of task success. Our findings highlight the importance of equipping dialogue systems with the ability to assess their own uncertainty and exploit it in interaction.

pdf bib
Describing Images Fast and Slow: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes
Ece Takmaz | Sandro Pezzelle | Raquel Fernández
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

There is an intricate relation between the properties of an image and how humans behave while describing the image. This behavior shows ample variation, as manifested in human signals such as eye movements and the time at which humans start to describe the image. Despite the value of such signals of visuo-linguistic variation, they are virtually disregarded in the training of current pretrained models, which motivates further investigation. Using a corpus of Dutch image descriptions with concurrently collected eye-tracking data, we explore the nature of the variation in visuo-linguistic signals, and find that they correlate with each other. Given this result, we hypothesize that variation stems partly from the properties of the images, and explore whether image representations encoded by pretrained vision encoders can capture such variation. Our results indicate that pretrained models do so to a weak-to-moderate degree, suggesting that the models lack biases about what makes a stimulus complex for humans and what leads to variations in human outputs.

pdf bib
Interpreting Predictive Probabilities: Model Confidence or Human Label Variation?
Joris Baan | Raquel Fernández | Barbara Plank | Wilker Aziz
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good representation of uncertainty by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.

pdf bib
Don’t Buy it! Reassessing the Ad Understanding Abilities of Contrastive Multimodal Models
Anna Bavaresco | Alberto Testoni | Raquel Fernández
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Image-based advertisements are complex multimodal stimuli that often contain unusual visual elements and figurative language. Previous research on automatic ad understanding has reported impressive zero-shot accuracy of contrastive vision-and-language models (VLMs) on an ad-explanation retrieval task. Here, we examine the original task setup and show that contrastive VLMs can solve it by exploiting grounding heuristics. To control for this confound, we introduce TRADE, a new evaluation test set with adversarial grounded explanations. While these explanations look implausible to humans, we show that they “fool” four different contrastive VLMs. Our findings highlight the need for an improved operationalisation of automatic ad understanding that truly evaluates VLMs’ multimodal reasoning abilities. We make our code and TRADE available at https://github.com/dmg-illc/trade.

2023

pdf bib
Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis
Mario Giulianelli | Iris Luden | Raquel Fernandez | Andrey Kutuzov
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialised Flan-T5 language model, and the most prototypical definition in a usage cluster is chosen as the sense label. We demonstrate how the resulting sense labels can make existing approaches to semantic change analysis more interpretable, and how they can allow users — historical linguists, lexicographers, or social scientists — to explore and intuitively explain diachronic trajectories of word meaning. Semantic change analysis is only one of many possible applications of the ‘definitions as representations’ paradigm. Beyond being human-readable, contextualised definitions also outperform token or usage sentence embeddings in word-in-context semantic similarity judgements, making them a new promising type of lexical representation for NLP.

pdf bib
Speaking the Language of Your Listener: Audience-Aware Adaptation via Plug-and-Play Theory of Mind
Ece Takmaz | Nicolo’ Brandizzi | Mario Giulianelli | Sandro Pezzelle | Raquel Fernandez
Findings of the Association for Computational Linguistics: ACL 2023

Dialogue participants may have varying levels of knowledge about the topic under discussion. In such cases, it is essential for speakers to adapt their utterances by taking their audience into account. Yet, it is an open question how such adaptation can be modelled in computational agents. In this paper, we model a visually grounded referential game between a knowledgeable speaker and a listener with more limited visual and linguistic experience. Inspired by psycholinguistic theories, we endow our speaker with the ability to adapt its referring expressions via a simulation module that monitors the effectiveness of planned utterances from the listener’s perspective. We propose an adaptation mechanism building on plug-and-play approaches to controlled language generation, where utterance generation is steered on the fly by the simulator without finetuning the speaker’s underlying language model. Our results and analyses show that our approach is effective: the speaker’s utterances become closer to the listener’s domain of expertise, which leads to higher communicative success.

pdf bib
GROOViST: A Metric for Grounding Objects in Visual Storytelling
Aditya Surikuchi | Sandro Pezzelle | Raquel Fernández
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

A proper evaluation of stories generated for a sequence of images—the task commonly referred to as visual storytelling—must consider multiple aspects, such as coherence, grammatical correctness, and visual grounding. In this work, we focus on evaluating the degree of grounding, that is, the extent to which a story is about the entities shown in the images. We analyze current metrics, both designed for this purpose and for general vision-text alignment. Given their observed shortcomings, we propose a novel evaluation tool, GROOViST, that accounts for cross-modal dependencies, temporal misalignments (the fact that the order in which entities appear in the story and the image sequence may not match), and human intuitions on visual grounding. An additional advantage of GROOViST is its modular design, where the contribution of each component can be assessed and interpreted individually.

pdf bib
Information Value: Measuring Utterance Predictability as Distance from Plausible Alternatives
Mario Giulianelli | Sarenne Wallbridge | Raquel Fernández
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We present information value, a measure which quantifies the predictability of an utterance relative to a set of plausible alternatives. We introduce a method to obtain interpretable estimates of information value using neural text generators, and exploit their psychometric predictive power to investigate the dimensions of predictability that drive human comprehension behaviour. Information value is a stronger predictor of utterance acceptability in written and spoken dialogue than aggregates of token-level surprisal and it is complementary to surprisal for predicting eye-tracked reading times.
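
The following Python sketch illustrates one way the core idea could be operationalised, assuming utterance embeddings and generator-sampled alternatives are available; it is not the paper's exact estimator, and the function name, distance choice, and toy data are assumptions.

```python
import numpy as np

def information_value(observed_embedding, alternative_embeddings, k=5):
    """Toy information-value estimate: how far the observed utterance lies from
    plausible alternatives (here, mean cosine distance to its k nearest ones).

    observed_embedding: vector for the utterance actually produced.
    alternative_embeddings: matrix of vectors for alternatives sampled from a
        text generator conditioned on the same dialogue context.
    A higher value means the utterance is less predictable given the context.
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

    obs = normalize(np.asarray(observed_embedding)[None, :])
    alts = normalize(np.asarray(alternative_embeddings))
    distances = 1.0 - (alts @ obs.T).ravel()   # cosine distances to each alternative
    nearest = np.sort(distances)[:k]           # keep only the k closest alternatives
    return float(nearest.mean())

rng = np.random.default_rng(0)
alternatives = rng.normal(size=(20, 384))              # e.g. sentence-embedding size
observed = alternatives[0] + 0.1 * rng.normal(size=384)  # close to one alternative
print(information_value(observed, alternatives))
```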

pdf bib
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
Xinyi Chen | Raquel Fernández | Sandro Pezzelle
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Despite the impressive performance achieved by pre-trained language-and-vision models in downstream tasks, it remains an open question whether this reflects a proper understanding of image-text interaction. In this work, we explore to what extent they handle basic linguistic constructions—active-passive voice, coordination, and relative clauses—that even preschool children can typically master. We present BLA, a novel, automatically constructed benchmark to evaluate multimodal models on these Basic Language Abilities. We show that different types of Transformer-based systems, such as CLIP, ViLBERT, and BLIP2, generally struggle with BLA in a zero-shot setting, in line with previous findings. Our experiments, in particular, show that most of the tested models only marginally benefit when fine-tuned or prompted with construction-specific samples. Yet, the generative BLIP2 shows promising trends, especially in an in-context learning setting. This opens the door to using BLA not only as an evaluation benchmark but also to improve models’ basic language abilities.

pdf bib
Cross-Lingual Consistency of Factual Knowledge in Multilingual Language Models
Jirui Qi | Raquel Fernández | Arianna Bisazza
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Multilingual large-scale Pretrained Language Models (PLMs) have been shown to store considerable amounts of factual knowledge, but large variations are observed across languages. With the ultimate goal of ensuring that users with different language backgrounds obtain consistent feedback from the same model, we study the cross-lingual consistency (CLC) of factual knowledge in various multilingual PLMs. To this end, we propose a Ranking-based Consistency (RankC) metric to evaluate knowledge consistency across languages independently from accuracy. Using this metric, we conduct an in-depth analysis of the determining factors for CLC, both at model level and at language-pair level. Among other results, we find that increasing model size leads to higher factual probing accuracy in most languages, but does not improve cross-lingual consistency. Finally, we conduct a case study on CLC when new factual associations are inserted in the PLMs via model editing. Results on a small sample of facts inserted in English reveal a clear pattern whereby the new piece of knowledge transfers only to languages with which English has a high RankC score. All code and data are released at https://github.com/Betswish/Cross-Lingual-Consistency.
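
As a simplified illustration of a ranking-based consistency score (the actual RankC metric in the released code may differ), the sketch below compares the model's top-k candidate rankings for the same facts in two languages using a rank-weighted overlap; the weighting scheme, function name, and example data are assumptions.

```python
def rank_consistency(rankings_lang1, rankings_lang2, k=5):
    """Toy ranking-based consistency between two languages.

    rankings_lang1 / rankings_lang2: for each probed fact, a list of candidate
        answers ranked by model probability in that language.
    Returns the average weighted overlap of the top-k candidates, giving more
    weight to agreement at higher ranks (a simplification of RankC).
    """
    weights = [1.0 / (rank + 1) for rank in range(k)]
    scores = []
    for r1, r2 in zip(rankings_lang1, rankings_lang2):
        overlap = 0.0
        for depth in range(1, k + 1):
            shared = len(set(r1[:depth]) & set(r2[:depth]))
            overlap += weights[depth - 1] * shared / depth
        scores.append(overlap / sum(weights))
    return sum(scores) / len(scores)

en = [["Paris", "Lyon", "Rome", "Berlin", "Madrid"]]
nl = [["Paris", "Rome", "Lyon", "Madrid", "Berlin"]]
print(rank_consistency(en, nl))  # high score for near-identical rankings
```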

pdf bib
What Comes Next? Evaluating Uncertainty in Neural Text Generators Against Human Production Variability
Mario Giulianelli | Joris Baan | Wilker Aziz | Raquel Fernández | Barbara Plank
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In Natural Language Generation (NLG) tasks, for any input, multiple communicative goals are plausible, and any goal can be put into words, or produced, in multiple ways. We characterise the extent to which human production varies lexically, syntactically, and semantically across four NLG tasks, connecting human production variability to aleatoric or data uncertainty. We then inspect the space of output strings shaped by a generation system’s predicted probability distribution and decoding algorithm to probe its uncertainty. For each test input, we measure the generator’s calibration to human production variability. Following this instance-level approach, we analyse NLG models and decoding strategies, demonstrating that probing a generator with multiple samples and, when possible, multiple references, provides the level of detail necessary to gain understanding of a model’s representation of uncertainty.

2022

pdf bib
Less Descriptive yet Discriminative: Quantifying the Properties of Multimodal Referring Utterances via CLIP
Ece Takmaz | Sandro Pezzelle | Raquel Fernández
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

In this work, we use a transformer-based pre-trained multimodal model, CLIP, to shed light on the mechanisms employed by human speakers when referring to visual entities. In particular, we use CLIP to quantify the degree of descriptiveness (how well an utterance describes an image in isolation) and discriminativeness (to what extent an utterance is effective in picking out a single image among similar images) of human referring utterances within multimodal dialogues. Overall, our results show that utterances become less descriptive over time while their discriminativeness remains unchanged. Through analysis, we propose that this trend could be due to participants relying on the previous mentions in the dialogue history, as well as being able to distill the most discriminative information from the visual context. In general, our study opens up the possibility of using this and similar models to quantify patterns in human data and shed light on the underlying cognitive mechanisms.
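
A minimal sketch of how descriptiveness and discriminativeness could be computed from CLIP-style image-text embeddings is given below; the embeddings are random placeholders, the function name is hypothetical, and the exact formulation used in the paper may differ.

```python
import numpy as np

def descriptiveness_and_discriminativeness(utterance_emb, image_embs, target_idx):
    """Toy CLIP-style scores for one referring utterance.

    utterance_emb: text embedding of the utterance (from a CLIP-like model).
    image_embs: embeddings of the candidate images in the shared space, with
        image_embs[target_idx] being the image actually referred to.
    Descriptiveness: similarity between the utterance and the target image alone.
    Discriminativeness: probability mass the utterance puts on the target image
        relative to the distractors (softmax over similarities).
    """
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

    text = normalize(np.asarray(utterance_emb)[None, :])
    images = normalize(np.asarray(image_embs))
    sims = (images @ text.T).ravel()
    descriptiveness = float(sims[target_idx])
    probs = np.exp(sims) / np.exp(sims).sum()
    discriminativeness = float(probs[target_idx])
    return descriptiveness, discriminativeness

rng = np.random.default_rng(1)
images = rng.normal(size=(6, 512))                  # six candidate images
utterance = images[2] + 0.2 * rng.normal(size=512)  # utterance close to image 2
print(descriptiveness_and_discriminativeness(utterance, images, target_idx=2))
```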

pdf bib
Controllable Text Generation for All Ages: Evaluating a Plug-and-Play Approach to Age-Adapted Dialogue
Lennert Jansen | Štěpán Lars Laichter | Arabella Sinclair | Margot van der Goot | Raquel Fernandez | Sandro Pezzelle
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

To be trusted and perceived as natural and coherent, conversational systems must adapt to the language of their users. While personalized dialogue is a promising direction, controlling generation for fine-grained language features remains a challenge in this approach. A recent line of research showed the effectiveness of leveraging pre-trained language models toward adapting to a text’s topic or sentiment. In this study, we build on these approaches and focus on a higher-level dimension of language variation: speakers’ age. We frame the task as dialogue response generation and test methods based on bag-of-words (BoW) and neural discriminators (Disc) to condition the output of GPT-2 and DialoGPT without altering the parameters of the language models. We show that Disc models achieve a higher degree of detectable control than BoW models based on automatic evaluation. In contrast, humans can partially detect age differences in BoW but not Disc responses. Since BoW responses are deemed better than Disc ones by humans, simple controllable methods thus appear to be a better tradeoff between adaptation and language quality. Our work confirms the challenges of adapting to higher-level dimensions of language variation. Moreover, it highlights the need to evaluate natural language generation thoroughly.

pdf bib
Stop Measuring Calibration When Humans Disagree
Joris Baan | Wilker Aziz | Barbara Plank | Raquel Fernandez
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - including class frequency, ranking and entropy.
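
The sketch below illustrates, with invented measure names, what instance-level comparisons between a model's predictive distribution and the full distribution of human judgements might look like (total variation distance, entropy gap, rank correlation); these are illustrative choices, not the paper's exact definitions.

```python
import numpy as np
from scipy.stats import entropy, spearmanr

def instance_level_calibration(model_probs, human_probs):
    """Toy per-instance comparisons of a predictive distribution with the
    distribution of human judgements over the same classes.

    Returns total variation distance, absolute entropy difference, and Spearman
    correlation of the class rankings (illustrative measures only).
    """
    p = np.asarray(model_probs, dtype=float)
    q = np.asarray(human_probs, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return {
        "tvd": float(0.5 * np.abs(p - q).sum()),
        "entropy_gap": float(abs(entropy(p) - entropy(q))),
        "rank_corr": float(spearmanr(p, q).correlation),
    }

# ChaosNLI-style example: annotators split over entailment/neutral/contradiction.
human = [0.45, 0.40, 0.15]
model = [0.80, 0.15, 0.05]   # overconfident in the majority class
print(instance_level_calibration(model, human))
```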

pdf bib
AnaLog: Testing Analytical and Deductive Logic Learnability in Language Models
Samuel Ryb | Mario Giulianelli | Arabella Sinclair | Raquel Fernández
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

We investigate the extent to which pre-trained language models acquire analytical and deductive logical reasoning capabilities as a side effect of learning word prediction. We present AnaLog, a natural language inference task designed to probe models for these capabilities, controlling for different invalid heuristics the models may adopt instead of learning the desired generalisations. We test four language models on AnaLog, finding that they have all learned, to different extents, to encode information that is predictive of entailment beyond shallow heuristics such as lexical overlap and grammaticality. We closely analyse the best-performing language model and show that while it performs more consistently than other language models across logical connectives and reasoning domains, it is still sensitive to lexical and syntactic variations in the realisation of logical statements.

pdf bib
Construction Repetition Reduces Information Rate in Dialogue
Mario Giulianelli | Arabella Sinclair | Raquel Fernández
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Speakers repeat constructions frequently in dialogue. Due to their peculiar information-theoretic properties, repetitions can be thought of as a strategy for cost-effective communication. In this study, we focus on the repetition of lexicalised constructions—i.e., recurring multi-word units—in English open-domain spoken dialogues. We hypothesise that speakers use construction repetition to mitigate information rate, leading to an overall decrease in utterance information content over the course of a dialogue. We conduct a quantitative analysis, measuring the information content of constructions and that of their containing utterances, estimating information content with an adaptive neural language model. We observe that construction usage lowers the information content of utterances. This facilitating effect (i) increases throughout dialogues, (ii) is boosted by repetition, (iii) grows as a function of repetition frequency and density, and (iv) is stronger for repetitions of referential constructions.

pdf bib
Structural Persistence in Language Models: Priming as a Window into Abstract Language Representations
Arabella Sinclair | Jaap Jumelet | Willem Zuidema | Raquel Fernández
Transactions of the Association for Computational Linguistics, Volume 10

We investigate the extent to which modern neural language models are susceptible to structural priming, the phenomenon whereby the structure of a sentence makes the same structure more probable in a follow-up sentence. We explore how priming can be used to study the potential of these models to learn abstract structural information, which is a prerequisite for good performance on tasks that require natural language understanding skills. We introduce a novel metric and release Prime-LM, a large corpus where we control for various linguistic factors that interact with priming strength. We find that Transformer models indeed show evidence of structural priming, but also that the generalizations they learned are to some extent modulated by semantic information. Our experiments also show that the representations acquired by the models may not only encode abstract sequential structure but involve a certain level of hierarchical syntactic information. More generally, our study shows that the priming paradigm is a useful, additional tool for gaining insights into the capacities of language models and opens the door to future priming-based investigations that probe the model’s internal states.

2021

pdf bib
Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation
Sandro Pezzelle | Ece Takmaz | Raquel Fernández
Transactions of the Association for Computational Linguistics, Volume 9

This study carries out a systematic intrinsic evaluation of the semantic representations learned by state-of-the-art pre-trained multimodal Transformers. These representations are claimed to be task-agnostic and shown to help on many downstream language-and-vision tasks. However, the extent to which they align with human semantic intuitions remains unclear. We experiment with various models and obtain static word representations from the contextualized ones they learn. We then evaluate them against the semantic judgments provided by human speakers. In line with previous evidence, we observe a generalized advantage of multimodal representations over language-only ones on concrete word pairs, but not on abstract ones. On the one hand, this confirms the effectiveness of these models in aligning language and vision, which results in better semantic representations for concepts that are grounded in images. On the other hand, models are shown to follow different representation learning patterns, which sheds some light on how and when they perform multimodal integration.

pdf bib
Probing Cross-Modal Representations in Multi-Step Relational Reasoning
Iuliia Parfenova | Desmond Elliott | Raquel Fernández | Sandro Pezzelle
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)

We investigate the representations learned by vision and language models in tasks that require relational reasoning. Focusing on the problem of assessing the relative size of objects in abstract visual contexts, we analyse both one-step and two-step reasoning. For the latter, we construct a new dataset of three-image scenes and define a task that requires reasoning at the level of the individual images and across images in a scene. We probe the learned model representations using diagnostic classifiers. Our experiments show that pretrained multimodal transformer-based architectures can perform higher-level relational reasoning, and are able to learn representations for novel tasks and data that are very different from what was seen in pretraining.

pdf bib
Semantic shift in social networks
Bill Noble | Asad Sayeed | Raquel Fernández | Staffan Larsson
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Just as the meaning of words is tied to the communities in which they are used, so too is semantic change. But how does lexical semantic change manifest differently across different communities? In this work, we investigate the relationship between community structure and semantic change in 45 communities from the social media website Reddit. We use distributional methods to quantify lexical semantic change and induce a social network on communities, based on interactions between members. We explore the relationship between semantic change and the clustering coefficient of a community’s social network graph, as well as community size and stability. While none of these factors are found to be significant on their own, we report a significant effect of their three-way interaction. We also report on significant word-level effects of frequency and change in frequency, which replicate previous findings.

pdf bib
Analysing Human Strategies of Information Transmission as a Function of Discourse Context
Mario Giulianelli | Raquel Fernández
Proceedings of the 25th Conference on Computational Natural Language Learning

Speakers are thought to use rational information transmission strategies for efficient communication (Genzel and Charniak, 2002; Aylett and Turk, 2004; Jaeger and Levy, 2007). Previous work analysing these strategies in sentence production has failed to take into account how the information content of sentences varies as a function of the available discourse context. In this study, we estimate sentence information content within discourse context. We find that speakers transmit information at a stable rate—i.e., rationally—in English newspaper articles but that this rate decreases in spoken open domain and written task-oriented dialogues. We also observe that speakers’ choices are not oriented towards local uniformity of information, which is another hypothesised rational strategy. We suggest that a more faithful model of communication should explicitly include production costs and goal-oriented rewards.
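
As a rough sketch of how sentence information content given discourse context can be estimated with an autoregressive language model (the paper's estimator and model choice may differ), consider the following; GPT-2 is used here purely as an assumed stand-in, and the function name and example sentences are hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sentence_information_content(context, sentence, model_name="gpt2"):
    """Toy estimate of a sentence's information content given its discourse
    context: mean token surprisal (in bits) under an autoregressive LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    context_ids = tokenizer(context, return_tensors="pt").input_ids
    sentence_ids = tokenizer(" " + sentence, return_tensors="pt").input_ids
    input_ids = torch.cat([context_ids, sentence_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # Surprisal of each token, conditioned on the context and preceding tokens.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_surprisal = -log_probs[torch.arange(targets.size(0)), targets] / torch.log(torch.tensor(2.0))
    # Keep only the surprisal of the sentence tokens (not the context tokens).
    sentence_surprisal = token_surprisal[context_ids.size(1) - 1:]
    return float(sentence_surprisal.mean())

print(sentence_information_content(
    "The councillors refused the demonstrators a permit.",
    "They feared violence."))
```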

pdf bib
Is Information Density Uniform in Task-Oriented Dialogues?
Mario Giulianelli | Arabella Sinclair | Raquel Fernández
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The Uniform Information Density principle states that speakers plan their utterances to reduce fluctuations in the density of the information transmitted. In this paper, we test whether, and within which contextual units, this principle holds in task-oriented dialogues. We show that there is evidence supporting the principle in written dialogues where participants play a cooperative reference game as well as in spoken dialogues involving instruction giving and following. Our study underlines the importance of identifying the relevant contextual components, showing that information content increases particularly within topically and referentially related contextual units.

2020

pdf bib
Words are the Window to the Soul: Language-based User Representations for Fake News Detection
Marco Del Tredici | Raquel Fernández
Proceedings of the 28th International Conference on Computational Linguistics

Cognitive and social traits of individuals are reflected in language use. Moreover, individuals who are prone to spread fake news online often share common traits. Building on these ideas, we introduce a model that creates representations of individuals on social media based only on the language they produce, and use them to detect fake news. We show that language-based user representations are beneficial for this task. We also present an extended analysis of the language of fake news spreaders, showing that its main features are mostly domain independent and consistent across two English datasets. Finally, we exploit the relation between language use and connections in the social graph to assess the presence of the Echo Chamber effect in our data.

pdf bib
Analysing Lexical Semantic Change with Contextualised Word Representations
Mario Giulianelli | Marco Del Tredici | Raquel Fernández
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper presents the first unsupervised approach to lexical semantic change that makes use of contextualised word representations. We propose a novel method that exploits the BERT neural language model to obtain representations of word usages, clusters these representations into usage types, and measures change along time with three proposed metrics. We create a new evaluation dataset and show that the model representations and the detected semantic shifts are positively correlated with human judgements. Our extensive qualitative analysis demonstrates that our method captures a variety of synchronic and diachronic linguistic phenomena. We expect our work to inspire further research in this direction.
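
The pipeline can be sketched roughly as follows, assuming contextualised usage embeddings (e.g. from BERT) have already been extracted for two time periods; the clustering and divergence choices below are illustrative simplifications rather than the paper's three metrics, and all names and data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

def usage_distribution_shift(embs_t1, embs_t2, n_usage_types=4, seed=0):
    """Toy semantic-change score from contextualised usage embeddings.

    embs_t1 / embs_t2: arrays of usage embeddings for the same target word,
        taken from an earlier and a later time period (e.g. BERT vectors).
    Clusters all usages into usage types, then returns the Jensen-Shannon
    distance between the two periods' distributions over those types.
    """
    all_usages = np.vstack([embs_t1, embs_t2])
    labels = KMeans(n_clusters=n_usage_types, random_state=seed, n_init=10).fit_predict(all_usages)
    l1, l2 = labels[: len(embs_t1)], labels[len(embs_t1):]
    p = np.bincount(l1, minlength=n_usage_types) / len(l1)
    q = np.bincount(l2, minlength=n_usage_types) / len(l2)
    return float(jensenshannon(p, q))

rng = np.random.default_rng(0)
period1 = rng.normal(loc=0.0, size=(50, 32))
period2 = np.vstack([rng.normal(loc=0.0, size=(25, 32)),
                     rng.normal(loc=3.0, size=(25, 32))])  # a new usage type emerges
print(usage_distribution_shift(period1, period2))
```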

pdf bib
Proceedings of the 24th Conference on Computational Natural Language Learning
Raquel Fernández | Tal Linzen
Proceedings of the 24th Conference on Computational Natural Language Learning

pdf bib
Refer, Reuse, Reduce: Generating Subsequent References in Visual and Conversational Contexts
Ece Takmaz | Mario Giulianelli | Sandro Pezzelle | Arabella Sinclair | Raquel Fernández
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Dialogue participants often refer to entities or situations repeatedly within a conversation, which contributes to its cohesiveness. Subsequent references exploit the common ground accumulated by the interlocutors and hence have several interesting properties, namely, they tend to be shorter and reuse expressions that were effective in previous mentions. In this paper, we tackle the generation of first and subsequent references in visually grounded dialogue. We propose a generation model that produces referring utterances grounded in both the visual and the conversational context. To assess the referring effectiveness of its output, we also implement a reference resolution system. Our experiments and analyses show that the model produces better, more effective referring utterances than a model not grounded in the dialogue context, and generates subsequent references that exhibit linguistic patterns akin to humans.

pdf bib
Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze
Ece Takmaz | Sandro Pezzelle | Lisa Beinborn | Raquel Fernández
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural—particularly when gaze is encoded with a dedicated recurrent component.

2019

pdf bib
The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue
Janosch Haber | Tim Baumgärtner | Ece Takmaz | Lieke Gelderloos | Elia Bruni | Raquel Fernández
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper introduces the PhotoBook dataset, a large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation. Taking inspiration from seminal work on dialogue analysis, we propose a data-collection task formulated as a collaborative game prompting two online participants to refer to images utilising both their visual context as well as previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected. To further illustrate the novel features of the dataset, we propose a baseline model for reference resolution which uses a simple method to take into account shared information accumulated in a reference chain. Our results show that this information is particularly important to resolve later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.

pdf bib
Psycholinguistics Meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering
Claudio Greco | Barbara Plank | Raquel Fernández | Raffaella Bernardi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We study the issue of catastrophic forgetting in the context of neural multimodal approaches to Visual Question Answering (VQA). Motivated by evidence from psycholinguistics, we devise a set of linguistically-informed VQA tasks, which differ by the types of questions involved (Wh-questions and polar questions). We test what impact task difficulty has on continual learning, and whether the order in which a child acquires question types facilitates computational models. Our results show that dramatic forgetting is at play and that task difficulty and order matter. Two well-known current continual learning methods mitigate the problem only to a limited degree.

pdf bib
Short-Term Meaning Shift: A Distributional Exploration
Marco Del Tredici | Raquel Fernández | Gemma Boleda
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present the first exploration of meaning shift over short periods of time in online communities using distributional representations. We create a small annotated dataset and use it to assess the performance of a standard model for meaning shift detection on short-term meaning shift. We find that the model has problems distinguishing meaning shift from referential phenomena, and propose a measure of contextual variability to remedy this.

pdf bib
Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat
Ravi Shekhar | Aashish Venkatesh | Tim Baumgärtner | Elia Bruni | Barbara Plank | Raffaella Bernardi | Raquel Fernández
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a grounded dialogue state encoder which addresses a foundational issue on how to integrate visual grounding with dialogue system components. As a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal is to identify an object in a complex visual scene by asking a sequence of yes/no questions. Our visually-grounded encoder leverages synergies between guessing and asking questions, as it is trained jointly using multi-task learning. We further enrich our model via a cooperative learning regime. We show that the introduction of both the joint architecture and cooperative learning lead to accuracy improvements over the baseline system. We compare our approach to an alternative system which extends the baseline with reinforcement learning. Our in-depth analysis shows that the linguistic skills of the two models differ dramatically, despite approaching comparable performance levels. This points at the importance of analyzing the linguistic output of competing systems beyond numeric comparison solely based on task success.

pdf bib
Is the Red Square Big? MALeViC: Modeling Adjectives Leveraging Visual Contexts
Sandro Pezzelle | Raquel Fernández
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This work aims at modeling how the meaning of gradable adjectives of size (‘big’, ‘small’) can be learned from visually-grounded contexts. Inspired by cognitive and linguistic evidence showing that the use of these expressions relies on setting a threshold that is dependent on a specific context, we investigate the ability of multi-modal models in assessing whether an object is ‘big’ or ‘small’ in a given visual scene. In contrast with the standard computational approach that simplistically treats gradable adjectives as ‘fixed’ attributes, we pose the problem as relational: to be successful, a model has to consider the full visual context. By means of four main tasks, we show that state-of-the-art models (but not a relatively strong baseline) can learn the function subtending the meaning of size adjectives, though their performance is found to decrease while moving from simple to more complex tasks. Crucially, models fail in developing abstract representations of gradable adjectives that can be used compositionally.

pdf bib
You Shall Know a User by the Company It Keeps: Dynamic Representations for Social Media Users in NLP
Marco Del Tredici | Diego Marcheggiani | Sabine Schulte im Walde | Raquel Fernández
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Information about individuals can help to better understand what they say, particularly in social media where texts are short. Current approaches to modelling social media users pay attention to their social connections, but exploit this information in a static way, treating all connections uniformly. This ignores the fact, well known in sociolinguistics, that an individual may be part of several communities which are not equally relevant in all communicative situations. We present a model based on Graph Attention Networks that captures this observation. It dynamically explores the social graph of a user, computes a user representation given the most relevant connections for a target task, and combines it with linguistic information to make a prediction. We apply our model to three different tasks, evaluate it against alternative models, and analyse the results extensively, showing that it significantly outperforms other current methods.

pdf bib
Big Generalizations with Small Data: Exploring the Role of Training Samples in Learning Adjectives of Size
Sandro Pezzelle | Raquel Fernández
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

In this paper, we experiment with a recently proposed visual reasoning task dealing with quantities – modeling the multimodal, contextually-dependent meaning of size adjectives (‘big’, ‘small’) – and explore the impact of varying the training data on the learning behavior of a state-of-the-art system. In previous work, models have been shown to fail in generalizing to unseen adjective-noun combinations. Here, we investigate whether, and to what extent, seeing some of these cases during training helps a model understand the rule subtending the task, i.e., that being big implies being not small, and vice versa. We show that relatively few examples are enough to understand this relationship, and that developing a specific, mutually exclusive representation of size adjectives is beneficial to the task.

pdf bib
Evaluating the Representational Hub of Language and Vision Models
Ravi Shekhar | Ece Takmaz | Raquel Fernández | Raffaella Bernardi
Proceedings of the 13th International Conference on Computational Semantics - Long Papers

The multimodal models used in the emerging field at the intersection of computational linguistics and computer vision implement the bottom-up processing of the “Hub and Spoke” architecture proposed in cognitive science to represent how the brain processes and combines multi-sensory inputs. In particular, the Hub is implemented as a neural network encoder. We investigate the effect on this encoder of various vision-and-language tasks proposed in the literature: visual question answering, visual reference resolution, and visually grounded dialogue. To measure the quality of the representations learned by the encoder, we use two kinds of analyses. First, we evaluate the encoder pre-trained on the different vision-and-language tasks on an existing “diagnostic task” designed to assess multimodal semantic understanding. Second, we carry out a battery of analyses aimed at studying how the encoder merges and exploits the two modalities.

pdf bib
Proceedings of the Second Workshop on Shortcomings in Vision and Language
Raffaella Bernardi | Raquel Fernandez | Spandana Gella | Kushal Kafle | Christopher Kanan | Stefan Lee | Moin Nabi
Proceedings of the Second Workshop on Shortcomings in Vision and Language

2018

pdf bib
Ask No More: Deciding when to guess in referential visual dialogue
Ravi Shekhar | Tim Baumgärtner | Aashish Venkatesh | Elia Bruni | Raffaella Bernardi | Raquel Fernandez
Proceedings of the 27th International Conference on Computational Linguistics

Our goal is to explore how the abilities brought in by a dialogue manager can be included in end-to-end visually grounded conversational agents. We make initial steps towards this general goal by augmenting a task-oriented visual dialogue model with a decision-making component that decides whether to ask a follow-up question to identify a target referent in an image, or to stop the conversation to make a guess. Our analyses show that adding a decision making component produces dialogues that are less repetitive and that include fewer unnecessary questions, thus potentially leading to more efficient and less unnatural interactions.

pdf bib
The Road to Success: Assessing the Fate of Linguistic Innovations in Online Communities
Marco Del Tredici | Raquel Fernández
Proceedings of the 27th International Conference on Computational Linguistics

We investigate the birth and diffusion of lexical innovations in a large dataset of online social communities. We build on sociolinguistic theories and focus on the relation between the spread of a novel term and the social role of the individuals who use it, uncovering characteristics of innovators and adopters. Finally, we perform a prediction task that allows us to anticipate whether an innovation will successfully spread within a community.

pdf bib
Analysing the potential of seq-to-seq models for incremental interpretation in task-oriented dialogue
Dieuwke Hupkes | Sanne Bouwmeester | Raquel Fernández
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We investigate how encoder-decoder models trained on a synthetic dataset of task-oriented dialogues process disfluencies, such as hesitations and self-corrections. We find that, contrary to earlier results, disfluencies have very little impact on the task success of seq-to-seq models with attention. Using visualisations and diagnostic classifiers, we analyse the representations that are incrementally built by the model, and discover that models develop little to no awareness of the structure of disfluencies. However, adding disfluencies to the data appears to help the model create clearer representations overall, as evidenced by the attention patterns the different models exhibit.

pdf bib
Automatic Evaluation of Neural Personality-based Chatbots
Yujie Xing | Raquel Fernández
Proceedings of the 11th International Conference on Natural Language Generation

Stylistic variation is critical to render the utterances generated by conversational agents natural and engaging. In this paper, we focus on sequence-to-sequence models for open-domain dialogue response generation and propose a new method to evaluate the extent to which such models are able to generate responses that reflect different personality traits.

2017

pdf bib
Adversarial evaluation for open-domain dialogue generation
Elia Bruni | Raquel Fernández
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

We investigate the potential of adversarial evaluation methods for open-domain dialogue generation systems, comparing the performance of a discriminative agent to that of humans on the same task. Our results show that the task is hard, both for automated models and humans, but that a discriminative agent can learn patterns that lead to above-chance performance.

pdf bib
Semantic Variation in Online Communities of Practice
Marco Del Tredici | Raquel Fernández
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Long papers

2016

pdf bib
Linguistic Style Accommodation in Disagreements
Elise van der Pol | Sharon Gieske | Raquel Fernández
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

pdf bib
Questioning Arbitrariness in Language: a Data-Driven Study of Conventional Iconicity
Ekaterina Abramova | Raquel Fernández
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Multimodal Semantic Learning from Child-Directed Input
Angeliki Lazaridou | Grzegorz Chrupała | Raquel Fernández | Marco Baroni
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Raquel Fernandez | Wolfgang Minker | Giuseppe Carenini | Ryuichiro Higashinaka | Ron Artstein | Alesia Gainer
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
PentoRef: A Corpus of Spoken References in Task-oriented Dialogues
Sina Zarrieß | Julian Hough | Casey Kennington | Ramesh Manuvinakurike | David DeVault | Raquel Fernández | David Schlangen
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

PentoRef is a corpus of task-oriented dialogues collected in systematically manipulated settings. The corpus is multilingual, with English and German sections, and overall comprises more than 20,000 utterances. The dialogues are fully transcribed and annotated with referring expressions mapped to objects in corresponding visual scenes, which makes the corpus a rich resource for research on spoken referring expressions in generation and resolution. The corpus includes several sub-corpora that correspond to different dialogue situations where parameters related to interactivity, visual access, and verbal channel have been manipulated in systematic ways. The corpus thus lends itself to very targeted studies of reference in spontaneous dialogue.

pdf bib
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno | Germán Kruszewski | Angeliki Lazaridou | Ngoc Quan Pham | Raffaella Bernardi | Sandro Pezzelle | Marco Baroni | Gemma Boleda | Raquel Fernández
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Data-driven Investigation of Corrective Feedback on Subject Omission Errors in First Language Acquisition
Sarah Hiller | Raquel Fernández
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

2015

pdf bib
Clarifying Intentions in Dialogue: A Corpus Study
Julian J. Schlöder | Raquel Fernández
Proceedings of the 11th International Conference on Computational Semantics

pdf bib
Pragmatic Rejection
Julian J. Schlöder | Raquel Fernández
Proceedings of the 11th International Conference on Computational Semantics

pdf bib
Centre Stage: How Social Network Position Shapes Linguistic Coordination
Bill Noble | Raquel Fernández
Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics

pdf bib
Distributional Semantics in Use
Raffaella Bernardi | Gemma Boleda | Raquel Fernández | Denis Paperno
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

2014

pdf bib
Vagueness and Learning: A Type-Theoretic Approach
Raquel Fernández | Staffan Larsson
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

pdf bib
The Role of Polarity in Inferring Acceptance and Rejection in Dialogue
Julian Schlöder | Raquel Fernández
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

pdf bib
Empirical Analysis of Aggregation Methods for Collective Annotation
Ciyang Qing | Ulle Endriss | Raquel Fernández | Justin Kruger
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

pdf bib
Collective Annotation of Linguistic Resources: Basic Principles and a Formal Model
Ulle Endriss | Raquel Fernández
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Generation of Quantified Referring Expressions: Evidence from Experimental Data
Dale Barr | Kees van Deemter | Raquel Fernández
Proceedings of the 14th European Workshop on Natural Language Generation

2012

pdf bib
Building a Corpus of Indefinite Uses Annotated with Fine-grained Semantic Functions
Maria Aloni | Andreas van Cranenburgh | Raquel Fernández | Marta Sznajder
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Natural languages possess a wealth of indefinite forms that typically differ in distribution and interpretation. Although formal semanticists have strived to develop precise meaning representations for different indefinite functions, to date there has hardly been any corpus work on the topic. In this paper, we present the results of a small corpus study where English indefinite forms ‘any’ and ‘some’ were labelled with fine-grained semantic functions well-motivated by typological studies. We developed annotation guidelines that could be used by non-expert annotators and calculated inter-annotator agreement amongst several coders. The results show that the annotation task is hard, with agreement scores ranging from 52% to 62% depending on the number of functions considered, but also that each of the independent annotations is in accordance with theoretical predictions regarding the possible distributions of indefinite functions. The resulting annotated corpus is available upon request and can be accessed through a searchable online database.

pdf bib
Towards a Flexible Semantics: Colour Terms in Collaborative Reference Tasks
Bert Baumgaertner | Raquel Fernández | Matthew Stone
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2009

pdf bib
Who is “You”? Combining Linguistic and Gaze Features to Resolve Second-Person References in Dialogue
Matthew Frampton | Raquel Fernández | Patrick Ehlen | Mario Christoudias | Trevor Darrell | Stanley Peters
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Cascaded Lexicalised Classifiers for Second-Person Reference Resolution
Matthew Purver | Raquel Fernández | Matthew Frampton | Stanley Peters
Proceedings of the SIGDIAL 2009 Conference

2008

pdf bib
Modelling and Detecting Decisions in Multi-party Dialogue
Raquel Fernández | Matthew Frampton | Patrick Ehlen | Matthew Purver | Stanley Peters
Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue

2007

pdf bib
Classifying Non-Sentential Utterances in Dialogue: A Machine Learning Approach
Raquel Fernández | Jonathan Ginzburg | Shalom Lappin
Computational Linguistics, Volume 33, Number 3, September 2007

pdf bib
An Implemented Method for Distributed Collection and Assessment of Speech Data
Alexander Siebert | David Schlangen | Raquel Fernández
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

pdf bib
Beyond Repair – Testing the Limits of the Conversational Repair System
David Schlangen | Raquel Fernández
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

pdf bib
Referring under Restricted Interactivity Conditions
Raquel Fernández | Tatjana Lucht | David Schlangen
Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue

2005

pdf bib
Using Machine Learning for Non-Sentential Utterance Classification
Raquel Fernández | Jonathan Ginzburg | Shalom Lappin
Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue

pdf bib
Scaling up from Dialogue to Multilogue: Some Principles and Benchmarks
Jonathan Ginzburg | Raquel Fernández
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

pdf bib
Classifying Ellipsis in Dialogue: A Machine Learning Approach
Raquel Fernández | Jonathan Ginzburg | Shalom Lappin
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

pdf bib
A Dynamic Logic Formalisation of the Dialogue Gameboard
Raquel Fernández
Student Research Workshop

2002

pdf bib
Non-Sentential Utterances in Dialogue: A Corpus-Based Study
Raquel Fernandez | Jonathan Ginzburg
Proceedings of the Third SIGdial Workshop on Discourse and Dialogue

pdf bib
Non-Sentential Utterances: Grammar and Dialogue Dynamics in Corpus Annotation
Raquel Fernández | Jonathan Ginzburg
COLING 2002: The 19th International Conference on Computational Linguistics
