Dong Nguyen


2024

What’s Mine becomes Yours: Defining, Annotating and Detecting Context-Dependent Paraphrases in News Interview Dialogs
Anna Wegmann | Tijs A. van den Broek | Dong Nguyen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Best practices for high-conflict conversations, such as counseling or customer support, almost always include recommendations to paraphrase the previous speaker. Although paraphrase classification has received widespread attention in NLP, paraphrases are usually considered independent of context, and common models and datasets are not applicable to dialog settings. In this work, we investigate paraphrases across turns in dialog (e.g., Speaker 1: “That book is mine.” becomes Speaker 2: “That book is yours.”). We provide an operationalization of context-dependent paraphrases and develop a training procedure for crowd workers to classify paraphrases in dialog. We introduce ContextDeP, a dataset with utterance pairs from NPR and CNN news interviews annotated for context-dependent paraphrases. To enable analyses of label variation, the dataset contains 5,581 annotations on 600 utterance pairs. We present promising results with in-context learning and with token classification models for automatic paraphrase detection in dialog.
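
A minimal sketch of the token-classification framing mentioned in the abstract, using Hugging Face Transformers; the base model, label scheme, and example pair are illustrative assumptions rather than the paper’s exact configuration:

```python
# Hypothetical token-classification framing (untrained head, shown only to
# illustrate the input/output shapes): mark which tokens of the second
# utterance paraphrase (parts of) the first.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = other, 1 = part of a paraphrase
)

utt1 = "That book is mine."
utt2 = "That book is yours."

# Encode the turn pair jointly so the model sees the dialog context.
inputs = tokenizer(utt1, utt2, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits       # shape: (1, seq_len, 2)
print(logits.argmax(dim=-1)[0].tolist())  # per-token paraphrase labels
```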

Detecting Perspective-Getting in Wikipedia Discussions
Evgeny Vasilets | Tijs van den Broek | Anna Wegmann | David Abadi | Dong Nguyen
Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS 2024)

Perspective-getting (i.e., the effort to obtain information about the other person’s perspective) can lead to more accurate interpersonal understanding. In this paper, we develop an approach to measure perspective-getting and apply it to English Wikipedia discussions. First, we develop a codebook based on perspective-getting theory to operationalize perspective-getting into two categories: asking questions about and attending to the other’s perspective. Second, we use the codebook to annotate perspective-getting in Wikipedia discussion pages. Third, we fine-tune a RoBERTa model that achieves an average F1 score of 0.76 on the two perspective-getting categories. Last, we test whether perspective-getting is associated with discussion outcomes. Perspective-getting was not higher in non-escalated discussions. However, discussions starting with a post attending to the other’s perspective are followed by responses that are more likely to also attend to the other’s perspective. Future research may use our model to study the influence of perspective-getting on the dynamics and outcomes of online discussions.
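
A minimal sketch of one fine-tuning step for a RoBERTa classifier over such annotations; the label names, toy example, and hyperparameters are assumptions for illustration, not the paper’s exact setup:

```python
# One plausible fine-tuning step for a RoBERTa classifier over the two
# perspective-getting categories; labels, example, and hyperparameters are
# illustrative assumptions.
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

LABELS = ["asking_about_perspective", "attending_to_perspective"]
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["Could you explain why you reverted my edit?"]  # toy example
labels = torch.tensor([0])  # a question about the other's perspective

enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
optimizer.zero_grad()
loss = model(**enc, labels=labels).loss
loss.backward()
optimizer.step()
print(float(loss))
```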

2023

Epicurus at SemEval-2023 Task 4: Improving Prediction of Human Values behind Arguments by Leveraging Their Definitions
Christian Fang | Qixiang Fang | Dong Nguyen
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

We describe our experiments for SemEval-2023 Task 4 on the identification of human values behind arguments (ValueEval). Because human values are subjective concepts which require precise definitions, we hypothesize that incorporating the definitions of human values (in the form of annotation instructions and validated survey items) during model training can yield better prediction performance. We explore this idea and show that our proposed models perform better than the challenge organizers’ baselines, with improvements in macro F1 scores of up to 18%.
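
The core idea of conditioning on definitions can be illustrated with a simple input-construction step; the template, value name, and definition text below are placeholders, not the task’s actual data:

```python
# Toy input construction: condition the classifier on a value's definition by
# prepending it to the argument. Template, value, and definition text are
# placeholders, not the actual task data.
def build_input(argument: str, value: str, definition: str) -> str:
    return f"Value: {value}. Definition: {definition} Argument: {argument}"

print(build_input(
    argument="We should ban advertising aimed at children.",
    value="Benevolence: caring",
    definition="Working for the welfare of other people.",
))
```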

Measuring the Instability of Fine-Tuning
Yupei Du | Dong Nguyen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Fine-tuning pre-trained language models on downstream tasks with varying random seeds has been shown to be unstable, especially on small datasets. Many previous studies have investigated this instability and proposed methods to mitigate it. However, most of these studies used only the standard deviation of performance scores (SD) as their measure, which is a narrow characterization of instability. In this paper, we analyze SD and six other measures that quantify instability at different levels of granularity. Moreover, we propose a systematic framework for evaluating the validity of these measures. Finally, we analyze the consistency of, and differences between, the measures by reassessing existing instability mitigation methods. We hope our results will inform better measurement of fine-tuning instability.
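
The coarsest of these measures, SD across random seeds, can be computed directly; the accuracy scores below are invented for illustration:

```python
# SD across seeds, the baseline instability measure; the accuracies are
# invented for illustration.
import statistics

accuracy_per_seed = [0.78, 0.91, 0.55, 0.88, 0.83]  # one fine-tuning run per seed
print(f"SD across seeds: {statistics.stdev(accuracy_per_seed):.3f}")
```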

2022

Same Author or Just Same Topic? Towards Content-Independent Style Representations
Anna Wegmann | Marijn Schraagen | Dong Nguyen
Proceedings of the 7th Workshop on Representation Learning for NLP

Linguistic style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV): Do two texts have the same author? The assumption underlying the AV training task (same author approximates same writing style) enables self-supervised and, thus, extensive training. However, good performance on the AV task does not ensure good “general-purpose” style representations. For example, as the same author might typically write about certain topics, representations trained on AV might also encode content information instead of style alone. We introduce a variation of the AV training task that controls for content using conversation or domain labels. We evaluate whether known style dimensions are represented, and preferred over content information, through an original variation of the recently proposed STEL framework. We find that representations trained by controlling for conversation are better at representing style independent of content than representations trained with domain or no content control.
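
One plausible reconstruction of the conversation-controlled pair sampling (the toy corpus and sampling details below are assumptions, not the paper’s exact procedure):

```python
# Plausible reconstruction of conversation-controlled pair sampling: the
# different-author candidate must come from the same conversation as the
# same-author candidate, so topic alone cannot separate them.
import random

# toy corpus: (author, conversation_id, text)
utterances = [
    ("alice", "c1", "I think the ruling was fair."),
    ("alice", "c2", "imo the ruling was fair tbh"),
    ("bob",   "c2", "The ruling seems fair to me as well."),
]

anchor = utterances[0]
same_author = [u for u in utterances if u[0] == anchor[0] and u != anchor]
other_author = [u for u in utterances if u[0] != anchor[0]]

positive = random.choice(same_author)
# control for content: negative shares a conversation with the positive
negative = random.choice([u for u in other_author if u[1] == positive[1]])
print(anchor, positive, negative, sep="\n")
```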

Template-based Abstractive Microblog Opinion Summarization
Iman Munire Bilal | Bo Wang | Adam Tsakalidis | Dong Nguyen | Rob Procter | Maria Liakata
Transactions of the Association for Computational Linguistics, Volume 10

We introduce the task of microblog opinion summarization (MOS) and share a dataset of 3,100 gold-standard opinion summaries to facilitate research in this domain. The dataset contains summaries of tweets spanning a 2-year period and covers more topics than any other public Twitter summarization dataset. Summaries are abstractive in nature and have been created by journalists skilled in summarizing news articles following a template separating factual information (main story) from author opinions. Our method differs from previous work on generating gold-standard summaries from social media, which usually involves selecting representative posts and thus favors extractive summarization models. To showcase the dataset’s utility and challenges, we benchmark a range of abstractive and extractive state-of-the-art summarization models and achieve good performance, with the former outperforming the latter. We also show that fine-tuning is necessary to improve performance and investigate the benefits of using different sample sizes.

2021

HateCheck: Functional Tests for Hate Speech Detection Models
Paul Röttger | Bertie Vidgen | Dong Nguyen | Zeerak Waseem | Helen Margetts | Janet Pierrehumbert
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.
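
The functional-test idea can be sketched in miniature as below; the two test cases, the functionality names, and the keyword stand-in classifier are illustrative, not HateCheck’s actual data:

```python
# Functional tests in miniature: each functionality pairs templated cases
# with gold labels, and a model passes by predicting all of them correctly.
# Cases and the stand-in classifier are illustrative, not HateCheck's data.
def classify(text: str) -> str:
    return "hateful" if "hate" in text.lower() else "non-hateful"  # stand-in

functional_tests = {
    "explicit strong negative emotion": [
        ("I hate women.", "hateful"),
    ],
    "denouncement that quotes hate": [
        ("Saying 'I hate women' is despicable.", "non-hateful"),
    ],
}

for name, cases in functional_tests.items():
    passed = sum(classify(text) == gold for text, gold in cases)
    print(f"{name}: {passed}/{len(cases)} passed")
```

Note how the keyword stand-in passes the first test but fails the quoted denouncement, exactly the kind of targeted weakness such functional tests are designed to surface.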

On learning and representing social meaning in NLP: a sociolinguistic perspective
Dong Nguyen | Laura Rosseel | Jack Grieve
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The field of NLP has made substantial progress in building meaning representations. However, an important aspect of linguistic meaning, social meaning, has been largely overlooked. We introduce the concept of social meaning to NLP and discuss how insights from sociolinguistics can inform work on representation learning in NLP. We also identify key challenges for this new line of research.

Introducing CAD: the Contextual Abuse Dataset
Bertie Vidgen | Dong Nguyen | Helen Margetts | Patricia Rossini | Rebekah Tromble
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Online abuse can inflict harm on users and communities, making online spaces unsafe and toxic. Progress in automatically detecting and classifying abusive content is often held back by the lack of high quality and detailed datasets. We introduce a new dataset of primarily English Reddit entries which addresses several limitations of prior work. It (1) contains six conceptually distinct primary categories as well as secondary categories, (2) has labels annotated in the context of the conversation thread, (3) contains rationales and (4) uses an expert-driven group-adjudication process for high quality annotations. We report several baseline models to benchmark the work of future researchers. The annotated dataset, annotation guidelines, models and code are freely available.

Does It Capture STEL? A Modular, Similarity-based Linguistic Style Evaluation Framework
Anna Wegmann | Dong Nguyen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Style is an integral part of natural language. However, evaluation methods for style measures are rare, often task-specific and usually do not control for content. We propose the modular, fine-grained and content-controlled similarity-based STyle EvaLuation framework (STEL) to test the performance of any model that can compare two sentences on style. We illustrate STEL with two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contrac’tion and numb3r substitution). We find that BERT-based methods outperform simple versions of commonly used style measures like 3-grams, punctuation frequency and LIWC-based approaches. We invite the addition of further tasks and task instances to STEL and hope to facilitate the improvement of style-sensitive measures.
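
A simplified sketch of a STEL-like task instance follows; the trivial apostrophe-counting “style measure” only illustrates the task interface (matching alternatives to anchors by style while content is held constant), not a serious model:

```python
# A simplified STEL-style task instance: two alternatives share content but
# differ in style; a style measure should match each to the right anchor.
def style_similarity(a: str, b: str) -> float:
    def n_contractions(s): return s.count("'")
    return -abs(n_contractions(a) - n_contractions(b))  # toy stand-in measure

anchor_informal = "can't wait, it's gonna be great"
anchor_formal = "I cannot wait; it is going to be great."
s1 = "I can't believe it's over."     # informal variant
s2 = "I cannot believe it is over."   # formal variant, same content

# correct assignment: s1 with the informal anchor, s2 with the formal one
correct = (style_similarity(anchor_informal, s1)
           + style_similarity(anchor_formal, s2))
swapped = (style_similarity(anchor_informal, s2)
           + style_similarity(anchor_formal, s1))
print("task solved:", correct > swapped)
```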

Assessing the Reliability of Word Embedding Gender Bias Measures
Yupei Du | Qixiang Fang | Dong Nguyen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Various measures have been proposed to quantify human-like social biases in word embeddings. However, bias scores based on these measures can suffer from measurement error. One indication of measurement quality is reliability, concerning the extent to which a measure produces consistent results. In this paper, we assess three types of reliability of word embedding gender bias measures, namely test-retest reliability, inter-rater consistency and internal consistency. Specifically, we investigate the consistency of bias scores across different choices of random seeds, scoring rules and words. Furthermore, we analyse the effects of various factors on these measures’ reliability scores. Our findings inform better design of word embedding gender bias measures. Moreover, we urge researchers to be more critical about the application of such measures.
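
Test-retest reliability, the first of the three reliability types, can be sketched as a correlation between bias scores from runs that differ only in random seed; all numbers below are invented:

```python
# Test-retest reliability in miniature: correlate per-word bias scores from
# two embedding runs that differ only in random seed. All numbers invented.
from scipy.stats import pearsonr

bias_seed_1 = [0.31, -0.22, 0.18, -0.05]  # e.g., nurse, engineer, teacher, doctor
bias_seed_2 = [0.28, -0.25, 0.22, 0.01]

r, _ = pearsonr(bias_seed_1, bias_seed_2)
print(f"test-retest reliability (Pearson r): {r:.2f}")
```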

2020

Do Word Embeddings Capture Spelling Variation?
Dong Nguyen | Jack Grieve
Proceedings of the 28th International Conference on Computational Linguistics

Analyses of word embeddings have primarily focused on semantic and syntactic properties. However, word embeddings have the potential to encode other properties as well. In this paper, we propose a new perspective on the analysis of word embeddings by focusing on spelling variation. In social media, spelling variation is abundant and often socially meaningful. Here, we analyze word embeddings trained on Twitter and Reddit data. We present three analyses using pairs of word forms covering seven types of spelling variation in English. Taken together, our results show that word embeddings encode spelling variation patterns of various types to some extent, even embeddings trained using the skipgram model, which does not take spelling into account. Our results also suggest a link between the intentionality of the variation and the distance of the non-conventional spellings to their conventional spellings.
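
A sketch of the basic comparison behind such analyses: the embedding similarity between a non-conventional spelling and its conventional form. The vector file path is a placeholder for any word2vec-format embeddings trained on social media text:

```python
# Compare non-conventional spellings to their conventional forms in a given
# embedding space. "twitter_vectors.bin" is a placeholder for any
# word2vec-format vectors trained on social media text.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("twitter_vectors.bin", binary=True)

for conventional, variant in [("tonight", "2nite"), ("you", "u"), ("going", "goin")]:
    if conventional in vectors and variant in vectors:
        print(conventional, variant, vectors.similarity(conventional, variant))
```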

tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection
Nicole Peinelt | Dong Nguyen | Maria Liakata
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Semantic similarity detection is a fundamental task in natural language understanding. Adding topic information has been useful for previous feature-engineered semantic similarity models as well as neural models for other tasks. There is currently no standard way of combining topics with pretrained contextual representations such as BERT. We propose a novel topic-informed BERT-based architecture for pairwise semantic similarity detection and show that our model improves performance over strong neural baselines across a variety of English language datasets. We find that the addition of topics to BERT helps particularly with resolving domain-specific cases.
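
The general idea of combining topics with BERT can be sketched in miniature; the dimensions and the random stand-in topic mixtures below are assumptions, not the paper’s architecture details:

```python
# The general tBERT idea in miniature: concatenate a sentence pair's BERT
# [CLS] vector with topic features for the pair, then classify.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
n_topics = 10
classifier = nn.Linear(bert.config.hidden_size + 2 * n_topics, 2)

enc = tokenizer("How do I reset my password?",
                "Recovering a forgotten password", return_tensors="pt")
with torch.no_grad():
    cls = bert(**enc).last_hidden_state[:, 0]  # (1, hidden_size) [CLS] vector
topics = torch.rand(1, 2 * n_topics)           # stand-in topic mixtures for the pair
print(classifier(torch.cat([cls, topics], dim=-1)))
```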

2019

Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
Nicole Peinelt | Maria Liakata | Dong Nguyen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Existing datasets for scoring text pairs in terms of semantic similarity contain instances that vary in how difficult they are to resolve. This paper proposes to distinguish obvious from non-obvious text pairs based on superficial lexical overlap and ground-truth labels. We characterise existing datasets in terms of the difficult cases they contain and find that recently proposed models struggle to capture the non-obvious cases of semantic similarity. We describe metrics that emphasise cases of similarity which require more complex inference and propose that these be used for evaluating systems for semantic similarity.
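
A sketch of one way to operationalize the obvious/non-obvious split from lexical overlap and gold labels; the Jaccard measure and the 0.5 threshold are illustrative choices:

```python
# One way to operationalize the split: combine surface lexical overlap with
# the gold label. Jaccard overlap and the 0.5 threshold are illustrative.
def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

pair = ("How far is the moon?", "What is the distance to the moon?")
label = 1  # gold: semantically similar
overlap = jaccard(*pair)
if (overlap >= 0.5) == bool(label):
    kind = "obvious"      # high overlap & positive, or low overlap & negative
else:
    kind = "non-obvious"  # low overlap & positive, or high overlap & negative
print(f"overlap={overlap:.2f} -> {kind}")
```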

Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings
Philippa Shoemark | Farhana Ferdousi Liza | Dong Nguyen | Scott Hale | Barbara McGillivray
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Word embeddings are increasingly used for the automatic detection of semantic change; yet, a robust evaluation and systematic comparison of the choices involved has been lacking. We propose a new evaluation framework for semantic change detection and find that (i) using the whole time series is preferable to only comparing the first and last time points; (ii) independently trained and aligned embeddings perform better than continuously trained embeddings for long time periods; and (iii) the reference point for comparison matters. We also present an analysis of the changes detected on a large Twitter dataset spanning 5.5 years.
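
A sketch of the “independently trained and aligned” setting using orthogonal Procrustes alignment, one common implementation of this idea; the random matrices stand in for real embedding spaces with a shared vocabulary:

```python
# Align independently trained embedding spaces with orthogonal Procrustes
# before comparing a word's vectors across time.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
E_t1 = rng.normal(size=(1000, 100))  # stand-in embedding matrices for two
E_t2 = rng.normal(size=(1000, 100))  # time slices, rows = shared vocabulary

R, _ = orthogonal_procrustes(E_t2, E_t1)  # rotation mapping t2 onto t1
E_t2_aligned = E_t2 @ R

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# change score for the word at row i: low similarity across time = more change
i = 42
print(1 - cosine(E_t1[i], E_t2_aligned[i]))
```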

Challenges and frontiers in abusive content detection
Bertie Vidgen | Alex Harris | Dong Nguyen | Rebekah Tromble | Scott Hale | Helen Margetts
Proceedings of the Third Workshop on Abusive Language Online

Online abusive content detection is an inherently difficult task. It has received considerable attention from academia, particularly within the computational linguistics community, and performance appears to have improved as the field has matured. However, considerable challenges and unaddressed frontiers remain, spanning technical, social and ethical dimensions. These issues constrain the performance, efficiency and generalizability of abusive content detection systems. In this article we delineate and clarify the main challenges and frontiers in the field, critically evaluate their implications and discuss potential solutions. We also highlight ways in which social scientific insights can advance research. We discuss the lack of support given to researchers working with abusive content and provide guidelines for ethical research.

2018

Comparing Automatic and Human Evaluation of Local Explanations for Text Classification
Dong Nguyen
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Text classification models are becoming increasingly complex and opaque; however, for many applications it is essential that the models are interpretable. Recently, a variety of approaches have been proposed for generating local explanations. While robust evaluations are needed to drive further progress, so far it is unclear which evaluation approaches are suitable. This paper is a first step towards more robust evaluations of local explanations. We evaluate a variety of local explanation approaches using automatic measures based on word deletion. Furthermore, we show that an evaluation using a crowdsourcing experiment correlates moderately with these automatic measures and that a variety of other factors also impact the human judgements.
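
A sketch of a word-deletion measure: delete the k words ranked highest by the explanation and record the drop in model confidence. The toy classifier and explanation weights below are stand-ins:

```python
# Word-deletion evaluation in miniature: remove the k highest-ranked words
# from an explanation and record how much the model's confidence drops.
def predict_proba(text: str) -> float:
    return min(1.0, 0.2 + 0.4 * text.lower().count("great"))  # toy classifier

text = "a great film with a great cast"
explanation = {"great": 0.9, "film": 0.3, "cast": 0.2, "a": 0.0, "with": 0.0}

k = 1
top_words = sorted(explanation, key=explanation.get, reverse=True)[:k]
deleted = " ".join(w for w in text.split() if w not in top_words)
drop = predict_proba(text) - predict_proba(deleted)
print(f"confidence drop after deleting {top_words}: {drop:.2f}")
```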

2017

A Kernel Independence Test for Geographical Language Variation
Dong Nguyen | Jacob Eisenstein
Computational Linguistics, Volume 43, Issue 3 - September 2017

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: Some approaches apply only to frequencies, others to Boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.
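
A generic HSIC-style statistic conveys the flavor of such a kernel test; this is the standard biased HSIC estimator over stand-in data, not necessarily the paper’s exact statistic, and significance would come from a permutation test:

```python
# Generic HSIC-style statistic: kernel matrices over geographic and
# linguistic observations, centered and compared.
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def hsic(K, L):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
geo = rng.uniform(size=(50, 2))   # stand-in lat/lon per observation
ling = rng.uniform(size=(50, 1))  # stand-in linguistic variable per observation
print(hsic(rbf_kernel(geo), rbf_kernel(ling)))
```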

2016

Automatic Detection of Intra-Word Code-Switching
Dong Nguyen | Leonie Cornips
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Computational Sociolinguistics: A Survey
Dong Nguyen | A. Seza Doğruöz | Carolyn P. Rosé | Franciska de Jong
Computational Linguistics, Volume 42, Issue 3 - September 2016

2015

#SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns
Dong Nguyen | Tijs van den Broek | Claudia Hauff | Djoerd Hiemstra | Michel Ehrenhard
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

On the Impact of Twitter-based Health Campaigns: A Cross-Country Analysis of Movember
Nugroho Dwi Prasetyo | Claudia Hauff | Dong Nguyen | Tijs van den Broek | Djoerd Hiemstra
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

Predicting Code-switching in Multilingual Communication for Immigrant Communities
Evangelos Papalexakis | Dong Nguyen | A. Seza Doğruöz
Proceedings of the First Workshop on Computational Approaches to Code Switching

Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
Dong Nguyen | Dolf Trieschnigg | A. Seza Doğruöz | Rilana Gravel | Mariët Theune | Theo Meder | Franciska de Jong
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

TweetGenie: Development, Evaluation, and Lessons Learned
Dong Nguyen | Dolf Trieschnigg | Theo Meder
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

2013

Word Level Language Identification in Online Multilingual Communication
Dong Nguyen | A. Seza Doğruöz
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Learning to Extract Folktale Keywords
Dolf Trieschnigg | Dong Nguyen | Mariët Theune
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2011

Language use as a reflection of socialization in online communities
Dong Nguyen | Carolyn P. Rosé
Proceedings of the Workshop on Language in Social Media (LSM 2011)

Author Age Prediction from Text using Linear Regression
Dong Nguyen | Noah A. Smith | Carolyn P. Rosé
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities