Chris Emmery


pdf bib
BigNLI: Native Language Identification with Big Bird Embeddings
Sergey Kramp | Giovanni Cassani | Chris Emmery
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Native Language Identification (NLI) intends to classify an author’s native language based on their writing in another language. Historically, the task has heavily relied on time-consuming linguistic feature engineering, and NLI transformer models have thus far failed to offer effective, practical alternatives. The current work shows input size is a limiting factor, and that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models (for which we reproduce previous work) by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample (Europe subreddit) and out-of-domain (TOEFL-11) performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.

pdf bib
SOBR: A Corpus for Stylometry, Obfuscation, and Bias on Reddit
Chris Emmery | Marilù Miotto | Sergey Kramp | Bennett Kleinberg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Sharing textual content in the form of public posts on online platforms remains a significant part of the social web. Research on stylometric profiling suggests that despite users’ discreetness, and even under the guise of anonymity, the content and style of such posts may still reveal detailed author information. Studying how this might be inferred and obscured is relevant not only to the domain of cybersecurity, but also to those studying bias of classifiers drawing features from web corpora. While the collection of gold standard data is expensive, prior work shows that distant labels (i.e., those gathered via heuristics) offer an effective alternative. Currently, however, pre-existing corpora are limited in scope (e.g., variety of attributes and size). We present the SOBR corpus: 235M Reddit posts for which we used subreddits, flairs, and self-reports as distant labels for author attributes (age, gender, nationality, personality, and political leaning). In addition to detailing the data collection pipeline and sampling strategy, we report corpus statistics and provide a discussion on the various tasks and research avenues to be pursued using this resource. Along with the raw corpus, we provide sampled splits of the data, and suggest baselines for stylometric profiling. We close our work with a detailed set of ethical considerations relevant to the proposed lines of research.


pdf bib
Tailoring Domain Adaptation for Machine Translation Quality Estimation
Javad Pourmostafa Roshan Sharami | Dimitar Shterionov | Frédéric Blain | Eva Vanmassenhove | Mirella De Sisto | Chris Emmery | Pieter Spronck
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often lacking due to the high-cost and effort associated with labeling such data. Aside from the data scarcity challenge, QE models should also be generalizabile, i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues — data scarcity and domain mismatch — this paper combines domain adaptation and data augmentation within a robust QE system. Our method is to first train a generic QE model and then fine-tune it on a specific domain while retaining generic knowledge. Our results show a significant improvement for all the language pairs investigated, better cross-lingual inference, and a superior performance in zero-shot learning scenarios as compared to state-of-the-art baselines.

pdf bib
Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model
Chris van der Lee | Thiago Castro Ferreira | Chris Emmery | Travis J. Wiltshire | Emiel Krahmer
Computational Linguistics, Volume 49, Issue 3 - September 2023

This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system only supplemented with a language model, to two data-to-text systems that are additionally enriched by a data augmentation or a pseudo-labeling semi-supervised learning approach. Results show that semi-supervised learning results in higher scores on diversity metrics. In terms of output quality, extending the training set of a data-to-text system with a language model using the pseudo-labeling approach did increase text quality scores, but the data augmentation approach yielded similar scores to the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present.


pdf bib
Cyberbullying Classifiers are Sensitive to Model-Agnostic Perturbations
Chris Emmery | Ákos Kádár | Grzegorz Chrupała | Walter Daelemans
Proceedings of the Thirteenth Language Resources and Evaluation Conference

A limited amount of studies investigates the role of model-agnostic adversarial behavior in toxic content classification. As toxicity classifiers predominantly rely on lexical cues, (deliberately) creative and evolving language-use can be detrimental to the utility of current corpora and state-of-the-art models when they are deployed for content moderation. The less training data is available, the more vulnerable models might become. This study is, to our knowledge, the first to investigate the effect of adversarial behavior and augmentation for cyberbullying detection. We demonstrate that model-agnostic lexical substitutions significantly hurt classifier performance. Moreover, when these perturbed samples are used for augmentation, we show models become robust against word-level perturbations at a slight trade-off in overall task performance. Augmentations proposed in prior work on toxicity prove to be less effective. Our results underline the need for such evaluations in online harm areas with small corpora.


pdf bib
NeuTral Rewriter: A Rule-Based and Neural Approach to Automatic Rewriting into Gender Neutral Alternatives
Eva Vanmassenhove | Chris Emmery | Dimitar Shterionov
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent years have seen an increasing need for gender-neutral and inclusive language. Within the field of NLP, there are various mono- and bilingual use cases where gender inclusive language is appropriate, if not preferred due to ambiguity or uncertainty in terms of the gender of referents. In this work, we present a rule-based and a neural approach to gender-neutral rewriting for English along with manually curated synthetic data (WinoBias+) and natural data (OpenSubtitles and Reddit) benchmarks. A detailed manual and automatic evaluation highlights how our NeuTral Rewriter, trained on data generated by the rule-based approach, obtains word error rates (WER) below 0.18% on synthetic, in-domain and out-domain test sets.

pdf bib
Adversarial Stylometry in the Wild: Transferable Lexical Substitution Attacks on Author Profiling
Chris Emmery | Ákos Kádár | Grzegorz Chrupała
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Written language contains stylistic cues that can be exploited to automatically infer a variety of potentially sensitive author information. Adversarial stylometry intends to attack such models by rewriting an author’s text. Our research proposes several components to facilitate deployment of these adversarial attacks in the wild, where neither data nor target models are accessible. We introduce a transformer-based extension of a lexical replacement attack, and show it achieves high transferability when trained on a weakly labeled corpus—decreasing target model performance below chance. While not completely inconspicuous, our more successful attacks also prove notably less detectable by humans. Our framework therefore provides a promising direction for future privacy-preserving adversarial attacks.


pdf bib
The CACAPO Dataset: A Multilingual, Multi-Domain Dataset for Neural Pipeline and End-to-End Data-to-Text Generation
Chris van der Lee | Chris Emmery | Sander Wubben | Emiel Krahmer
Proceedings of the 13th International Conference on Natural Language Generation

This paper describes the CACAPO dataset, built for training both neural pipeline and end-to-end data-to-text language generation systems. The dataset is multilingual (Dutch and English), and contains almost 10,000 sentences from human-written news texts in the sports, weather, stocks, and incidents domain, together with aligned attribute-value paired data. The dataset is unique in that the linguistic variation and indirect ways of expressing data in these texts reflect the challenges of real world NLG tasks.


pdf bib
Style Obfuscation by Invariance
Chris Emmery | Enrique Manjavacas Arevalo | Grzegorz Chrupała
Proceedings of the 27th International Conference on Computational Linguistics

The task of obfuscating writing style using sequence models has previously been investigated under the framework of obfuscation-by-transfer, where the input text is explicitly rewritten in another style. A side effect of this framework are the frequent major alterations to the semantic content of the input. In this work, we propose obfuscation-by-invariance, and investigate to what extent models trained to be explicitly style-invariant preserve semantics. We evaluate our architectures in parallel and non-parallel settings, and compare automatic and human evaluations on the obfuscated sentences. Our experiments show that the performance of a style classifier can be reduced to chance level, while the output is evaluated to be of equal quality to models applying style-transfer. Additionally, human evaluation indicates a trade-off between the level of obfuscation and the observed quality of the output in terms of meaning preservation and grammaticality.


pdf bib
Simple Queries as Distant Labels for Predicting Gender on Twitter
Chris Emmery | Grzegorz Chrupała | Walter Daelemans
Proceedings of the 3rd Workshop on Noisy User-generated Text

The majority of research on extracting missing user attributes from social media profiles use costly hand-annotated labels for supervised learning. Distantly supervised methods exist, although these generally rely on knowledge gathered using external sources. This paper demonstrates the effectiveness of gathering distant labels for self-reported gender on Twitter using simple queries. We confirm the reliability of this query heuristic by comparing with manual annotation. Moreover, using these labels for distant supervision, we demonstrate competitive model performance on the same data as models trained on manual annotations. As such, we offer a cheap, extensible, and fast alternative that can be employed beyond the task of gender classification.


pdf bib
Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource
Stéphan Tulkens | Chris Emmery | Walter Daelemans
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Word embeddings have recently seen a strong increase in interest as a result of strong performance gains on a variety of tasks. However, most of this research also underlined the importance of benchmark datasets, and the difficulty of constructing these for a variety of language-specific tasks. Still, many of the datasets used in these tasks could prove to be fruitful linguistic resources, allowing for unique observations into language use and variability. In this paper we demonstrate the performance of multiple types of embeddings, created with both count and prediction-based architectures on a variety of corpora, in two language-specific tasks: relation evaluation, and dialect identification. For the latter, we compare unsupervised methods with a traditional, hand-crafted dictionary. With this research, we provide the embeddings themselves, the relation evaluation task benchmark for use in further research, and demonstrate how the benchmarked embeddings prove a useful unsupervised linguistic resource, effectively used in a downstream task.


pdf bib
The Development of Dutch and Afrikaans Language Resources for Compound Boundary Analysis.
Menno van Zaanen | Gerhard van Huyssteen | Suzanne Aussems | Chris Emmery | Roald Eiselen
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In most languages, new words can be created through the process of compounding, which combines two or more words into a new lexical unit. Whereas in languages such as English the components that make up a compound are separated by a space, in languages such as Finnish, German, Afrikaans and Dutch these components are concatenated into one word. Compounding is very productive and leads to practical problems in developing machine translators and spelling checkers, as newly formed compounds cannot be found in existing lexicons. The Automatic Compound Processing (AuCoPro) project deals with the analysis of compounds in two closely-related languages, Afrikaans and Dutch. In this paper, we present the development and evaluation of two datasets, one for each language, that contain compound words with annotated compound boundaries. Such datasets can be used to train classifiers to identify the compound components in novel compounds. We describe the process of annotation and provide an overview of the annotation guidelines as well as global properties of the datasets. The inter-rater agreements between the annotators are considered highly reliable. Furthermore, we show the usability of these datasets by building an initial automatic compound boundary detection system, which assigns compound boundaries with approximately 90% accuracy.