Natalie Schluter - ACL Anthology

Natalie Schluter

2025

Training Bilingual LMs with Data Constraints in the Targeted Language
Skyler Seto | Maartje Ter Hoeve | Richard He Bai | Natalie Schluter | David Grangier
Findings of the Association for Computational Linguistics: ACL 2025

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high quality pretraining data is unavailable. In this work, we study how to boost pretrained model performance in a target language with insufficient pretraining data for training a high performing language model by enlisting data from an auxiliary language for which high quality data is available. We study this by quantifying the performance gap between training with data in a data-rich auxiliary language compared with training in the target language, exploring the benefits of translation systems, studying the limitations of model scaling when data is limited in the target languages, and proposing new methods for upsampling data from the auxiliary language. Our results show that stronger auxiliary datasets result in performance gains without modification to the model or training objective for close languages, and, in particular, that performance gains due to the development of more information-rich English pretraining datasets can extend to targeted language settings with limited data.

Discriminating Form and Meaning in Multilingual Models with Minimal-Pair ABX Tasks
Maureen de Seyssel | Jie Chi | Skyler Seto | Maartje Ter Hoeve | Masha Fedzechkina | Natalie Schluter
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce a set of training-free ABX-style discrimination tasks to evaluate how multilingual language models represent language identity (form) and semantic content (meaning). Inspired from speech processing, these zero-shot tasks measure whether minimal differences in representation can be reliably detected. This offers a flexible and interpretable alternative to probing. Applied to XLM-R (Conneau et al, 2020) across pretraining checkpoints and layers, we find that language discrimination declines over training and becomes concentrated in lower layers, while meaning discrimination strengthens over time and stabilizes in deeper layers. We then explore probing tasks, showing some alignment between our metrics and linguistic learning performance.Our results position ABX tasks as a lightweight framework for analyzing the structure of multilingual representations.

Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues
David Sasu | Zehui Wu | Ziwei Gong | Run Chen | Pengyuan Shi | Lin Ai | Julia Hirschberg | Natalie Schluter
Findings of the Association for Computational Linguistics: ACL 2025

In this paper, we introduce the Akan Cinematic Emotions (AkaCE) dataset, the first multimodal emotion dialogue dataset for an African language, addressing the significant lack of resources for low-resource languages in emotion recognition research. AkaCE, developed for the Akan language, contains 385 emotion-labeled dialogues and 6162 utterances across audio, visual, and textual modalities, along with word-level prosodic prominence annotations. The presence of prosodic labels in this dataset also makes it the first prosodically annotated African language dataset. We demonstrate the quality and utility of AkaCE through experiments using state-of-the-art emotion recognition methods, establishing solid baselines for future research. We hope AkaCE inspires further work on inclusive, linguistically and culturally diverse NLP resources.

The Role of Prosody in Spoken Question Answering
Jie Chi | Maureen de Seyssel | Natalie Schluter
Findings of the Association for Computational Linguistics: NAACL 2025

Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text, which is then subsequently synthesized into speech, and most models typically rely on automatic transcriptions of speech. This is to the detriment of prosody–additional information carried by the speech signal beyond the phonetics of the words themselves and difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of natural speech, we demonstrate that models trained on prosodic information alone can perform reasonably well by utilizing prosodic cues. However, we find that when lexical information is available, models tend to predominantly rely on it. Our findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are required to ensure prosody contributes more significantly alongside lexical features.

Does Context Matter? A Prosodic Comparison of English and Spanish in Monolingual and Multilingual Discourse Settings
Debasmita Bhattacharya | David Sasu | Michela Marchini | Natalie Schluter | Julia Hirschberg
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Different languages are known to have typical and distinctive prosodic profiles. However, the majority of work on prosody across languages has been restricted to monolingual discourse contexts. We build on prior studies by asking: how does the nature of the discourse context influence variations in the prosody of monolingual speech? To answer this question, we compare the prosody of spontaneous, conversational monolingual English and Spanish both in monolingual and in multilingual speech settings. For both languages, we find that monolingual speech produced in a monolingual context is prosodically different from that produced in a multilingual context, with more marked differences having increased proximity to multilingual discourse. Our work is the first to incorporate multilingual discourse contexts into the study of native-level monolingual prosody, and has potential downstream applications for the recognition and synthesis of multilingual speech.

GrammaMT: Improving Machine Translation with Grammar-Informed In-Context Learning
Rita Ramos | Everlyn Asiko Chimoto | Maartje Ter Hoeve | Natalie Schluter
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce GrammaMT, a grammatically-aware prompting approach for machine translation that uses Interlinear Glossed Text (IGT), a common form of linguistic description providing morphological and lexical annotations for source sentences. GrammaMT proposes three prompting strategies: gloss-shot, chain-gloss and model-gloss. All are training-free, requiring only a few examples that involve minimal effort to collect, and making them well-suited for low-resource setups. Experiments show that GrammaMT enhances translation performance on open-source instruction-tuned LLMs for various low- to high-resource languages across three benchmarks: (1) the largest IGT corpus, (2) the challenging 2023 SIGMORPHON Shared Task data over endangered languages, and (3) even in an out-of-domain setting with FLORES. Moreover, ablation studies reveal that leveraging gloss resources could substantially boost MT performance (by over 17 BLEU points) if LLMs accurately generate or access input sentence glosses.

2024

Construction of Paired Knowledge Graph - Text Datasets Informed by Cyclic Evaluation
Ali Mousavi | Xin Zhan | He Bai | Peng Shi | Theodoros Rekatsinas | Benjamin Han | Yunyao Li | Jeffrey Pound | Joshua M. Susskind | Natalie Schluter | Ihab F. Ilyas | Navdeep Jaitly
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate source KG or text is a proxy for the equivalence between the KG and the text in the dataset. Using cyclic evaluation we find that manually created WebNLG is much better than automatically created TeKGen and T-REx. Informed by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text and show the impact of each of the heuristics on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs), and observe that these are conducive to models that perform significantly well on cyclic generation of text, but less so on cyclic generation of KGs, probably because of a lack of a consistent underlying ontology.

2022

Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)
Guy Emerson | Natalie Schluter | Gabriel Stanovsky | Ritesh Kumar | Alexis Palmer | Nathan Schneider | Siddharth Singh | Shyam Ratan
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

2021

Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
Alexis Palmer | Nathan Schneider | Natalie Schluter | Guy Emerson | Aurelie Herbelot | Xiaodan Zhu
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

MassiveSumm: a very large-scale, very multilingual, news summarisation dataset
Daniel Varab | Natalie Schluter
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Current research in automatic summarisation is unapologetically anglo-centered–a persistent state-of-affairs, which also predates neural net approaches. High-quality automatic summarisation datasets are notoriously expensive to create, posing a challenge for any language. However, with digitalisation, archiving, and social media advertising of newswire articles, recent work has shown how, with careful methodology application, large-scale datasets can now be simply gathered instead of written. In this paper, we present a large-scale multilingual summarisation dataset containing articles in 92 languages, spread across 28.8 million articles, in more than 35 writing scripts. This is both the largest, most inclusive, existing automatic summarisation dataset, as well as one of the largest, most inclusive, ever published datasets for any NLP task. We present the first investigation on the efficacy of resource building from news platforms in the low-resource language setting. Finally, we provide some first insight on how low-resource language settings impact state-of-the-art automatic summarisation system performance.

2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Dan Jurafsky | Joyce Chai | Natalie Schluter | Joel Tetreault
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

DaNewsroom: A Large-scale Danish Summarisation Dataset
Daniel Varab | Natalie Schluter
Proceedings of the Twelfth Language Resources and Evaluation Conference

Dataset development for automatic summarisation systems is notoriously English-oriented. In this paper we present the first large-scale non-English language dataset specifically curated for automatic summarisation. The document-summary pairs are news articles and manually written summaries in the Danish language. There has previously been no work done to establish a Danish summarisation dataset, nor any published work on the automatic summarisation of Danish. We provide therefore the first automatic summarisation dataset for the Danish language (large-scale or otherwise). To support the comparison of future automatic summarisation systems for Danish, we include system performance on this dataset of strong well-established unsupervised baseline systems, together with an oracle extractive summariser, which is the first account of automatic summarisation system performance for Danish. Finally, we make all code for automatically acquiring the data freely available and make explicit how this technology can easily be adapted in order to acquire automatic summarisation datasets for further languages.

2019

Recurrent models and lower bounds for projective syntactic decoding
Natalie Schluter
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The current state-of-the-art in neural graph-based parsing uses only approximate decoding at the training phase. In this paper aim to understand this result better. We show how recurrent models can carry out projective maximum spanning tree decoding. This result holds for both current state-of-the-art models for shift-reduce and graph-based parsers, projective or not. We also provide the first proof on the lower bounds of projective maximum spanning tree decoding.

The Lacunae of Danish Natural Language Processing
Andreas Kirkedal | Barbara Plank | Leon Derczynski | Natalie Schluter
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Danish is a North Germanic language spoken principally in Denmark, a country with a long tradition of technological and scientific innovation. However, the language has received relatively little attention from a technological perspective. In this paper, we review Natural Language Processing (NLP) research, digital resources and tools which have been developed for Danish. We find that availability of models and tools is limited, which calls for work that lifts Danish NLP a step closer to the privileged languages. Dansk abstrakt: Dansk er et nordgermansk sprog, talt primært i kongeriget Danmark, et land med stærk tradition for teknologisk og videnskabelig innovation. Det danske sprog har imidlertid været genstand for relativt begrænset opmærksomhed, teknologisk set. I denne artikel gennemgår vi sprogteknologi-forskning, -ressourcer og -værktøjer udviklet for dansk. Vi konkluderer at der eksisterer et fåtal af modeller og værktøjer, hvilket indbyder til forskning som løfter dansk sprogteknologi i niveau med mere priviligerede sprog.

UniParse: A universal graph-based parsing toolkit
Daniel Varab | Natalie Schluter
Proceedings of the 22nd Nordic Conference on Computational Linguistics

This paper describes the design and use of the graph-based parsing framework and toolkit UniParse, released as an open-source python software package. UniParse as a framework novelly streamlines research prototyping, development and evaluation of graph-based dependency parsing architectures. UniParse does this by enabling highly efficient, sufficiently independent, easily readable, and easily extensible implementations for all dependency parser components. We distribute the toolkit with ready-made configurations as re-implementations of all current state-of-the-art first-order graph-based parsers, including even more efficient Cython implementations of both encoders and decoders, as well as the required specialised loss functions.

2018

When data permutations are pathological: the case of neural natural language inference
Natalie Schluter | Daniel Varab
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Consider two competitive machine learning models, one of which was considered state-of-the art, and the other a competitive baseline. Suppose that by just permuting the examples of the training set, say by reversing the original order, by shuffling, or by mini-batching, you could report substantially better/worst performance for the system of your choice, by multiple percentage points. In this paper, we illustrate this scenario for a trending NLP task: Natural Language Inference (NLI). We show that for the two central NLI corpora today, the learning process of neural systems is far too sensitive to permutations of the data. In doing so we reopen the question of how to judge a good neural architecture for NLI, given the available dataset and perhaps, further, the soundness of the NLI task itself in its current state.

The glass ceiling in NLP
Natalie Schluter
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this paper, we provide empirical evidence based on a rigourously studied mathematical model for bi-populated networks, that a glass ceiling within the field of NLP has developed since the mid 2000s.

The Word Analogy Testing Caveat
Natalie Schluter
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

There are some important problems in the evaluation of word embeddings using standard word analogy tests. In particular, in virtue of the assumptions made by systems generating the embeddings, these remain tests over randomness. We show that even supposing there were such word analogy regularities that should be detected in the word embeddings obtained via unsupervised means, standard word analogy test implementation practices provide distorted or contrived results. We raise concerns regarding the use of Principal Component Analysis to 2 or 3 dimensions as a provision of visual evidence for the existence of word analogy relations in embeddings. Finally, we propose some solutions to these problems.

Baselines and Test Data for Cross-Lingual Inference
Željko Agić | Natalie Schluter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Empirically Sampling Universal Dependencies
Natalie Schluter | Željko Agić
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

How (not) to train a dependency parser: The curious case of jackknifing part-of-speech taggers
Željko Agić | Natalie Schluter
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In dependency parsing, jackknifing taggers is indiscriminately used as a simple adaptation strategy. Here, we empirically evaluate when and how (not) to use jackknifing in parsing. On 26 languages, we reveal a preference that conflicts with, and surpasses the ubiquitous ten-folding. We show no clear benefits of tagging the training data in cross-lingual parsing.

The limits of automatic summarisation according to ROUGE
Natalie Schluter
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper discusses some central caveats of summarisation, incurred in the use of the ROUGE metric for evaluation, with respect to optimal solutions. The task is NP-hard, of which we give the first proof. Still, as we show empirically for three central benchmark datasets for the task, greedy algorithms empirically seem to perform optimally according to the metric. Additionally, overall quality assurance is problematic: there is no natural upper bound on the quality of summarisation systems, and even humans are excluded from performing optimal summarisation.

2016

CoastalCPH at SemEval-2016 Task 11: The importance of designing your Neural Networks right
Joachim Bingel | Natalie Schluter | Héctor Martínez Alonso
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Multilingual Projection for Parsing Truly Low-Resource Languages
Željko Agić | Anders Johannsen | Barbara Plank | Héctor Martínez Alonso | Natalie Schluter | Anders Søgaard
Transactions of the Association for Computational Linguistics, Volume 4

We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.

Approximate unsupervised summary optimisation for selections of ROUGE
Natalie Schluter | Héctor Martínez Alonso
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)

Approximate summary optimisation for selections of ROUGE It is standard to measure automatic summariser performance using the ROUGE metric. Unfortunately, ROUGE is not appropriate for unsupervised summarisation approaches. On the other hand, we show that it is possible to optimise approximately for ROUGE-n by using a document-weighted ROUGE objective. Doing so results in state-of-the-art summariser performance for single and multiple document summaries for both English and French. This is despite a non-correlation of the documentweighted ROUGE metric with human judgments, unlike the original ROUGE metric. These findings suggest a theoretical approximation link between the two metrics.

2015

Effects of Graph Generation for Unsupervised Non-Contextual Single Document Keyword Extraction
Natalie Schluter
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

This paper presents an exhaustive study on the generation of graph input to unsupervised graph-based non-contextual single document keyword extraction systems. A concrete hypothesis on concept coordination for documents that are scientific articles is put forward, consistent with two separate graph models : one which is based on word adjacency in the linear text–an approach forming the foundation of all previous graph-based keyword extraction methods, and a novel one that is based on word adjacency modulo their modifiers. In doing so, we achieve a best reported NDCG score to date of 0.431 for any system on the same data. In terms of a best parameter f-score, we achieve the highest reported to date (0.714) at a reasonable ranked list cut-off of n = 6, which is also the best reported f-score for any keyword extraction or generation system in the literature on the same data. The best-parameter f-score corresponds to a reduction in error of 12.6% conservatively.

Maximising Spanning Subtree Scores for Parsing Tree Approximations of Semantic Dependency Digraphs
Natalie Schluter
Proceedings of the 14th International Conference on Parsing Technologies

A critical survey on measuring success in rank-based keyword assignment to documents
Natalie Schluter
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Evaluation approaches for unsupervised rank-based keyword assignment are nearly as numerous as are the existing systems. The prolific production of each newly used metric (or metric twist) seems to stem from general dis-satisfaction with the previous one and the source of that dissatisfaction has not previously been discussed in the literature. The difficulty may stem from a poor specification of the keyword assignment task in view of the rank-based approach. With a more complete specification of this task, we aim to show why the previous evaluation metrics fail to satisfy researchers’ goals to distinguish and detect good rank-based keyword assignment systems. We put forward a characterisation of an ideal evaluation metric, and discuss the consistency of the evaluation metrics with this ideal, finding that the average standard normalised cumulative gain metric is most consistent with this ideal.

Unsupervised extractive summarization via coverage maximization with syntactic and semantic concepts
Natalie Schluter | Anders Søgaard
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The complexity of finding the maximum spanning DAG and other restrictions for DAG parsing of natural language
Natalie Schluter
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

2014

Copenhagen-Malmö: Tree Approximations of Semantic Parsing Problems
Natalie Schluter | Anders Søgaard | Jakob Elming | Dirk Hovy | Barbara Plank | Héctor Martínez Alonso | Anders Johanssen | Sigrid Klerke
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

On maximum spanning DAG algorithms for semantic DAG parsing
Natalie Schluter
Proceedings of the ACL 2014 Workshop on Semantic Parsing

Centrality Measures for Non-Contextual Graph-Based Unsupervised Single Document Keyword Extraction
Natalie Schluter
Proceedings of TALN 2014 (Volume 2: Short Papers)

2009

Dependency Parsing Resources for French: Converting Acquired Lexical Functional Grammar F-Structure Annotations and Parsing F-Structures Directly
Natalie Schluter | Josef van Genabith
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

Treebank-Based Acquisition of LFG Parsing Resources for French
Natalie Schluter | Josef van Genabith
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Motivated by the expense in time and other resources to produce hand-crafted grammars, there has been increased interest in automatically obtained wide-coverage grammars from treebanks for natural language processing. In particular, recent years have seen the growth in interest in automatically obtained deep resources that can represent information absent from simple CFG-type structured treebanks and which are considered to produce more language-neutral linguistic representations, such as dependency syntactic trees. As is often the case in early pioneering work on natural language processing, English has provided the focus of first efforts towards acquiring deep-grammar resources, followed by successful treatments of, for example, German, Japanese, Chinese and Spanish. However, no comparable large-scale automatically acquired deep-grammar resources have been obtained for French to date. The goal of this paper is to present the application of treebank-based language acquisition to the case of French. We show that with modest changes to the established parsing architectures, encouraging results can be obtained for French, with an overall best dependency structure f-score of 86.73%.

Co-authors

Venues