Lonneke van der Plas

Also published as: Lonneke Van Der Plas


2024

pdf bib
A Comparison of Different Tokenization Methods for the Georgian Language
Beso Mikaberidze | Temo Saghinadze | Guram Mikaberidze | Raphael Kalandadze | Konstantine Pkhakadze | Josef van Genabith | Simon Ostermann | Lonneke van der Plas | Philipp Müller
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

2023

pdf bib
Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance
Molly Petersen | Lonneke van der Plas
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

While analogies are a common way to evaluate word embeddings in NLP, it is also of interest to investigate whether or not analogical reasoning is a task in itself that can be learned. In this paper, we test several ways to learn basic analogical reasoning, specifically focusing on analogies that are more typical of what is used to evaluate analogical reasoning in humans than those in commonly used NLP benchmarks. Our experiments find that models are able to learn analogical reasoning, even with a small amount of data. We additionally compare our models to a dataset with a human baseline, and find that after training models approach human performance.

pdf bib
FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN
Milind Agarwal | Sweta Agrawal | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | Mingda Chen | William Chen | Khalid Choukri | Alexandra Chronopoulou | Anna Currey | Thierry Declerck | Qianqian Dong | Kevin Duh | Yannick Estève | Marcello Federico | Souhir Gahbiche | Barry Haddow | Benjamin Hsu | Phu Mon Htut | Hirofumi Inaguma | Dávid Javorský | John Judge | Yasumasa Kano | Tom Ko | Rishu Kumar | Pengwei Li | Xutai Ma | Prashant Mathur | Evgeny Matusov | Paul McNamee | John P. McCrae | Kenton Murray | Maria Nadejde | Satoshi Nakamura | Matteo Negri | Ha Nguyen | Jan Niehues | Xing Niu | Atul Kr. Ojha | John E. Ortega | Proyag Pal | Juan Pino | Lonneke van der Plas | Peter Polák | Elijah Rippeth | Elizabeth Salesky | Jiatong Shi | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Yun Tang | Brian Thompson | Kevin Tran | Marco Turchi | Alex Waibel | Mingxuan Wang | Shinji Watanabe | Rodolfo Zevallos
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

pdf bib
UM-DFKI Maltese Speech Translation
Aiden Williams | Kurt Abela | Rishu Kumar | Martin Bär | Hannah Billinghurst | Kurt Micallef | Ahnaf Mozib Samin | Andrea DeMarco | Lonneke van der Plas | Claudia Borg
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

For the 2023 IWSLT Maltese Speech Translation Task, UM-DFKI jointly presents a cascade solution which achieves 0.6 BLEU. While this is the first time that a Maltese speech translation task has been released by IWSLT, this paper explores previous solutions for other speech translation tasks, focusing primarily on low-resource scenarios. Moreover, we present our method of fine-tuning XLS-R models for Maltese ASR using a collection of multi-lingual speech corpora as well as the fine-tuning of the mBART model for Maltese to English machine translation.

2022

pdf bib
Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
Kurt Micallef | Albert Gatt | Marc Tanti | Lonneke van der Plas | Claudia Borg
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT – Maltese – with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks – dependency parsing, part-of-speech tagging, and named-entity recognition – and one semantic classification task – sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pretrained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than typically used corpora for high-resourced languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.

pdf bib
A Corpus and Evaluation for Predicting Semi-Structured Human Annotations
Andreas Marfurt | Ashley Thornton | David Sylvan | Lonneke van der Plas | James Henderson
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)

A wide variety of tasks have been framed as text-to-text tasks to allow processing by sequence-to-sequence models. We propose a new task of generating a semi-structured interpretation of a source document. The interpretation is semi-structured in that it contains mandatory and optional fields with free-text information. This structure is surfaced by human annotations, which we standardize and convert to text format. We then propose an evaluation technique that is generally applicable to any such semi-structured annotation, called equivalence classes evaluation. The evaluation technique is efficient and scalable; it creates a large number of evaluation instances from a comparably cheap clustering of the free-text information by domain experts. For our task, we release a dataset about the monetary policy of the Federal Reserve. On this corpus, our evaluation shows larger differences between pretrained models than standard text generation metrics.

2021

pdf bib
On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning
Marc Tanti | Lonneke van der Plas | Claudia Borg | Albert Gatt
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks – POS tagging and natural language inference – which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on ‘unlearning’ language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model’s limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.

2020

pdf bib
Annotating for Hate Speech: The MaNeCo Corpus and Some Input from Critical Discourse Analysis
Stavros Assimakopoulos | Rebecca Vella Muskat | Lonneke van der Plas | Albert Gatt
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents a novel scheme for the annotation of hate speech in corpora of Web 2.0 commentary. The proposed scheme is motivated by the critical analysis of posts made in reaction to news reports on the Mediterranean migration crisis and LGBTIQ+ matters in Malta, which was conducted under the auspices of the EU-funded C.O.N.T.A.C.T. project. Based on the realisation that hate speech is not a clear-cut category to begin with, appears to belong to a continuum of discriminatory discourse and is often realised through the use of indirect linguistic means, it is argued that annotation schemes for its detection should refrain from directly including the label ‘hate speech,’ as different annotators might have different thresholds as to what constitutes hate speech and what not. In view of this, we propose a multi-layer annotation scheme, which is pilot-tested against a binary ±hate speech classification and appears to yield higher inter-annotator agreement. Motivating the postulation of our scheme, we then present the MaNeCo corpus on which it will eventually be used; a substantial corpus of on-line newspaper comments spanning 10 years.

pdf bib
MASRI-HEADSET: A Maltese Corpus for Speech Recognition
Carlos Daniel Hernandez Mena | Albert Gatt | Andrea DeMarco | Claudia Borg | Lonneke van der Plas | Amanda Muscat | Ian Padovani
Proceedings of the Twelfth Language Resources and Evaluation Conference

Maltese, the national language of Malta, is spoken by approximately 500,000 people. Speech processing for Maltese is still in its early stages of development. In this paper, we present the first spoken Maltese corpus designed purposely for Automatic Speech Recognition (ASR). The MASRI-HEADSET corpus was developed by the MASRI project at the University of Malta. It consists of 8 hours of speech paired with text, recorded by using short text snippets in a laboratory environment. The speakers were recruited from different geographical locations all over the Maltese islands, and were roughly evenly distributed by gender. This paper also presents some initial results achieved in baseline experiments for Maltese ASR using Sphinx and Kaldi. The MASRI HEADSET Corpus is publicly available for research/academic purposes.

pdf bib
Word Probability Findings in the Voynich Manuscript
Colin Layfield | Lonneke van der Plas | Michael Rosner | John Abela
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

The Voynich Manuscript has baffled scholars for centuries. Some believe the elaborate 15th century codex to be a hoax whilst others believe it is a real medieval manuscript whose contents are as yet unknown. In this paper, we provide additional evidence that the text of the manuscript displays the hallmarks of a proper natural language with respect to the relationship between word probabilities and (i) average information per subword segment and (ii) the relative positioning of consecutive subword segments necessary to uniquely identify words of different probabilities.

2019

pdf bib
Measuring the Compositionality of Noun-Noun Compounds over Time
Prajit Dhar | Janis Pagel | Lonneke van der Plas
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We present work in progress on the temporal progression of compositionality in noun-noun compounds. Previous work has proposed computational methods for determining the compositionality of compounds. These methods try to automatically determine how transparent the meaning of the compound as a whole is with respect to the meaning of its parts. We hypothesize that such a property might change over time. We use the time-stamped Google Books corpus for our diachronic investigations, and first examine whether the vector-based semantic spaces extracted from this corpus are able to predict compositionality ratings, despite their inherent limitations. We find that using temporal information helps predicting the ratings, although correlation with the ratings is lower than reported for other corpora. Finally, we show changes in compositionality over time for a selection of compounds.

pdf bib
Learning to Predict Novel Noun-Noun Compounds
Prajit Dhar | Lonneke van der Plas
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)

We introduce temporally and contextually-aware models for the novel task of predicting unseen but plausible concepts, as conveyed by noun-noun compounds in a time-stamped corpus. We train compositional models on observed compounds, more specifically the composed distributed representations of their constituents across a time-stamped corpus, while giving it corrupted instances (where head or modifier are replaced by a random constituent) as negative evidence. The model captures generalisations over this data and learns what combinations give rise to plausible compounds and which ones do not. After training, we query the model for the plausibility of automatically generated novel combinations and verify whether the classifications are accurate. For our best model, we find that in around 85% of the cases, the novel compounds generated are attested in previously unseen data. An additional estimated 5% are plausible despite not being attested in the recent corpus, based on judgments from independent human raters.

2018

pdf bib
Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions
Albert Gatt | Marc Tanti | Adrian Muscat | Patrizia Paggio | Reuben A Farrugia | Claudia Borg | Kenneth P Camilleri | Michael Rosner | Lonneke van der Plas
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Survey: Multiword Expression Processing: A Survey
Mathieu Constant | Gülşen Eryiǧit | Johanna Monti | Lonneke van der Plas | Carlos Ramisch | Michael Rosner | Amalia Todirascu
Computational Linguistics, Volume 43, Issue 4 - December 2017

Multiword expressions (MWEs) are a class of linguistic forms spanning conventional word boundaries that are both idiosyncratic and pervasive across different languages. The structure of linguistic processing that depends on the clear distinction between words and phrases has to be re-thought to accommodate MWEs. The issue of MWE handling is crucial for NLP applications, where it raises a number of challenges. The emergence of solutions in the absence of guiding principles motivates this survey, whose aim is not only to provide a focused review of MWE processing, but also to clarify the nature of interactions between MWE processing and downstream applications. We propose a conceptual framework within which challenges and research contributions can be positioned. It offers a shared understanding of what is meant by “MWE processing,” distinguishing the subtasks of MWE discovery and identification. It also elucidates the interactions between MWE processing and two use cases: Parsing and machine translation. Many of the approaches in the literature can be differentiated according to how MWE processing is timed with respect to underlying use cases. We discuss how such orchestration choices affect the scope of MWE-aware systems. For each of the two MWE processing subtasks and for each of the two use cases, we conclude on open issues and research perspectives.

pdf bib
Evaluating Compound Splitters Extrinsically with Textual Entailment
Glorianna Jagfeld | Patrick Ziering | Lonneke van der Plas
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Traditionally, compound splitters are evaluated intrinsically on gold-standard data or extrinsically on the task of statistical machine translation. We explore a novel way for the extrinsic evaluation of compound splitters, namely recognizing textual entailment. Compound splitting has great potential for this novel task that is both transparent and well-defined. Moreover, we show that it addresses certain aspects that are either ignored in intrinsic evaluations or compensated for by taskinternal mechanisms in statistical machine translation. We show significant improvements using different compound splitting methods on a German textual entailment dataset.

pdf bib
LCT-MALTA’s Submission to RepEval 2017 Shared Task
Hoa Trong Vu | Thuong-Hai Pham | Xiaoyu Bai | Marc Tanti | Lonneke van der Plas | Albert Gatt
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP

System using BiLSTM and max pooling. Embedding is enhanced by POS, character and dependency info.

2016

pdf bib
Towards Unsupervised and Language-independent Compound Splitting using Inflectional Morphological Transformations
Patrick Ziering | Lonneke van der Plas
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Top a Splitter: Using Distributional Semantics for Improving Compound Splitting
Patrick Ziering | Stefan Müller | Lonneke van der Plas
Proceedings of the 12th Workshop on Multiword Expressions

pdf bib
The Grammar of English Deverbal Compounds and their Meaning
Gianina Iordăchioaia | Lonneke van der Plas | Glorianna Jagfeld
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)

We present an interdisciplinary study on the interaction between the interpretation of noun-noun deverbal compounds (DCs; e.g., task assignment) and the morphosyntactic properties of their deverbal heads in English. Underlying hypotheses from theoretical linguistics are tested with tools and resources from computational linguistics. We start with Grimshaw’s (1990) insight that deverbal nouns are ambiguous between argument-supporting nominal (ASN) readings, which inherit verbal arguments (e.g., the assignment of the tasks), and the less verbal and more lexicalized Result Nominal and Simple Event readings (e.g., a two-page assignment). Following Grimshaw, our hypothesis is that the former will realize object arguments in DCs, while the latter will receive a wider range of interpretations like root compounds headed by non-derived nouns (e.g., chocolate box). Evidence from a large corpus assisted by machine learning techniques confirms this hypothesis, by showing that, besides other features, the realization of internal arguments by deverbal heads outside compounds (i.e., the most distinctive ASN-property in Grimshaw 1990) is a good predictor for an object interpretation of non-heads in DCs.

2015

pdf bib
Towards a Better Semantic Role Labeling of Complex Predicates
Glorianna Jagfeld | Lonneke van der Plas
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib
One Tree is not Enough: Cross-lingual Accumulative Structure Transfer for Semantic Indeterminacy
Patrick Ziering | Lonneke van der Plas
Proceedings of the International Conference Recent Advances in Natural Language Processing

pdf bib
From a Distance: Using Cross-lingual Word Alignments for Noun Compound Bracketing
Patrick Ziering | Lonneke van der Plas
Proceedings of the 11th International Conference on Computational Semantics

pdf bib
Predicting Pronouns across Languages with Continuous Word Spaces
Ngoc-Quan Pham | Lonneke van der Plas
Proceedings of the Second Workshop on Discourse in Machine Translation

2014

pdf bib
What good are ‘Nominalkomposita’ for ‘noun compounds’: Multilingual Extraction and Structure Analysis of Nominal Compositions using Linguistic Restrictors
Patrick Ziering | Lonneke van der Plas
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Global Methods for Cross-lingual Semantic Role and Predicate Labelling
Lonneke van der Plas | Marianna Apidianaki | Chenhua Chen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Cross-lingual Word Sense Disambiguation for Predicate Labelling of French
Lonneke van der Plas | Marianna Apidianaki
Proceedings of TALN 2014 (Volume 1: Long Papers)

2013

bib
Bitexts as Semantic Mirrors
Jörg Tiedemann | Lonneke van der Plas | Begoña Villada Moirón
Proceedings of the Workshop on Twenty Years of Bitext

pdf bib
Multilingual Lexicon Bootstrapping - Improving a Lexicon Induction System Using a Parallel Corpus
Patrick Ziering | Lonneke van der Plas | Hinrich Schütze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Bootstrapping Semantic Lexicons for Technical Domains
Patrick Ziering | Lonneke van der Plas | Hinrich Schütze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2011

pdf bib
Scaling up Automatic Cross-Lingual Semantic Role Annotation
Lonneke van der Plas | Paola Merlo | James Henderson
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
Cross-Lingual Validity of PropBank in the Manual Annotation of French
Lonneke van der Plas | Tanja Samardžić | Paola Merlo
Proceedings of the Fourth Linguistic Annotation Workshop

pdf bib
Finding Medical Term Variations using Parallel Corpora and Distributional Similarity
Lonneke van der Plas | Jörg Tiedemann
Proceedings of the 6th Workshop on Ontologies and Lexical Resources

2009

pdf bib
Domain Adaptation with Artificial Data for Semantic Parsing of Speech
Lonneke van der Plas | James Henderson | Paola Merlo
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

pdf bib
Abstraction and Generalisation in Semantic Role Labels: PropBank, VerbNet or both?
Paola Merlo | Lonneke Van Der Plas
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Combining Syntactic Co-occurrences and Nearest Neighbours in Distributional Methods to Remedy Data Sparseness.
Lonneke van der Plas
Proceedings of the Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics

2008

pdf bib
Using Lexico-Semantic Information for Query Expansion in Passage Retrieval for Question Answering
Lonneke van der Plas | Jörg Tiedemann
Coling 2008: Proceedings of the 2nd workshop on Information Retrieval for Question Answering

2006

pdf bib
Finding Synonyms Using Automatic Word Alignment and Measures of Distributional Similarity
Lonneke van der Plas | Jörg Tiedemann
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf bib
Automatic Acquisition of Lexico-semantic Knowledge for QA
Lonneke van der Plas | Gosse Bouma
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources

2004

pdf bib
Automatic Keyword Extraction from Spoken Text. A Comparison of Two Lexical Resources: EDR and WordNet
Lonneke van der Plas | Vincenzo Pallotta | Martin Rajman | Hatem Ghorbel
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Search
Co-authors
Fix data