2024
pdf
bib
abs
NLP_DI at NADI 2024 shared task: Multi-label Arabic Dialect Classifications with an Unsupervised Cross-Encoder
Vani Kanjirangat
|
Tanja Samardzic
|
Ljiljana Dolamic
|
Fabio Rinaldi
Proceedings of The Second Arabic Natural Language Processing Conference
We report the approaches submitted to the NADI 2024 Subtask 1: Multi-label country-level Dialect Identification (MLDID). The core part was to adapt the information from multi-class data for a multi-label dialect classification task. We experimented with supervised and unsupervised strategies to tackle the task in this challenging setting. Under the supervised setup, we used the model trained using NADI 2023 data and devised approaches to convert the multi-class predictions to multi-label by using information from the confusion matrix or using calibrated probabilities. Under unsupervised settings, we used the Arabic-based sentence encoders and multilingual cross-encoders to retrieve similar samples from the training set, considering each test input as a query. The associated labels are then assigned to the input query. We also tried different variations, such as co-occurring dialects derived from the provided development set. We obtained the best validation performance of 48.5% F-score using one of the variations with an unsupervised approach and the same approach yielded the best test result of 43.27% (Ranked 2).
pdf
bib
abs
A Classification-Guided Approach for Adversarial Attacks against Neural Machine Translation
Sahar Sadrizadeh
|
Ljiljana Dolamic
|
Pascal Frossard
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Neural Machine Translation (NMT) models have been shown to be vulnerable to adversarial attacks, wherein carefully crafted perturbations of the input can mislead the target model. In this paper, we introduce ACT, a novel adversarial attack framework against NMT systems guided by a classifier. In our attack, the adversary aims to craft meaning-preserving adversarial examples whose translations in the target language by the NMT model belong to a different class than the original translations. Unlike previous attacks, our new approach has a more substantial effect on the translation by altering the overall meaning, which then leads to a different class determined by an oracle classifier. To evaluate the robustness of NMT models to our attack, we propose enhancements to existing black-box word-replacement-based attacks by incorporating output translations of the target NMT model and the output logits of a classifier within the attack process. Extensive experiments, including a comparison with existing untargeted attacks, show that our attack is considerably more successful in altering the class of the output translation and has more effect on the translation. This new paradigm can reveal the vulnerabilities of NMT systems by focusing on the class of translation rather than the mere translation quality as studied traditionally.
pdf
bib
abs
BUST: Benchmark for the evaluation of detectors of LLM-Generated Text
Joseph Cornelius
|
Oscar Lithgow-Serrano
|
Sandra Mitrovic
|
Ljiljana Dolamic
|
Fabio Rinaldi
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
We introduce BUST, a comprehensive benchmark designed to evaluate detectors of texts generated by instruction-tuned large language models (LLMs). Unlike previous benchmarks, our focus lies on evaluating the performance of detector systems, acknowledging the inevitable influence of the underlying tasks and different LLM generators. Our benchmark dataset consists of 25K texts from humans and 7 LLMs responding to instructions across 10 tasks from 3 diverse sources. Using the benchmark, we evaluated 5 detectors and found substantial performance variance across tasks. A meta-analysis of the dataset characteristics was conducted to guide the examination of detector performance. The dataset was analyzed using diverse metrics assessing linguistic features like fluency and coherence, readability scores, and writer attitudes, such as emotions, convincingness, and persuasiveness. Features impacting detector performance were investigated with surrogate models, revealing emotional content in texts enhanced some detectors, yet the most effective detector demonstrated consistent performance, irrespective of writer’s attitudes and text styles. Our approach focused on investigating relationships between the detectors’ performance and two key factors: text characteristics and LLM generators. We believe BUST will provide valuable insights into selecting detectors tailored to specific text styles and tasks and facilitate a more practical and in-depth investigation of detection systems for LLM-generated text.
pdf
bib
Dialect Identifications with Large Language Models
Vani Kanjirangat
|
Ljiljana Dolamic
|
Fabio Rinaldi
Proceedings of the 9th edition of the Swiss Text Analytics Conference
pdf
bib
Presenting BUST - A benchmark for the evaluation of system detectors of LLM-Generated Text
Joseph Cornelius
|
Oscar William Lithgow Serrano
|
Sandra Mitrović
|
Ljiljana Dolamic
|
Fabio Rinaldi
Proceedings of the 9th edition of the Swiss Text Analytics Conference
2023
pdf
bib
abs
Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT
Benoist Wolleb
|
Romain Silvestri
|
Georgios Vernikos
|
Ljiljana Dolamic
|
Andrei Popescu-Belis
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Subword tokenization is the de-facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently put forward in favor of subwords: shorter encoding of frequent tokens, compositionality of subwords, and ability to deal with unknown words. As their relative importance is not entirely clear yet, we propose a tokenization approach that enables us to separate frequency (the first advantage) from compositionality, thanks to the use of Huffman coding, which tokenizes words using a fixed amount of symbols. Experiments with CS-DE, EN-FR and EN-DE NMT show that frequency alone accounts for approximately 90% of the BLEU scores reached by BPE, hence compositionality has less importance than previously thought.
pdf
bib
abs
A Simplified Training Pipeline for Low-Resource and Unsupervised Machine Translation
Àlex R. Atrio
|
Alexis Allemann
|
Ljiljana Dolamic
|
Andrei Popescu-Belis
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)
Training neural MT systems for low-resource language pairs or in unsupervised settings (i.e. with no parallel data) often involves a large number of auxiliary systems. These may include parent systems trained on higher-resource pairs and used for initializing the parameters of child systems, multilingual systems for neighboring languages, and several stages of systems trained on pseudo-parallel data obtained through back-translation. We propose here a simplified pipeline, which we compare to the best submissions to the WMT 2021 Shared Task on Unsupervised MT and Very Low Resource Supervised MT. Our pipeline only needs two parents, two children, one round of back-translation for low-resource directions and two for unsupervised ones and obtains better or similar scores when compared to more complex alternatives.
pdf
bib
abs
Optimizing the Size of Subword Vocabularies in Dialect Classification
Vani Kanjirangat
|
Tanja Samardžić
|
Ljiljana Dolamic
|
Fabio Rinaldi
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Pre-trained models usually come with a pre-defined tokenization and little flexibility as to what subword tokens can be used in downstream tasks. This problem concerns especially multilingual NLP and low-resource languages, which are typically processed using cross-lingual transfer. In this paper, we aim to find out if the right granularity of tokenization is helpful for a text classification task, namely dialect classification. Aiming at generalizations beyond the studied cases, we look for the optimal granularity in four dialect datasets, two with relatively consistent writing (one Arabic and one Indo-Aryan set) and two with considerably inconsistent writing (one Arabic and one Swiss German set). To gain more control over subword tokenization and ensure direct comparability in the experimental settings, we train a CNN classifier from scratch comparing two subword tokenization methods (Unigram model and BPE). For reference, we compare the results obtained in our analysis to the state of the art achieved by fine-tuning pre-trained models. We show that models trained from scratch with an optimal tokenization level perform better than fine-tuned classifiers in the case of highly inconsistent writing. In the case of relatively consistent writing, fine-tuned models remain better regardless of the tokenization level.
2022
pdf
bib
abs
Early Guessing for Dialect Identification
Vani Kanjirangat
|
Tanja Samardzic
|
Fabio Rinaldi
|
Ljiljana Dolamic
Findings of the Association for Computational Linguistics: EMNLP 2022
This paper deals with the problem of incre-mental dialect identification. Our goal is toreliably determine the dialect before the fullutterance is given as input. The major partof the previous research on dialect identification has been model-centric, focusing on performance. We address a new question: How much input is needed to identify a dialect? Ourapproach is a data-centric analysis that resultsin general criteria for finding the shortest inputneeded to make a plausible guess. Workingwith three sets of language dialects (Swiss German, Indo-Aryan and Arabic languages), weshow that it is possible to generalize across dialects and datasets with two input shorteningcriteria: model confidence and minimal inputlength (adjusted for the input type). The sourcecode for experimental analysis can be found atGithub.
pdf
bib
abs
mattica@SMM4H’22: Leveraging sentiment for stance & premise joint learning
Oscar Lithgow-Serrano
|
Joseph Cornelius
|
Fabio Rinaldi
|
Ljiljana Dolamic
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
This paper describes our submissions to the Social Media Mining for Health Applications (SMM4H) shared task 2022. Our team (mattica) participated in detecting stances and premises in tweets about health mandates related to COVID-19 (Task 2). Our approach was based on using an in-domain Pretrained Language Model, which we fine-tuned by combining different strategies such as leveraging an additional stance detection dataset through two-stage fine-tuning, joint-learning Stance and Premise detection objectives; and ensembling the sentiment-polarity given by an off-the-shelf fine-tuned model.
pdf
bib
abs
NLP DI at NADI Shared Task Subtask-1: Sub-word Level Convolutional Neural Models and Pre-trained Binary Classifiers for Dialect Identification
Vani Kanjirangat
|
Tanja Samardzic
|
Ljiljana Dolamic
|
Fabio Rinaldi
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
In this paper, we describe our systems submitted to the NADI Subtask 1: country-wise dialect classifications. We designed two types of solutions. The first type is convolutional neural network CNN) classifiers trained on subword segments of optimized lengths. The second type is fine-tuned classifiers with BERT-based language specific pre-trained models. To deal with the missing dialects in one of the test sets, we experimented with binary classifiers, analyzing the predicted probability distribution patterns and comparing them with the development set patterns. The better performing approach on the development set was fine-tuning language specific pre-trained model (best F-score 26.59%). On the test set, on the other hand, we obtained the best performance with the CNN model trained on subword tokens obtained with a Unigram model (the best F-score 26.12%). Re-training models on samples of training data simulating missing dialects gave the maximum performance on the test set version with a number of dialects lesser than the training set (F-score 16.44%)
2021
pdf
bib
abs
The IICT-Yverdon System for the WMT 2021 Unsupervised MT and Very Low Resource Supervised MT Task
Àlex R. Atrio
|
Gabriel Luthier
|
Axel Fahy
|
Giorgos Vernikos
|
Andrei Popescu-Belis
|
Ljiljana Dolamic
Proceedings of the Sixth Conference on Machine Translation
In this paper, we present the systems submitted by our team from the Institute of ICT (HEIG-VD / HES-SO) to the Unsupervised MT and Very Low Resource Supervised MT task. We first study the improvements brought to a baseline system by techniques such as back-translation and initialization from a parent model. We find that both techniques are beneficial and suffice to reach performance that compares with more sophisticated systems from the 2020 task. We then present the application of this system to the 2021 task for low-resource supervised Upper Sorbian (HSB) to German translation, in both directions. Finally, we present a contrastive system for HSB-DE in both directions, and for unsupervised German to Lower Sorbian (DSB) translation, which uses multi-task training with various training schedules to improve over the baseline.