Kurt Micallef


2024

pdf bib
UM IWSLT 2024 Low-Resource Speech Translation: Combining Maltese and North Levantine Arabic
Sara Nabhani | Aiden Williams | Miftahul Jannat | Kate Rebecca Belcher | Melanie Galea | Anna Taylor | Kurt Micallef | Claudia Borg
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

The IWSLT low-resource track encourages innovation in the field of speech translation, particularly in data-scarce conditions. This paper details our submission for the IWSLT 2024 low-resource track shared task for Maltese-English and North Levantine Arabic-English spoken language translation using an unconstrained pipeline approach. Using language models, we improve ASR performance by correcting the produced output. We present a 2 step approach for MT using data from external sources showing improvements over baseline systems. We also explore transliteration as a means to further augment MT data and exploit the cross-lingual similarities between Maltese and Arabic.

pdf bib
Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching
Kurt Micallef | Nizar Habash | Claudia Borg | Fadhl Eryani | Houda Bouamor
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model’s pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation to mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune the processed data on four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.

pdf bib
Tokenisation in Machine Translation Does Matter: The impact of different tokenisation approaches for Maltese
Kurt Abela | Kurt Micallef | Marc Tanti | Claudia Borg
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

In Machine Translation, various tokenisers are used to segment inputs before training a model. Despite tokenisation being mostly considered a solved problem for languages such as English, it is still unclear as to how effective different tokenisers are for morphologically rich languages. This study aims to explore how different approaches to tokenising Maltese impact machine translation results on the English-Maltese language pair.We observed that the OPUS-100 dataset has tokenisation inconsistencies in Maltese. We empirically found that training models on the original OPUS-100 dataset led to the generation of sentences with these issues.We therefore release an updated version of the OPUS-100 parallel English-Maltese dataset, referred to as OPUS-100-Fix, fixing these inconsistencies in Maltese by using the MLRS tokeniser. We show that after fixing the inconsistencies in the dataset, results on the fixed test set increase by 2.49 BLEU points over models trained on the original OPUS-100. We also experiment with different tokenisers, including BPE and SentencePiece to find the ideal tokeniser and vocabulary size for our setup, which was shown to be BPE with a vocabulary size of 8,000. Finally, we train different models in both directions for the ENG-MLT language pair using OPUS-100-Fix by training models from scratch as well as fine-tuning other pre-trained models, namely mBART-50 and NLLB, where a finetuned NLLB model performed the best.

pdf bib
Topic Classification and Headline Generation for Maltese Using a Public News Corpus
Amit Kumar Chaudhary | Kurt Micallef | Claudia Borg
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The development of NLP tools for low-resource languages is impeded by the lack of data. While recent unsupervised pre-training approaches ease this requirement, the need for labelled data is crucial to progress the development of such tools. Moreover, publicly available datasets for such languages typically cover low-level syntactic tasks. In this work, we introduce new semantic datasets for Maltese generated automatically using associated metadata from a corpus in the news domain. The datasets are a news tag multi-label classification and a news abstractive summarisation task by generating its title. We also present an evaluation using publicly available models as baselines. Our results show that current models are lacking the semantic knowledge required to solve such tasks, shedding light on the need to use better modelling approaches for Maltese.

2023

pdf bib
UM-DFKI Maltese Speech Translation
Aiden Williams | Kurt Abela | Rishu Kumar | Martin Bär | Hannah Billinghurst | Kurt Micallef | Ahnaf Mozib Samin | Andrea DeMarco | Lonneke van der Plas | Claudia Borg
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

For the 2023 IWSLT Maltese Speech Translation Task, UM-DFKI jointly presents a cascade solution which achieves 0.6 BLEU. While this is the first time that a Maltese speech translation task has been released by IWSLT, this paper explores previous solutions for other speech translation tasks, focusing primarily on low-resource scenarios. Moreover, we present our method of fine-tuning XLS-R models for Maltese ASR using a collection of multi-lingual speech corpora as well as the fine-tuning of the mBART model for Maltese to English machine translation.

pdf bib
Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect
Kurt Micallef | Fadhl Eryani | Nizar Habash | Houda Bouamor | Claudia Borg
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

Multilingual models such as mBERT have been demonstrated to exhibit impressive crosslingual transfer for a number of languages. Despite this, the performance drops for lowerresourced languages, especially when they are not part of the pre-training setup and when there are script differences. In this work we consider Maltese, a low-resource language of Arabic and Romance origins written in Latin script. Specifically, we investigate the impact of transliterating Maltese into Arabic scipt on a number of downstream tasks: Part-of-Speech Tagging, Dependency Parsing, and Sentiment Analysis. We compare multiple transliteration pipelines ranging from deterministic character maps to more sophisticated alternatives, including manually annotated word mappings and non-deterministic character mappings. For the latter, we show that selection techniques using n-gram language models of Tunisian Arabic, the dialect with the highest degree of mutual intelligibility to Maltese, yield better results on downstream tasks. Moreover, our experiments highlight that the use of an Arabic pre-trained model paired with transliteration outperforms mBERT. Overall, our results show that transliterating Maltese can be considered an option to improve the cross-lingual transfer capabilities.

2022

pdf bib
Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
Kurt Micallef | Albert Gatt | Marc Tanti | Lonneke van der Plas | Claudia Borg
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT – Maltese – with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks – dependency parsing, part-of-speech tagging, and named-entity recognition – and one semantic classification task – sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pretrained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than typically used corpora for high-resourced languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.