Marc Tanti


2024

pdf bib
Tokenisation in Machine Translation Does Matter: The impact of different tokenisation approaches for Maltese
Kurt Abela | Kurt Micallef | Marc Tanti | Claudia Borg
Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024)

In Machine Translation, various tokenisers are used to segment inputs before training a model. Despite tokenisation being mostly considered a solved problem for languages such as English, it is still unclear as to how effective different tokenisers are for morphologically rich languages. This study aims to explore how different approaches to tokenising Maltese impact machine translation results on the English-Maltese language pair.We observed that the OPUS-100 dataset has tokenisation inconsistencies in Maltese. We empirically found that training models on the original OPUS-100 dataset led to the generation of sentences with these issues.We therefore release an updated version of the OPUS-100 parallel English-Maltese dataset, referred to as OPUS-100-Fix, fixing these inconsistencies in Maltese by using the MLRS tokeniser. We show that after fixing the inconsistencies in the dataset, results on the fixed test set increase by 2.49 BLEU points over models trained on the original OPUS-100. We also experiment with different tokenisers, including BPE and SentencePiece to find the ideal tokeniser and vocabulary size for our setup, which was shown to be BPE with a vocabulary size of 8,000. Finally, we train different models in both directions for the ENG-MLT language pair using OPUS-100-Fix by training models from scratch as well as fine-tuning other pre-trained models, namely mBART-50 and NLLB, where a finetuned NLLB model performed the best.

pdf bib
Towards Content Accessibility Through Lexical Simplification for Maltese as a Low-Resource Language
Martina Meli | Marc Tanti | Chris Porter
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

Natural Language Processing techniques have been developed to assist in simplifying online content while preserving meaning. However, for low-resource languages, like Maltese, there are still numerous challenges and limitations. Lexical Simplification (LS) is a core technique typically adopted to improve content accessibility, and has been widely studied for high-resource languages such as English and French. Motivated by the need to improve access to Maltese content and the limitations in this context, this work set out to develop and evaluate an LS system for Maltese text. An LS pipeline was developed consisting of (1) potential complex word identification, (2) substitute generation, (3) substitute selection, and (4) substitute ranking. An evaluation data set was developed to assess the performance of each step. Results are encouraging and will lead to numerous future work. Finally, a single-blind study was carried out with over 200 participants, where the system’s perceived quality in text simplification was evaluated. Results suggest that meaning is retained about 50% of the time, and when meaning is retained, about 70% of system-generated sentences are either perceived as simpler or of equal simplicity to the original. Challenges remain, and this study proposes a number of areas that may benefit from further research.

2023

pdf bib
Cross-lingual Transfer Learning with Persian
Sepideh Mollanorozy | Marc Tanti | Malvina Nissim
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

The success of cross-lingual transfer learning for POS tagging has been shown to be strongly dependent, among other factors, on the (typological and/or genetic) similarity of the low-resource language used for testing and the language(s) used in pre-training or to fine-tune the model. We further unpack this finding in two directions by zooming in on a single language, namely Persian. First, still focusing on POS tagging we run an in-depth analysis of the behaviour of Persian with respect to closely related languages and languages that appear to benefit from cross-lingual transfer with Persian. To do so, we also use the World Atlas of Language Structures to determine which properties are shared between Persian and other languages included in the experiments. Based on our results, Persian seems to be a reasonable potential language for Kurmanji and Tagalog low-resource languages for other tasks as well. Second, we test whether previous findings also hold on a task other than POS tagging to pull apart the benefit of language similarity and the specific task for which such benefit has been shown to hold. We gather sentiment analysis datasets for 31 target languages and through a series of cross-lingual experiments analyse which languages most benefit from Persian as the source. The set of languages that benefit from Persian had very little overlap across the two tasks, suggesting a strong task-dependent component in the usefulness of language similarity in cross-lingual transfer.

2022

pdf bib
Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
Kurt Micallef | Albert Gatt | Marc Tanti | Lonneke van der Plas | Claudia Borg
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing

Multilingual language models such as mBERT have seen impressive cross-lingual transfer to a variety of languages, but many languages remain excluded from these models. In this paper, we analyse the effect of pre-training with monolingual data for a low-resource language that is not included in mBERT – Maltese – with a range of pre-training set ups. We conduct evaluations with the newly pre-trained models on three morphosyntactic tasks – dependency parsing, part-of-speech tagging, and named-entity recognition – and one semantic classification task – sentiment analysis. We also present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance. Our results show that using a mixture of pre-training domains is often superior to using Wikipedia text only. We also find that a fraction of this corpus is enough to make significant leaps in performance over Wikipedia-trained models. We pre-train and compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pretrained multilingual BERT (mBERTu). The models achieve state-of-the-art performance on these tasks, despite the new corpus being considerably smaller than typically used corpora for high-resourced languages. On average, BERTu outperforms or performs competitively with mBERTu, and the largest gains are observed for higher-level tasks.

pdf bib
On the Cusp of Comprehensibility: Can Language Models Distinguish Between Metaphors and Nonsense?
Bernadeta Griciūtė | Marc Tanti | Lucia Donatelli
Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)

Utterly creative texts can sometimes be difficult to understand, balancing on the edge of comprehensibility. However, good language skills and common sense allow advanced language users both to interpret creative texts and to reject some linguistic input as nonsense. The goal of this paper is to evaluate whether the current language models are also able to make the distinction between a creative language use and nonsense. To test this, we have computed mean rank and pseudo-log-likelihood score (PLL) of metaphorical and nonsensical sentences, and fine-tuned several pretrained models (BERT, RoBERTa) for binary classification between the two categories. There was a significant difference in the mean ranks and PPL scores of the categories, and the classifier reached around 85.5% accuracy. The results raise further questions on what could have let to such satisfactory performance.

pdf bib
Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind
Patrizia Paggio | Albert Gatt | Marc Tanti
Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind

pdf bib
Face2Text revisited: Improved data set and baseline results
Marc Tanti | Shaun Abdilla | Adrian Muscat | Claudia Borg | Reuben A. Farrugia | Albert Gatt
Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind

Current image description generation models do not transfer well to the task of describing human faces. To encourage the development of more human-focused descriptions, we developed a new data set of facial descriptions based on the CelebA image data set. We describe the properties of this data set, and present results from a face description generator trained on it, which explores the feasibility of using transfer learning from VGGFace/ResNet CNNs. Comparisons are drawn through both automated metrics and human evaluation by 76 English-speaking participants. The descriptions generated by the VGGFace-LSTM + Attention model are closest to the ground truth according to human evaluation whilst the ResNet-LSTM + Attention model obtained the highest CIDEr and CIDEr-D results (1.252 and 0.686 respectively). Together, the new data set and these experimental results provide data and baselines for future work in this area.

2021

pdf bib
On the Language-specificity of Multilingual BERT and the Impact of Fine-tuning
Marc Tanti | Lonneke van der Plas | Claudia Borg | Albert Gatt
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks – POS tagging and natural language inference – which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on ‘unlearning’ language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model’s limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.

2019

pdf bib
Visually grounded generation of entailments from premises
Somayeh Jafaritazehjani | Albert Gatt | Marc Tanti
Proceedings of the 12th International Conference on Natural Language Generation

Natural Language Inference (NLI) is the task of determining the semantic relationship between a premise and a hypothesis. In this paper, we focus on the generation of hypotheses from premises in a multimodal setting, to generate a sentence (hypothesis) given an image and/or its description (premise) as the input. The main goals of this paper are (a) to investigate whether it is reasonable to frame NLI as a generation task; and (b) to consider the degree to which grounding textual premises in visual information is beneficial to generation. We compare different neural architectures, showing through automatic and human evaluation that entailments can indeed be generated successfully. We also show that multimodal models outperform unimodal models in this task, albeit marginally

2018

pdf bib
Grounded Textual Entailment
Hoa Trong Vu | Claudio Greco | Aliia Erofeeva | Somayeh Jafaritazehjan | Guido Linders | Marc Tanti | Alberto Testoni | Raffaella Bernardi | Albert Gatt
Proceedings of the 27th International Conference on Computational Linguistics

Capturing semantic relations between sentences, such as entailment, is a long-standing challenge for computational semantics. Logic-based models analyse entailment in terms of possible worlds (interpretations, or situations) where a premise P entails a hypothesis H iff in all worlds where P is true, H is also true. Statistical models view this relationship probabilistically, addressing it in terms of whether a human would likely infer H from P. In this paper, we wish to bridge these two perspectives, by arguing for a visually-grounded version of the Textual Entailment task. Specifically, we ask whether models can perform better if, in addition to P and H, there is also an image (corresponding to the relevant “world” or “situation”). We use a multimodal version of the SNLI dataset (Bowman et al., 2015) and we compare “blind” and visually-augmented models of textual entailment. We show that visual information is beneficial, but we also conduct an in-depth error analysis that reveals that current multimodal models are not performing “grounding” in an optimal fashion.

pdf bib
Face2Text: Collecting an Annotated Image Description Corpus for the Generation of Rich Face Descriptions
Albert Gatt | Marc Tanti | Adrian Muscat | Patrizia Paggio | Reuben A Farrugia | Claudia Borg | Kenneth P Camilleri | Michael Rosner | Lonneke van der Plas
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?
Marc Tanti | Albert Gatt | Kenneth Camilleri
Proceedings of the 10th International Conference on Natural Language Generation

Image captioning has evolved into a core task for Natural Language Generation and has also proved to be an important testbed for deep learning approaches to handling multimodal representations. Most contemporary approaches rely on a combination of a convolutional network to handle image features, and a recurrent network to encode linguistic information. The latter is typically viewed as the primary “generation” component. Beyond this high-level characterisation, a CNN+RNN model supports a variety of architectural designs. The dominant model in the literature is one in which visual features encoded by a CNN are “injected” as part of the linguistic encoding process, driving the RNN’s linguistic choices. By contrast, it is possible to envisage an architecture in which visual and linguistic features are encoded separately, and merged at a subsequent stage. In this paper, we address two related questions: (1) Is direct injection the best way of combining multimodal information, or is a late merging alternative better for the image captioning task? (2) To what extent should a recurrent network be viewed as actually generating, rather than simply encoding, linguistic information?

pdf bib
LCT-MALTA’s Submission to RepEval 2017 Shared Task
Hoa Trong Vu | Thuong-Hai Pham | Xiaoyu Bai | Marc Tanti | Lonneke van der Plas | Albert Gatt
Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP

System using BiLSTM and max pooling. Embedding is enhanced by POS, character and dependency info.