2022
Feeding NMT a Healthy Diet – The Impact of Quality, Quantity, or the Right Type of Nutrients
Abdallah Nasir | Sara Alisis | Ruba W Jaikat | Rebecca Jonsson | Sara Qardan | Eyas Shawahneh | Nour Al-Khdour
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)
In the era of gigantic language models, and in our case Neural Machine Translation (NMT) models, where size alone seems to matter, we have been asking ourselves: is it healthy to just feed our NMT model more and more data? In this presentation, we show our findings on how the different data “nutrients” we fed our models affected NMT performance. We explored the impact of the quantity, quality, and type of data we feed to our English-Arabic NMT models. The presentation will show the impact of adding millions of parallel sentences to our training data, as opposed to using a much smaller data set of much higher quality, along with the results of additional experiments with different data nutrients. We will highlight our learnings and challenges, share insights from our Linguistics Quality Assurance team on the advantages and disadvantages of each type of data source, and define the criteria for high-quality data with respect to a healthy NMT diet.
Unlocking the value of bilingual translated documents with Deep Learning Segmentation and Alignment for Arabic
Nour Al-Khdour | Rebecca Jonsson | Ruba W Jaikat | Abdallah Nasir | Sara Alisis | Sara Qardan | Eyas Shawahneh
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)
To unlock the value of high-quality bilingual translated documents, we need parallel data. With sentence-aligned translation pairs, we can fuel our neural machine translation, customize MT, or create translation memories for our clients. Automating this process requires automatic segmentation and alignment. Although Arabic is the fifth-largest language in the world, language technology for Arabic often lags far behind that of other languages. We will show how we struggled to find a proper sentence segmenter for Arabic and instead explored different frameworks, from statistical to deep learning, and ended up fine-tuning our own Arabic DL segmentation model. We will highlight our learnings and challenges with segmenting and aligning Arabic and English bilingual data. Finally, we will show the impact on our proprietary NMT engine as we started to unlock this value and could leverage data that had been translated offline, outside CAT tools, as well as comparable corpora, to feed our NMT.
2021
A Data-Centric Approach to Real-World Custom NMT for Arabic
Rebecca Jonsson | Ruba Jaikat | Abdallah Nasir | Nour Al-Khdour | Sara Alisis
Proceedings of Machine Translation Summit XVIII: Users and Providers Track
In this presentation, we describe our approach to taking Custom NMT to the next level by building tailor-made NMT that fits the needs of businesses seeking to scale in the Arabic-speaking world. In close collaboration with customers in the MENA region and with a deep understanding of their data, we build a variety of NMT models that accommodate the unique challenges of the Arabic language. This session will provide insights into the challenges of acquiring, analyzing, and processing customer data in various sectors, as well as into how best to use this data to build high-quality Custom NMT models for English-Arabic. Feedback from the use of these models in production will be provided. Furthermore, we will show how to use our translation management system to make the most of custom NMT by leveraging the models, fine-tuning them, and continuing to improve them over time.
2020
JUSTMasters at SemEval-2020 Task 3: Multilingual Deep Learning Model to Predict the Effect of Context in Word Similarity
Nour Al-khdour | Mutaz Bni Younes | Malak Abdullah | Mohammad AL-Smadi
Proceedings of the Fourteenth Workshop on Semantic Evaluation
There is growing research interest in studying word similarity. Two words that are similar in one context may be considered different in another. This paper therefore investigates the effect of context on word similarity. The SemEval-2020 workshop provided a shared task (Task 3: Predicting the (Graded) Effect of Context in Word Similarity). In this task, the organizers provided unlabeled datasets for four languages: English, Croatian, Finnish, and Slovenian. Our team, JUSTMasters, participated in this competition in the two subtasks: A and B. Our approach used a weighted average ensembling method over different pretrained embedding techniques for each of the four languages. Our proposed model outperformed the baseline models in both subtasks and achieved the best result for subtask 2 in English and Finnish, with scores of 0.725 and 0.68 respectively. We were ranked sixth for subtask 1, with scores for English, Croatian, Finnish, and Slovenian as follows: 0.738, 0.44, 0.546, 0.512.
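As a rough illustration of weighted average ensembling over several embedding models, the sketch below combines per-model cosine similarities for one word pair. The abstract does not give the models, the weights, or how the similarities were computed, so the function names, the use of cosine similarity, and the example vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_ensemble_similarity(vector_pairs, weights):
    """Weighted average of per-model cosine similarities.

    vector_pairs: one (vec_word1, vec_word2) tuple per embedding model
                  (hypothetical inputs; each model may use its own dimension).
    weights:      one non-negative weight per model (assumed, not from the paper).
    """
    if len(vector_pairs) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * cosine(a, b) for (a, b), w in zip(vector_pairs, weights))
    return weighted / total


# Toy usage: model 1 sees the words as identical, model 2 as orthogonal.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
score = weighted_ensemble_similarity(pairs, weights=[3.0, 1.0])
```

With contextual embeddings, the same averaging could be applied once per context and the two ensemble scores compared to estimate how the context shifts similarity; that pipeline is likewise an assumption, not a description of the submitted system.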
Team Alexa at NADI Shared Task
Mutaz Younes | Nour Al-khdour | Mohammad AL-Smadi
Proceedings of the Fifth Arabic Natural Language Processing Workshop
In this paper, we discuss our team’s work on the NADI Shared Task. The task requires classifying Arabic tweets among 21 dialects. We tested different approaches, and the best one was the simplest. Our best submission used a Multinomial Naive Bayes (MNB) classifier (Small and Hsiao, 1985) with n-grams as features. Despite its simplicity, this classifier shows better results than complicated models such as BERT. Our best submitted score was 17% F1-score and 35% accuracy.
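A minimal sketch of a Multinomial Naive Bayes classifier over n-gram counts is given below. The abstract does not say whether the n-grams were word-level or character-level, which n-gram range was used, or how smoothing was set, so character 1- to 3-grams, add-one smoothing, the class name `TinyMultinomialNB`, and the placeholder training strings are all assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """All character n-grams of length n_min..n_max (assumed feature choice)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class TinyMultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        self.totals = {c: sum(self.feature_counts[c].values())
                       for c in self.class_counts}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # n-grams never seen in training are ignored.
        grams = [g for g in char_ngrams(text) if g in self.vocab]
        best_label, best_score = None, -math.inf
        for c in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            score = math.log(self.class_counts[c] / n_docs)
            denom = self.totals[c] + self.alpha * vocab_size
            for g in grams:
                score += math.log((self.feature_counts[c][g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy usage with placeholder strings standing in for tweets and dialect labels.
model = TinyMultinomialNB().fit(["aaaa", "aaba", "zzzz", "zzyz"],
                                ["A", "A", "B", "B"])
```

Calling `model.predict(...)` on a new string returns the label with the highest log posterior; in a real setting the training texts would be the labeled tweets and the labels the dialect classes.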