2024
The Multilingual Corpus of World’s Constitutions (MCWC)
Mo El-Haj
|
Saad Ezzini
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
The “Multilingual Corpus of World’s Constitutions” (MCWC) serves as a valuable resource for the NLP community, offering a comprehensive collection of constitutions from around the world. Its focus on data quality and breadth of coverage enables advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. The MCWC prepares its data to ensure high quality and minimal noise, while also providing valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading Machine Translation model, fine-tuned on the MCWC to achieve accurate and context-aware translations. Additionally, we introduce an independent Machine Translation model as a comparative baseline. Fine-tuning the model on the MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and Machine Translation. The MCWC’s rich multilingual content and rigorous data quality standards raise the bar for legal text analysis and inspire innovation in the NLP community, opening new avenues for studying constitutional texts and multilingual data analysis.
DARES: Dataset for Arabic Readability Estimation of School Materials
Mo El-Haj
|
Sultan Almujaiwel
|
Damith Premasiri
|
Tharindu Ranasinghe
|
Ruslan Mitkov
Proceedings of the Workshop on DeTermIt! Evaluating Text Difficulty in a Multilingual Context @ LREC-COLING 2024
This research introduces DARES, a dataset for assessing the readability of Arabic text in Saudi school materials. DARES comprises 13,335 instances from textbooks used in 2021 and contains two subtasks: (a) coarse-grained readability assessment, where the text is classified into educational levels such as primary and secondary; and (b) fine-grained readability assessment, where the text is classified into individual grades. We fine-tuned five transformer models that support Arabic and found that CAMeLBERTmix performed the best in all input settings. Evaluation results showed high performance for the coarse-grained readability assessment task, achieving a weighted F1 score of 0.91 and a macro F1 score of 0.79. The fine-grained task achieved a weighted F1 score of 0.68 and a macro F1 score of 0.55. These findings demonstrate the potential of our approach for advancing Arabic text readability assessment in education, with implications for future innovations in the field.
AraFinNLP 2024: The First Arabic Financial NLP Shared Task
Sanad Malaysha
|
Mo El-Haj
|
Saad Ezzini
|
Mohammed Khalilia
|
Mustafa Jarrar
|
Sultan Almujaiwel
|
Ismail Berrada
|
Houda Bouamor
Proceedings of The Second Arabic Natural Language Processing Conference
The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Detection and (ii) Cross-dialect Translation and Intent Preservation. This shared task uses the updated ArBanking77 dataset, which includes about 39k parallel queries in MSA and four dialects. Each query is labeled with one or more of the 77 intents common in the banking domain. These resources aim to foster the development of robust financial Arabic NLP, particularly in machine translation and banking chatbots. A total of 45 unique teams registered for this shared task, with 11 of them actively participating in the test phase. Specifically, 11 teams participated in Subtask 1, while only 1 team participated in Subtask 2. The winning team of Subtask 1 achieved an F1 score of 0.8773, and the only submission to Subtask 2 achieved a BLEU score of 1.667.
2023
Exploring Abstractive Text Summarisation for Podcasts: A Comparative Study of BART and T5 Models
Parth Saxena
|
Mo El-Haj
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Podcasts have become increasingly popular in recent years, resulting in a massive amount of audio content being produced every day. Efficient summarisation of podcast episodes can enable better content management and discovery for users. In this paper, we explore the use of abstractive text summarisation methods to generate high-quality summaries of podcast episodes. We fine-tune pre-trained models, BART and T5, on a dataset of 100K Spotify podcasts. We evaluate our models using automated metrics and human evaluation, and find that the BART model fine-tuned on the podcast dataset achieved higher ROUGE-1 and ROUGE-L scores than the other models, while the T5 model performed better in terms of semantic meaning. The human evaluation indicates that both models produced high-quality summaries that were well received by participants. Our study demonstrates the effectiveness of abstractive summarisation methods for podcast episodes and offers insights for improving the summarisation of audio content.
Unifying Emotion Analysis Datasets using Valence Arousal Dominance (VAD)
Mo El-Haj
|
Ryutaro Takanami
Proceedings of the 4th Conference on Language, Data and Knowledge
FinAraT5: A text to text model for financial Arabic text understanding and generation
Nadhem Zmandar
|
Mo El-Haj
|
Paul Rayson
Proceedings of the 4th Conference on Language, Data and Knowledge
Open-Source Thesaurus Development for Under-Resourced Languages: a Welsh Case Study
Nouran Khallaf
|
Elin Arfon
|
Mo El-Haj
|
Jonathan Morris
|
Dawn Knight
|
Paul Rayson
|
Tymaa Hasanain Hammouda
|
Mustafa Jarrar
Proceedings of the 4th Conference on Language, Data and Knowledge