Maraim Masoud


2024

Documenting Geographically and Contextually Diverse Language Data Sources
Angelina McMillan-Major | Francesco De Toni | Zaid Alyafeai | Stella Biderman | Kimbo Chen | Gérard Dupont | Hady Elsahar | Chris Emezue | Alham Fikri Aji | Suzana Ilić | Nurulaqilla Khamis | Colin Leong | Maraim Masoud | Aitor Soroa | Pedro Ortiz Suarez | Daniel van Strien | Zeerak Talat | Yacine Jernite
Northern European Journal of Language Technology, Volume 10

Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.

2022

You reap what you sow: On the Challenges of Bias Evaluation Under Multilingual Settings
Zeerak Talat | Aurélie Névéol | Stella Biderman | Miruna Clinciu | Manan Dey | Shayne Longpre | Sasha Luccioni | Maraim Masoud | Margaret Mitchell | Dragomir Radev | Shanya Sharma | Arjun Subramonian | Jaesung Tae | Samson Tan | Deepak Tunuguntla | Oskar Van Der Wal
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

Evaluating bias, fairness, and social impact in monolingual language models is a difficult task. This challenge is further compounded when language modeling occurs in a multilingual context. Considering the implication of evaluation biases for large multilingual language models, we situate the discussion of bias evaluation within a wider context of social scientific research with computational work. We highlight three dimensions of developing multilingual bias evaluation frameworks: (1) increasing transparency through documentation, (2) expanding targets of bias beyond gender, and (3) addressing cultural differences that exist between languages. We further discuss the power dynamics and consequences of training large language models and recommend that researchers remain cognizant of the ramifications of developing such technologies.

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Zaid Alyafeai | Maraim Masoud | Mustafa Ghaleb | Maged S. Al-shaibani
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most of the published datasets lack metadata annotations that describe their attributes. Moreover, there is no public catalogue that indexes all the publicly available datasets related to specific regions or languages. This issue becomes more prominent when we consider, for example, low-resource dialectal languages. In this paper, we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.

2020

Towards Machine Translation for the Kurdish Language
Sina Ahmadi | Maraim Masoud
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

Machine translation is the task of translating texts from one language to another using computers. It has been one of the major tasks in natural language processing and computational linguistics, motivated by the goal of facilitating human communication. Kurdish, an Indo-European language, has received little attention in this realm because it is less-resourced. Therefore, in this paper, we address the main issues in creating a machine translation system for the Kurdish language, with a focus on the Sorani dialect. We describe the scarce available parallel data suitable for training a neural machine translation model for Sorani Kurdish-English translation. We also discuss some of the major challenges in Kurdish language translation and demonstrate how fundamental text processing tasks, such as tokenization, can improve translation performance.

2019

Leveraging Rule-Based Machine Translation Knowledge for Under-Resourced Neural Machine Translation Models
Daniel Torregrosa | Nivranshu Pasricha | Maraim Masoud | Bharathi Raja Chakravarthi | Juan Alonso | Noe Casas | Mihael Arcan
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks