Hakimeh Fadaee

Also published as: Hakimeh Fadaei


2024

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning
Shivalika Singh | Freddie Vargus | Daniel D’souza | Börje Karlsson | Abinaya Mahendiran | Wei-Yin Ko | Herumb Shandilya | Jay Patel | Deividas Mataciunas | Laura O’Mahony | Mike Zhang | Ramith Hettiarachchi | Joseph Wilson | Marina Machado | Luisa Moura | Dominik Krzemiński | Hakimeh Fadaei | Irem Ergun | Ifeoma Okoh | Aisha Alaagib | Oshan Mudannayake | Zaid Alyafeai | Vu Chien | Sebastian Ruder | Surya Guthikonda | Emad Alghamdi | Sebastian Gehrmann | Niklas Muennighoff | Max Bartolo | Julia Kreutzer | Ahmet Üstün | Marzieh Fadaee | Sara Hooker
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Datasets are foundational to many breakthroughs in modern artificial intelligence. Many recent achievements in the space of natural language processing (NLP) can be attributed to the fine-tuning of pre-trained models on a diverse set of tasks that enables a large language model (LLM) to respond to instructions. Instruction fine-tuning (IFT) requires specifically constructed and annotated datasets. However, existing datasets are almost all in the English language. In this work, our primary goal is to bridge the language gap by building a human-curated instruction-following dataset spanning 65 languages. We worked with fluent speakers of languages from around the world to collect natural instances of instructions and completions. Furthermore, we create the most extensive multilingual collection to date, comprising 513 million instances through templating and augmenting existing datasets across 114 languages. In total, we contribute three key resources: we develop and open-source the Aya Dataset, the Aya Collection, and the Aya Evaluation Suite. The Aya initiative also serves as a valuable case study in participatory research, involving collaborators from 119 countries. We see this as an important framework for future research collaborations that aim to bridge gaps in resources.

2010

Extracting Lexico-conceptual Knowledge for Developing Persian WordNet
Mehrnoush Shamsfard | Hakimeh Fadaei | Elham Fekri
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Semantic lexicons and lexical ontologies are major resources in natural language processing. Developing such resources is a time-consuming task for which some automatic methods have been proposed. This paper describes some methods used in the semi-automatic development of FarsNet, a lexical ontology for the Persian language. FarsNet includes the Persian WordNet with more than 10000 synsets of nouns, verbs and adjectives. In this paper we discuss the extraction of lexico-conceptual relations such as synonymy, antonymy, hyperonymy, hyponymy, meronymy, holonymy and other lexical or conceptual relations between words and concepts (synsets) from Persian resources. Relations are extracted from different resources such as the web, corpora, Wikipedia, Wiktionary, dictionaries and WordNet. In the system presented in this paper, a variety of approaches are applied to the task of relation extraction to extract labeled or unlabeled relations. They exploit the texts, structures, hyperlinks and statistics of web documents as well as the relations of English WordNet and the entries of monolingual and bilingual dictionaries.

2008

A Hybrid Morphology-Based POS Tagger for Persian
Mehrnoush Shamsfard | Hakimeh Fadaee
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In many applications of natural language processing (NLP), grammatically tagged corpora are needed. Thus, Part of Speech (POS) tagging is of high importance in the domain of NLP. Many taggers have been designed with different approaches to reach high performance and accuracy. These taggers usually deal with inter-word relations and make use of lexicons. In this paper we present a new tagging algorithm with a hybrid approach. This algorithm combines the features of probabilistic and rule-based taggers to tag unknown Persian words. In contrast with many other tagging algorithms, this algorithm deals with the internal structure of words and does not need any built-in knowledge. The introduced tagging algorithm is domain independent because it uses morphological rules. In this algorithm, POS tags are assigned to unknown words with a probability that indicates the accuracy of the assigned POS tag. Although this tagger is proposed for Persian, it can be adapted to other languages by applying their morphological rules.
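
The general idea in this abstract, assigning a POS tag and a confidence to an unknown word from its affixes alone, can be illustrated with a toy sketch. This is not the authors' algorithm: the rule table, the romanized affixes, the tag names, the probabilities, and the fallback tag are all hypothetical values chosen only to show the shape of a morphology-driven tagger.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffixRule:
    affix: str   # romanized affix (hypothetical transliteration)
    kind: str    # "prefix" or "suffix"
    tag: str     # POS tag suggested by the affix
    prob: float  # invented confidence, not taken from the paper


# Hypothetical rule table; a real system would derive rules and
# probabilities from Persian morphology and corpus statistics.
RULES = [
    AffixRule("tarin", "suffix", "ADJ", 0.9),  # superlative marker
    AffixRule("tar", "suffix", "ADJ", 0.7),    # comparative marker
    AffixRule("ha", "suffix", "N", 0.8),       # plural marker
    AffixRule("mi", "prefix", "V", 0.6),       # verbal prefix
]

# Assumed fallback when no rule matches.
DEFAULT = ("N", 0.3)


def tag_unknown(word):
    """Return (tag, probability) for a word using only its affixes."""
    scores = {}
    for rule in RULES:
        if rule.kind == "suffix":
            matches = word.endswith(rule.affix)
        else:
            matches = word.startswith(rule.affix)
        # Require a non-empty stem so the affix is not the whole word.
        if matches and len(word) > len(rule.affix):
            scores[rule.tag] = max(scores.get(rule.tag, 0.0), rule.prob)
    if not scores:
        return DEFAULT
    best = max(scores, key=scores.get)
    return best, scores[best]


for w in ["ketabha", "bozorgtarin", "miravam"]:
    print(w, tag_unknown(w))
```

Because the sketch looks only at word-internal structure, it needs no lexicon or sentence context, which mirrors the domain independence claimed in the abstract. Adapting it to another language would amount to swapping the rule table.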