2022
Developing a Tag-Set and Extracting the Morphological Lexicons to Build a Morphological Analyzer for Egyptian Arabic
Amany Fashwan | Sameh Alansary
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)
This paper reports on work in progress toward building a morphological analyzer for Egyptian Arabic (EGY). To build such a tool, a tag-set schema is developed based on a corpus of 527,000 EGY words covering different sources and genres. This tag-set schema is used to morphologically annotate about 318,940 words according to their contexts. Each annotated word is associated with its suitable prefix(es), original stem, tag, suffix(es), glossary, number, gender, definiteness, and conventional lemma and stem. These morphologically annotated words are, in turn, used to develop the proposed morphological analyzer, for which the morphological lexicons and the compatibility tables are extracted and tested. The system is compared with one of the best EGY morphological analyzers, CALIMA.
2017
SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts
Amany Fashwan | Sameh Alansary
Proceedings of the Third Arabic Natural Language Processing Workshop
This paper presents SHAKKIL, a system for automatically diacritizing Arabic texts. In this system, the diacritization problem is handled at two levels: morphological and syntactic processing. The adopted morphological disambiguation algorithm depends on four layers: a uni-morphological form layer, a rule-based morphological disambiguation layer, a statistical disambiguation layer, and an Out Of Vocabulary (OOV) layer. The adopted syntactic disambiguation algorithm detects case-ending diacritics using a rule-based approach that simulates shallow parsing. An annotated corpus is used for extracting the Arabic linguistic rules, building the language models, and testing the system output. The system is considered a good trial of the interaction between the rule-based and statistical approaches, where the rules can help the statistics in detecting the right diacritization and vice versa. At this point, the morphological Word Error Rate (WER) is 4.56%, the morphological Diacritic Error Rate (DER) is 1.88%, and the syntactic WER is 9.36%. The best WER is 14.78%, compared to the best published results of 11.68% (Abandah, 2015), 12.90% (Rashwan, et al., 2015), and 13.70% (Metwally, Rashwan, & Atiya, 2016).
2014
MUHIT: A Multilingual Harmonized Dictionary
Sameh Alansary
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper discusses a trial to build a multilingual harmonized dictionary that contains more than 40 languages, with special reference to Arabic, which represents about 20% of the whole size of the dictionary. The dictionary, called MUHIT, is an interactive multilingual dictionary application. It is a web application, which makes it easily accessible to all users. MUHIT is developed within the Universal Networking Language (UNL) framework by the UNDL Foundation, in cooperation with Bibliotheca Alexandrina (BA). The application aims to serve specialists and non-specialists alike. It provides users with a full linguistic description of each lexical item. This free application is useful for many NLP tasks such as multilingual translation and cross-language synonym search. The dictionary is built using WordNet and corpus-based approaches, in a specially designed linguistic environment called UNLariam that is developed by the UNDL Foundation. It is the first application launched by the UNDL Foundation.
The International Corpus of Arabic: Compilation, Analysis and Evaluation
Sameh Alansary | Magdy Nagi
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)
2012
A Formalized Reference Grammar for UNL-Based Machine Translation between English and Arabic
Sameh Alansary
Proceedings of COLING 2012: Posters