A description and demonstration of the SAFAR framework

Several tools and resources have been developed for Arabic NLP. However, a homogeneous and flexible Arabic environment that gathers these components is rarely available. We therefore introduce SAFAR, a monolingual framework developed in accordance with software engineering requirements and dedicated to the Arabic language, especially Modern Standard Arabic and the Moroccan dialect. After a decade of integration and development, SAFAR today offers more than 50 tools and resources that can be used through either its API or its web interface.


Introduction
NLP infrastructures, also referred to as NLP architectures, are an efficient means of standardization, optimization of effort, collaboration and acceleration of development in the field of NLP. Over the last decade, the NLP research community has witnessed an extensive release of such infrastructures. Some have become very well known, such as GATE 1 or Stanford CoreNLP 2, while others existed only for a very short time. Some are multilingual while others are not, some target multiple domains while others do not, etc. However, only a few of them are dedicated to a single language, such as AraNLP (Althobaiti et al. 2014) or the "ITU Turkish Natural Language Processing Pipeline" (G. Eryiğit, 2014).
On the other hand, the literature shows that existing infrastructures use three different names without distinction: "toolkit", "platform" and "framework". From the Software Engineering (SE) perspective, these names have different meanings. It is therefore necessary to define them before presenting, categorizing and benchmarking NLP infrastructures. Briefly speaking 3, a toolkit is a set of tools within a single box used for a particular purpose. A platform consists of several interoperable tools with a homogeneous structure but without any API to extend its components. A framework is a layered structure developed to be used as a support and guide to build NLP programs and tools.
In this work, we focus on Arabic language infrastructures. We demonstrate that the "Software Architecture for ARabic" (SAFAR) framework 4 is one of the most interesting frameworks to consider when developing any Arabic NLP component. The rest of this article is organized as follows. Section 2 presents SAFAR in terms of principles, architecture and standards. Section 3 describes the content of SAFAR. Section 4 is dedicated to the use and exploitation of SAFAR. Finally, the last section concludes the paper.

SAFAR framework

Principles
In most cases, developing Arabic NLP applications requires several tools at once, each dealing with a certain level of the language. These tools are generally heterogeneous and raise many SE problems such as interoperability, reusability and portability. Moreover, researchers usually need not only tools but also Language Resources (LRs). To overcome these SE issues and to meet the needs of the Arabic NLP (ANLP) community in terms of processing Arabic effectively and providing reusable LRs, we developed SAFAR as a software architecture for Arabic with the following principles:
• Integrate not only tools and programs but also LRs;
• Structure the architecture to integrate two types of Arabic, namely Modern Standard Arabic (MSA) and dialects;
• Respect the features of the Arabic language in the structure of the architecture;
• Develop tools or LRs when the available ones are not satisfactory;
• Make the architecture usable not only by computer scientists but also by linguists;
• Involve computer scientists, statisticians and linguists in our team.
In general, our philosophy is not to develop all the NLP layers and modules ourselves, but to integrate existing ones consistently. Consequently, our approach consists in providing the specifications, in terms of APIs, for each module of our architecture, and in providing (where available) implementations of these APIs based on tools that have proved efficient and are published under a free license such as GNU GPL, Apache or Non-Commercial Software. The main challenge faced during this integration process is to develop bridges between the different programming languages of the tools and the different data structures of the resources, so that they can all be used in a single environment. However, when modules and LRs are not available, we develop them from scratch inside SAFAR.
It is worth mentioning that once a given level reaches a certain threshold of maturity (as is the case for stemming as of the third release), it is no longer useful to keep integrating every new implementation of that level. The framework nevertheless remains open enough to allow researchers to do so if needed.

Architecture
SAFAR is a Java-based framework dedicated to Arabic Natural Language Processing. As shown in Figure 1, SAFAR has several layers that provide services directly usable by other layers in accordance with the relationships modeled with arrows in the figure.
• Basic: designed to implement tools dealing with morphology, syntax and semantics;
• Tools: includes a set of technical services and pre-processing tools as well as machine and deep learning utilities;
• Resources: provides services for maintaining, consulting and managing Arabic language resources such as corpora, dictionaries and ontologies;
• Application: contains high-level applications such as sentiment analysis or Question/Answering systems;
• Client applications: interacts with all other layers to serve clients via web applications, web services, etc.
5 http://www.alecso.org/site/

Standards
To comply with international standards and to facilitate the use of SAFAR components in different contexts, we follow interoperability guidelines for all of them. The inputs and outputs of SAFAR tools, as well as its LRs, are formatted using the XML representation standard. In addition to this representation standard, we use structuring standards such as the Arab League Educational, Cultural and Scientific Organization (ALECSO) 5 recommendations for the design of Arabic morphological analyzers, the Lexical Markup Framework (LMF, ISO 24613:2008) for lexicons and the Text Encoding Initiative (TEI) (Lou Burnard et al. 2008) for corpora.

SAFAR content
As previously explained, the content of SAFAR is split into three main packages: MSA, Dialects and Machine learning models. Since dialects are numerous, we have so far integrated only the Moroccan dialect, although the architecture is flexible enough to embed any other dialect.

MSA
This package is the most populated one. Indeed, for almost two decades the research community has concentrated its efforts on developing components (tools and resources) for this type of Arabic. Table 1 shows all the integrated tools for MSA 6 . These tools have been widely used by the ANLP community, and it is advantageous to use them within a homogeneous and flexible framework. Tools whose names start with "SAFAR" in the table, such as "SAFAR stemmer" and "SAFAR POS tagger", have been developed from scratch by our research team for one of the following reasons: 1) available tools return incorrect results, 2) there are no similar tools within the community, or 3) existing tools cannot be reused in several technical environments. In addition, the integration of multiple implementations for the same layer allows their benchmarking. Thus, we were able to make a detailed evaluation and/or comparison of stemmers (Jaafar and Bouzoubaa, 2016), morphological analyzers and parsers (Jaafar and Bouzoubaa, 2017). The "Per" column indicates how many researchers have been involved in the development or integration of the corresponding tool. The "Vr" column indicates the SAFAR version in which the tool first appeared. Users are invited to be aware of the third-party licenses of the integrated tools and to respect them.
Table 2 shows all the integrated resources for MSA. The LR building process is based on the structure of the Arabic language. Concatenative inflection means that the lemma is concatenated with affixes to produce the stem, which in turn is concatenated with clitics to yield the word. According to its features, a lemma is either a verb, a noun or a particle. From this, we identify the basic components taking part in the composition of Arabic words: lemmas (particle, verb and noun), stems and clitics. SAFAR follows this structure for its lexical resources and contains the three basic alphabets (Loukili and Bouzoubaa 2011, Namly et al. 2016), clitics (Namly et al. 2015) and a particles lexicon. We also make use of existing and known dictionaries (Contemporary and Interactive). It is worth mentioning that SAFAR currently contains one of the most comprehensive lexicons, with more than 7 million stems and their corresponding lemmas (Namly et al. 2019).
In addition, because of the importance of ontologies in many NLP processes, we enriched and integrated the existing Arabic WordNet (AWN) (Abouenour et al. 2013). We note that the enriched AWN is approved as the official version of the Global WordNet Association 7 . We also developed and integrated corpora used as reference and evaluation corpora. As mentioned above, these corpora are exploited to benchmark the integrated tools at the stemming and morphological levels. SAFAR resources are freely available to the community and can be downloaded from our team website 8 . To contribute to their wide dissemination within the community, we advertise SAFAR resources in well-known catalogs and repositories such as the European Language Resources Association (ELRA) 9 and the Common Language Resources and Technology Infrastructure (CLARIN) 10 . Finally, let us mention that a more detailed survey and a software engineering comparative study with similar Arabic frameworks can be found in
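The two concatenation steps described above (lemma plus affixes gives the stem, stem plus clitics gives the word) can be illustrated with a minimal Java sketch. The class and method names are hypothetical and are not part of the SAFAR API, and the example word is ours; real Arabic inflection also involves changes that plain concatenation does not capture.

```java
// Illustrative only: these names are hypothetical and not part of the SAFAR API.
final class WordComposition {

    /** Lemma plus inflectional affixes yields the stem. */
    static String buildStem(String prefix, String lemma, String suffix) {
        return prefix + lemma + suffix;
    }

    /** Stem plus clitics yields the surface word. */
    static String buildWord(String proclitic, String stem, String enclitic) {
        return proclitic + stem + enclitic;
    }

    public static void main(String[] args) {
        // Lemma "كتب" (to write) with a prefix and a suffix gives a stem.
        String stem = buildStem("ي", "كتب", "ون");
        // The stem with a proclitic (future marker) and an enclitic (object pronoun) gives a word.
        String word = buildWord("س", stem, "ها");
        System.out.println(stem + " -> " + word);
    }
}
```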

Moroccan Dialect
Besides processing standard Arabic, we take into consideration the informal Moroccan Arabic dialect (MD). Regarding resources, a Moroccan Dialect Electronic Dictionary (MDED) (Tachicart et al. 2014) has been developed, containing almost 12,000 entries with useful annotations. Another lexicon is the Moroccan Reference Vocabulary (MRV) (Tachicart et al. 2019), which compiles 4.5M possible Moroccan words following a normalization guideline. A corpus for language identification tasks is also available with SAFAR. It is composed of 57k comments collected from social media and then manually classified into three categories: MSA, MD and code-switched. In addition, a lexicon of orthographic variants covering almost 54% of the MRV has been generated using neural models. It can be useful for several dialectal NLP tasks such as spelling normalization. Table 3 shows all the integrated resources for the Moroccan dialect.
Concerning tools, a language identification system (Tachicart et al. 2018) has been developed and integrated within SAFAR in order to distinguish between MD and MSA. We also developed and integrated a spelling normalization system that converts a given Moroccan dialectal word into its standard form without taking the context of the word into consideration.
7 http://globalwordnet.org/resources/arabic-wordnet/
8 http://arabic.emi.ac.ma/alelm/?q=Resources
9 http://www.elra.info/en/

Machine learning models
Our tools combine the rule-based approach (rules embedded in lexicons or hard-coded) with the ML approach. SAFAR therefore includes a set of popular ML libraries (Table 4) geared at different purposes, so that these tasks need not be performed outside the framework. For instance, the SAFAR POS tool used Weka to output a decision tree model (Tnaji et al., 2020), the SAFAR lemmatizer used an HMM (Namly et al., 2020), and the spelling normalization for the Moroccan dialect used fastText (Tachicart and Bouzoubaa, 2019). Consequently, a researcher using SAFAR can write code that calls all the integrated Arabic NLP tools and resources and also exploits the integrated ML libraries.
Name                 Type   Per  Vr
Hidden Markov model  Model  3    3
Language model       Model  2    3
Levenshtein          Model  2    3
Weka                 Tool   1    3
FastText             Tool   1    3
Table 4: Machine learning models and tools in SAFAR
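Among the entries of Table 4, the Levenshtein distance is simple enough to sketch in a few lines. The following Java code is the textbook dynamic-programming formulation, given only as an illustration; it is not SAFAR's own implementation, and the class name is hypothetical.

```java
// Textbook edit distance with two rolling rows; not SAFAR's own code.
final class Levenshtein {

    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building b's prefix from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```

The comparison is made on Java `char` values, which is sufficient for Arabic letters since they lie in the Basic Multilingual Plane.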

SAFAR use and exploitation
As previously mentioned, SAFAR tools and integrated resources can be exploited either as an API or from client applications.

API
For each level of processing, we standardize all aspects shared by the same type of tools by means of APIs and models, so that the tools become homogeneous and flexible to use. This ensures standardization inside SAFAR. Users have several possibilities when calling methods, by specifying appropriate parameters according to their needs. Running a normalizer within SAFAR can be as simple as calling "normalizer.normalize(text)". If the normalization needs to be customized, overloaded methods can be called. It is worth mentioning that when developing the SAFAR API 12 , we fully respect "Checkstyle" 13 and "FindBugs" 14 , two development tools that help in adhering to coding standards. Users can also easily create customizable pipelines in which the output of one component is the input of another (Jaafar and Bouzoubaa, 2015). All these aspects of SAFAR help to solve SE issues, especially interoperability, reuse and flexibility of exploitation.
As shown in Figure 2, at line 3 we specify the input text. At line 5, we call the "SAFARNormalizer" tool to normalize the text. At line 7, we call the SAFAR "IParticleService" (Namly et al. 2015) in order to delete stop words. At line 10, we instantiate the "SAFARTokenizer" tool, which takes a text as input and outputs all the tokens of the text. At line 13, we stem the tokens by calling the "IStemmer" service, specifying the Light10 stemmer in this case. At line 18, we call "ILexiconService" to detect the sentiments of the stems and then print the sentiment of each word according to the predefined lexicon. To run the whole process with another stemmer, one keeps the same code and changes only line 13, for example to ".getKhojaImpletation".
12 http://arabic.emi.ac.ma/safar-api/SAFAR_v3.jar
13 https://checkstyle.sourceforge.io/
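The pipeline idea, and the way a single line selects the stemmer, can be sketched in plain Java as follows. This is not SAFAR code: every type, method and word list below is a hypothetical stand-in using toy English data, and only the order of the steps (normalize, remove stop words, tokenize, stem, look up sentiment) follows the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these types or methods belong to the SAFAR API.
final class PipelineSketch {

    /** One stemming contract, several interchangeable implementations. */
    interface Stemmer {
        String stem(String token);
    }

    /** Toy "light" stemmer: strips one trailing "s" from words longer than three letters. */
    static final Stemmer LIGHT =
            t -> t.endsWith("s") && t.length() > 3 ? t.substring(0, t.length() - 1) : t;

    /** Toy "aggressive" stemmer: keeps at most the first four characters. */
    static final Stemmer AGGRESSIVE = t -> t.length() > 4 ? t.substring(0, 4) : t;

    static final Set<String> STOP_WORDS = Set.of("the", "is", "and");
    static final Map<String, String> SENTIMENT_LEXICON =
            Map.of("good", "positive", "bad", "negative");

    /** Lower-cases the text and replaces everything except letters and whitespace by spaces. */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^a-z\\s]", " ").trim();
    }

    /** Drops stop words from a whitespace-separated text. */
    static String removeStopWords(String text) {
        StringBuilder kept = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                if (kept.length() > 0) {
                    kept.append(' ');
                }
                kept.append(word);
            }
        }
        return kept.toString();
    }

    static List<String> tokenize(String text) {
        return text.isEmpty() ? List.of() : Arrays.asList(text.split("\\s+"));
    }

    /** Chains the steps; the output of each one is the input of the next. */
    static List<String> run(String text, Stemmer stemmer) {
        List<String> labels = new ArrayList<>();
        for (String token : tokenize(removeStopWords(normalize(text)))) {
            String stem = stemmer.stem(token);
            labels.add(stem + "=" + SENTIMENT_LEXICON.getOrDefault(stem, "neutral"));
        }
        return labels;
    }

    public static void main(String[] args) {
        String text = "The movie is good and the actors are bad!";
        System.out.println(run(text, LIGHT));
        System.out.println(run(text, AGGRESSIVE)); // only the stemmer argument changes
    }
}
```

Keeping the stemmer behind one interface is what lets the rest of the pipeline stay untouched when another implementation is selected.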

Web application
For non-developers such as linguists, the SAFAR framework can be used through an online application 15 in which all SAFAR levels are available as online processing services. The website gives the user access to all the tools and resources mentioned above. Results can be either displayed on the same page or downloaded as XML files. As an example, Figure 3 shows the online morphological analysis of the word "يأكلان" (they eat). After selecting the morphological analyzer to use via the drop-down menu (Alkhalil in this case) and clicking on the "Analyze & display" button, the output is displayed in a table format. Furthermore, the language identification system (Tachicart et al. 2018), demonstrated in Figure 4, aims to distinguish between the Moroccan dialect and MSA using two different methods. The first is rule-based and relies on stop word frequency, while the second is statistical and relies on an SVM machine learning classifier.
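The rule-based method can be pictured with a small Java sketch that counts how many tokens of a text belong to each variety's stop-word list. The class name, the tie handling and the two tiny word lists are assumptions made for illustration; they are not the lists or the decision rule used by the SAFAR system.

```java
import java.util.Set;

// Hypothetical sketch of stop-word-frequency language identification.
// The word lists are tiny illustrative samples, not SAFAR resources.
final class StopWordLanguageId {

    static final Set<String> MSA_STOP_WORDS = Set.of("الذي", "هذا", "لكن", "ماذا");
    static final Set<String> MD_STOP_WORDS = Set.of("ديال", "بزاف", "واش", "هاد", "شنو");

    /** Returns "MSA", "MD", or "UNKNOWN" when neither list dominates. */
    static String identify(String text) {
        int msa = 0;
        int md = 0;
        for (String token : text.trim().split("\\s+")) {
            if (MSA_STOP_WORDS.contains(token)) {
                msa++;
            }
            if (MD_STOP_WORDS.contains(token)) {
                md++;
            }
        }
        if (msa == md) {
            return "UNKNOWN"; // assumption: ties are left undecided
        }
        return md > msa ? "MD" : "MSA";
    }

    public static void main(String[] args) {
        System.out.println(identify("واش هاد الكتاب ديال محمد"));
        System.out.println(identify("هذا هو الكتاب الذي قرأته"));
    }
}
```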

Conclusion
SAFAR is a monolingual framework dedicated to the Arabic language. It serves as a repository and a collaborative space where developers of Arabic tools and resources can meet and share their products. It is in its second decade of existence and integrates more than 50 tools and resources. The next steps of our journey are to:
• Concentrate on less considered layers such as semantics and applications;
• Integrate and develop other tools and resources for dialects and standard Arabic;
• Build bridges with multilingual frameworks or frameworks for other languages, for developers interested in considering more than one language in their projects, for tasks such as machine translation.