ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic

This paper presents ArBanking77, a large Arabic dataset for intent detection in the banking domain. The dataset was arabized and localized from the original English Banking77 dataset, which consists of 13,083 queries, into 31,404 queries in both Modern Standard Arabic (MSA) and Palestinian dialect, with each query classified into one of 77 classes (intents). Furthermore, we present a neural model, based on AraBERT and fine-tuned on ArBanking77, which achieved F1-scores of 0.9209 and 0.8995 on MSA and Palestinian dialect, respectively. We performed extensive experimentation in which we simulated low-resource settings, where the model is trained on a subset of the data, and augmented the queries with noise to simulate the colloquial terms, mistakes, and misspellings found in real NLP systems, especially live chat queries. The data and the models are publicly available at https://sina.birzeit.edu/arbanking77.


Introduction
Intent detection falls under natural language understanding (NLU) and aims at parsing the semantics of the user input in order to generate the best response. Intent representation is a mapping between the user request and the actions the chatbot triggers (Adamopoulou and Moussiades, 2020). Intent detection is typically considered a classification task, where each utterance is associated with one, and sometimes multiple, intents (Figure 1). Intent detection can be a challenging problem: utterances during a chat are usually short, providing only brief context to rely on when predicting the intent, and the label space can be very large, requiring massive data annotation. In this paper, we present an Arabic intent dataset and a Bidirectional Encoder Representations from Transformers (BERT) based intent detection model.
The Arabic corpus presented in this paper is based on Banking77, an English question-intent corpus for banking (Casanueva et al., 2020). Banking77 includes 13,083 queries, each classified into one of 77 intents. We first arabized the English Banking77 by providing an MSA version of each of the 13,083 queries, resulting in 15,537 MSA queries (some queries have more than one MSA variation). The arabization was done semi-automatically: we first used Google Translate and then manually verified and revised each query. Second, each query was manually re-written in the Palestinian dialect, resulting in 15,867 queries, which makes the data linguistically more representative along various dimensions including phonology, morphology, lexicon, and syntax (Haff et al., 2022; Jarrar et al., 2017). The final dataset contains 31,404 queries, which we used to train a BERT-based model on the intent detection task.
The rest of the paper is organized as follows: section 2 reviews the related work, section 3 presents the ArBanking77 corpus including data arabization and localization, section 4 presents the model architecture and training, section 5 presents the results for intent detection, section 6 presents our conclusion and section 7 states limitations.

Related Work
Arabic has a limited number of available labeled datasets, especially for dialectal and domain-specific tasks (Darwish et al., 2021; Naser-Karajah et al., 2021). Due to this data scarcity, research on Arabic intent detection is almost non-existent. Others have made the same observation: conversational machine learning systems in Arabic are limited due to a deficiency of datasets (Fuad and Al-Yahya, 2022), and Arabic conversational systems lag behind in applying the latest technology (Ahmed et al., 2022).
One of the closest works to Arabic intent detection is proposed in (Mezzi et al., 2022). The authors proposed an intent detection model for the mental health domain in Tunisian Arabic. The idea is to classify the patient's utterance or concern into five aspects: depression, suicide, panic disorder, social phobia, and adjustment disorder. The dataset was collected by simulating a real-life psychiatric interview, where a 3D human avatar plays the doctor and asks the patient questions in Tunisian Arabic. The patient, in return, interacts with the avatar by answering the questions vocally, and the audio is then transcribed to text. The authors used BERT as the encoder and added five binary classifiers, one per intent, achieving a 0.94 F1-score. Hijjawi et al. (2013) classified question and non-question utterances in chatbots. Decision trees were used to perform the classification, and the model was integrated into ArabChat (Hijjawi et al., 2014) to classify utterances before processing them. Joukhadar et al. (2019) published a corpus in the Levantine Arabic dialect consisting of 873 sentences manually tagged with one of eight acts (greetings, goodbye, thanks, confirm, negate, ask/repeat, ask for alternative, and apology). The authors tried two types of features, Term Frequency-Inverse Document Frequency (TF-IDF) and n-grams, experimented with multiple classifiers, and concluded that a Support Vector Machine (SVM) with 2-gram features performed best at 0.86 accuracy. Elmadany et al. (2018) introduced a speech-act recognition and sentiment dataset (ArSAS). About 21K tweets were collected and manually labeled with two types of classes: speech-act and sentiment. Speech-act labels include expression, assertion, and question, while the sentiment labels are negative, positive, neutral, and mixed. Algotiml et al. (2019) trained two models on the ArSAS dataset, a Bidirectional Long Short-Term Memory (BiLSTM) network and an SVM, and achieved an accuracy of 0.875 and a macro F1-score of 0.615. Zhou et al. (2022) proposed contrastive-based learning for out-of-domain data and tested the performance on multiple datasets, including the Banking data (Casanueva et al., 2020); they demonstrated improvement on out-of-domain data without sacrificing performance on in-domain data.
Another related language for which intent detection has been studied is Urdu. In (Shams et al., 2019), the authors translated the Air Travel Information System (ATIS) (Hemphill et al., 1990) and AOL datasets from English to Urdu and performed intent detection using a combination of CNN, LSTM, and BiLSTM models. For ATIS, the CNN performed best at 0.924 accuracy, while for AOL, the BiLSTM achieved the highest performance at 0.831 accuracy. In later work, the authors improved the accuracy to 0.9112 (Shams and Aslam, 2022). ATIS was also used for intent detection in the Indonesian language (Bilah et al., 2022), with a reported accuracy of 0.9584 using a CNN-based model. Basu et al. (2022) utilized Snips (Coucke et al., 2018) and ATIS to train a meta-learning approach with contrastive learning for intent detection and slot-filling. The Snips dataset covers multiple domains including restaurants, books, weather, and music, making it more challenging than ATIS. The data was collected using the Snips personal assistant and contains 16K queries labeled with 7 intents.
The reader may have already noticed that we could not find relevant work on Arabic intent detection or any labeled Arabic intent datasets. In this paper, we attempt to address both issues: an Arabic intent corpus and intent recognition. We present ArBanking77, an Arabic intent dataset, which was arabized and localized from the English Banking77 dataset (Casanueva et al., 2020). ArBanking77 was also augmented with thousands of additional MSA and Palestinian-dialect queries, resulting in a final dataset of 31,404 queries and 77 intents. ArBanking77 was used to fine-tune a BERT-based model, achieving F1-scores of 0.9209 and 0.8995 on MSA and Palestinian dialect, respectively.

The ArBanking77 Corpus
The ArBanking77 corpus is derived from the Banking77 dataset (Casanueva et al., 2020). Banking77 was arabized and localized into ArBanking77 by 26 annotators through multiple phases and over several months. Each query in Banking77 has at least two corresponding queries in ArBanking77 (at least one written in MSA and one in the Palestinian dialect).

Phase I: Arabization and Localization
The first step was the translation of Banking77 from English into MSA. We used the Google Translate API to translate the 13,083 queries. For each original English query j, where 1 ≤ j ≤ m and m = 13,083, we form the tuple (q_j^i, q_j^En, q_j^MSA1, q_j^MSA2, q_j^PAL1, q_j^PAL2), where q_j^i is the query's intent, q_j^En is the original English query from Banking77, q_j^MSA1 is the MSA translation, q_j^MSA2 is a second MSA query, q_j^PAL1 is the Palestinian query, and q_j^PAL2 is a second Palestinian query.
Each annotator was asked to understand the English query and its intent, then: (i) review q_j^MSA1 and revise it if needed; (ii) optionally write q_j^MSA2; (iii) write a q_j^PAL1 query; and (iv) optionally write a q_j^PAL2 query. The annotators performed these steps according to the following arabization and localization guidelines: • q_j^MSA1 should be revised in case of incorrect translation. We also ensured the translation is adapted to the banking domain. For example, transfer was incorrectly translated as naql (to ship) instead of taḥwīl (money transfer); activate was translated as tanšīṭ, which is not semantically wrong, but it should be tafʿīl, as that is the common term used in the banking domain. The total number of revised translations is 2,104 (~16%).
• q_j^MSA2 is optionally written by the annotator if there is a need to add an extra formulation of the MSA query. For example, Personal Identification Number might be translated in q_j^MSA1 as ( ) and as ( ) in a second formulation in q_j^MSA2.
• q_j^PAL1 is the formulation of the query in the Palestinian dialect, reflecting the terminology Palestinians naturally use in banking services.
• q_j^PAL2 is optionally written by the annotator if there is a need to add an extra formulation of the query in the Palestinian dialect.
This phase was carried out by 26 annotators, all 3rd- and 4th-year college students. Each annotator was given about 500 q_j^En queries and their translations (q_j^MSA1) to revise. Based on q_j^En and q_j^MSA1, annotators also provided q_j^MSA2, q_j^PAL1, and q_j^PAL2. When generating PAL queries, annotators had access to both the English and MSA queries, which may bias the PAL query towards MSA. However, we verified that this is not a concern, as the measured lexical overlap between MSA and PAL is small (Section 3.3). Furthermore, in order to diversify the queries, we avoided having all queries in one intent reviewed and written by a single annotator. Instead, each intent was divided among multiple annotators, usually 2-5.

Phase II: Review
To control and verify the quality of the data generated in Phase I, we performed a final manual review. Each of the 26 annotators employed in Phase I was assigned a set of queries to review. On average, three intents were assigned to each reviewer, and we ensured that all queries belonging to one intent were assigned to the same reviewer. To increase data-labeling consistency, we added the constraint that the classes assigned to one reviewer should be relevant to each other (e.g., card arrival, card linking, card activation). Each reviewer was asked to pay attention to the following issues: (i) the MSA and Palestinian queries should be acceptable, semantically correct, and well-formulated; (ii) all queries in one intent belong to that intent and not to other intents (labeling consistency); and (iii) spelling mistakes are left untouched in order to simulate the common errors and noise found in real NLP systems, especially in live chat queries.
Once the review was complete, we revised duplicate queries by introducing additional variations to make them unique. Duplicate queries can arise from many-to-one translations, i.e., multiple English queries translated into the same Arabic query (see examples in Table 2).
Our final ArBanking77 dataset (Table 3) consists of 31,404 queries in total, 2.4x larger than the Banking77 dataset. On average, there are 408 queries per intent (202 MSA queries/intent and 206 Palestinian queries/intent). We further divided the training data into train and validation sets by sampling 90% of the queries in each class into the training set and placing the remaining 10% in the validation set. This is contrary to the train/test-only split in (Casanueva et al., 2020), in which the authors cited the small data size as the reason for not introducing a validation set.
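The per-class 90/10 split described above can be sketched as follows (a minimal illustration; function and variable names are ours, not from the released code):

```python
import random
from collections import defaultdict

def stratified_split(queries, train_frac=0.9, seed=0):
    """Split (text, intent) pairs into train/validation sets,
    sampling train_frac of the queries within each intent class
    so every intent keeps the same train/validation ratio."""
    rng = random.Random(seed)
    by_intent = defaultdict(list)
    for text, intent in queries:
        by_intent[intent].append(text)
    train, val = [], []
    for intent, texts in by_intent.items():
        rng.shuffle(texts)
        cut = int(len(texts) * train_frac)
        train += [(t, intent) for t in texts[:cut]]
        val += [(t, intent) for t in texts[cut:]]
    return train, val
```

Stratifying per class, rather than sampling 10% globally, guarantees that every one of the 77 intents is represented in the validation set.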
Table 4 presents some statistics about ArBanking77. From Table 4, we observe that the dialectal queries are shorter than their corresponding MSA queries: in MSA, the average number of words in a query is 9.85, while it is 8.06 in the Palestinian queries. This is expected, as in some cases dialectical Arabic omits interrogative nouns such as ( ), so an MSA query such as ( /are there age requirements?) is phrased in the Palestinian dialect as ( ). In other cases, functional words such as prepositions ( /from or about, /on or above, /to or at, /in or into) are used as prefixes or suffixes. For instance, the phrase ( ) in MSA is ( /l umr) in the Palestinian dialect, where ( ) is used as a prefix in the word ( /l umr). For a discussion of the orthography of Arabic dialects, see (Nayouf et al., 2023; Haff et al., 2022; Jarrar et al., 2014).

Lexical Relation between MSA and PAL
Arabic is a highly diglossic language, meaning that two distinct varieties of the same language are used side by side within a community, a phenomenon found across the Arab countries (Jarrar, 2021). MSA is sometimes significantly different from the colloquial dialects (Jarrar et al., 2023b; Naser-Karajah et al., 2021), to the point of mutual unintelligibility. Because of this, MSA and PAL differ in many ways, making it harder to apply MSA NLP tools to PAL. In this section, we study the lexical difference between MSA and PAL, although the differences extend beyond the lexicon to morphology, phonology, orthography, semantics, and syntax.
To measure the lexical overlap between MSA and PAL, we computed the Jaccard index for each parallel pair (MSA and PAL) and averaged the results across the entire dataset. We found a mean Jaccard index of 0.16, with a median of 0.13 and a standard deviation of 0.13. Others have also studied the lexical overlap between MSA and PAL and reported similar results. For instance, (Kwaik et al., 2018) measured the overlap between MSA and other dialects, including PAL, on two parallel datasets, the Parallel Arabic Dialect Corpus and Multi-Dialectal Arabic, and reported Jaccard indices of 0.19 and 0.16, respectively. This shows that for diglossic languages such as Arabic, training on one variety does not necessarily transfer to another. Later, in Section 5.1, we explore zero-shot learning to illustrate the effect of lexical differences on model performance.
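The word-level Jaccard index used above can be computed as follows (a minimal sketch; the paper does not specify its tokenization, so simple whitespace splitting is assumed here):

```python
def jaccard_index(msa_query: str, pal_query: str) -> float:
    """Word-level Jaccard index between two parallel queries:
    |intersection| / |union| of their whitespace token sets."""
    a, b = set(msa_query.split()), set(pal_query.split())
    if not a and not b:
        return 0.0  # define the empty/empty case as no overlap
    return len(a & b) / len(a | b)
```

Averaging this score over all parallel MSA/PAL pairs yields the mean overlap reported above (0.16 on ArBanking77).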

Intent Detection Model
We fine-tuned a BERT-based model on the intent detection task using the ArBanking77 dataset. In this section, we go over the model details.

Model Architecture
Our model is based on BERT, a transformer-based language representation model for natural language processing (Devlin et al., 2018). BERT was developed by Google in 2018 as a solution for the most common language tasks, such as sentiment analysis, named entity recognition, and question answering. BERT is built using transformers, a deep learning architecture that solves sequence-to-sequence tasks in NLP and relies on an attention mechanism that learns the alignment between words in a given sequence. Transformers include two components: an encoder that encodes the input text and a decoder that produces a prediction for the task, such as predicting a masked token or the next sentence. In this paper, the BERT encoder is fine-tuned on the Arabic intent detection task using the ArBanking77 dataset.

Table 2: Examples of many-to-one English-Arabic translation, e.g., "Can you tell me the restrictions for the disposable cards?" and "Can you please inform me of the restrictions for the disposable cards." map to the same Arabic query, as do "How is an exchange rate calculated?" and "How are your exchange rates calculated?".
For intent detection, a single linear layer was added on top of the BERT transformer layers to perform the intent classification.

Model Training
We fine-tuned multiple pre-trained transformer models, which are discussed in the next section. The hyperparameters we searched are: learning rate, 1e-5 < η < 5e-5, and batch size, B ∈ {16, 32, 64}. We ran approximately 30 experiments, with an average run-time per experiment under 2 hours, depending on model parallelism. The best-performing hyperparameters were η = 4e-5 and B = 64, with a maximum sequence length of 128, a maximum of 20 epochs, and early termination if there is no improvement on the validation data after five epochs. Model training was performed on an Nvidia Tesla P100 16GB GPU.
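The early-termination rule above (stop after five epochs without validation improvement) can be sketched as a small helper; this is an illustrative reconstruction, not the paper's released training code:

```python
class EarlyStopper:
    """Stop training when the validation metric has not improved
    for `patience` consecutive epochs (five in our experiments)."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's validation metric; return True to stop."""
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In the training loop, `step` would be called once per epoch with the validation F1-score, breaking out of the loop when it returns True (or when the 20-epoch cap is reached).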

Experiments and Results
We ran multiple experiments with different models and data configurations.In section 5.1, we evaluate zero-shot learning, section 5.2 benchmarks multiple pre-trained transformer models on Arabic data, section 5.3 simulates low-resource settings and section 5.4 simulates different spelling errors that are commonly found in the Arabic language.We report the model performance on the test set using macro F1, precision and recall scores.
When training the models on the full dataset, we used the train, validation, and test split listed in Table 3, where 21,559 queries were used for training and 2,464 served as the validation set. In the low-resource settings, we experimented with different training and validation data sizes (Section 5.3), but the test set size remained 7,381 queries. In the noise and error simulation experiments, we used the same 7,381-query test set, but errors were injected into the test queries as explained in Section 5.4.
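Macro F1, the headline metric above, averages per-class F1 with equal weight, so rare intents count as much as frequent ones. A minimal reference implementation (equivalent to scikit-learn's `f1_score` with `average="macro"`, written out here for clarity):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute precision/recall/F1 per class,
    then average the per-class F1 scores with equal weight."""
    labels = set(y_true) | set(y_pred)
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

With 77 intents of roughly equal size, macro and micro averages are close, but macro F1 still penalizes a model that neglects any single intent.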

Zero-Shot Cross-Lingual Transfer Learning
In some cases, zero-shot cross-lingual transfer learning can yield good results and may help avoid manual data annotation. In this section, we study how zero-shot cross-lingual transfer learning performs on both MSA and PAL using multilingual BERT (mBERT) (Devlin et al., 2018) and GigaBERT (Lan et al., 2020). mBERT is trained on 104 languages including Arabic, whose portion is based on MSA data from Wikipedia, amounting to less than 1.4 gigabytes and only 7,292 tokens (Alammary, 2022).
GigaBERT was trained for Arabic NLP tasks and English-to-Arabic zero-shot transfer learning. The data contained about 13 million articles from different sources and was augmented with code-switched samples to improve cross-lingual learning.
In one set of experiments, we evaluated zero-shot cross-lingual transfer learning on the PAL test set by fine-tuning mBERT on the ArBanking77 MSA training data, which yielded a 0.5968 F1-score (Table 5). In the second set, we performed zero-shot cross-lingual transfer learning on both MSA and PAL by fine-tuning GigaBERT and mBERT on the English Banking77 training data. On MSA, GigaBERT and mBERT achieved 0.5047 and 0.1774 F1-scores, respectively. The performance is even lower on PAL, with GigaBERT and mBERT at 0.3507 and 0.0903, respectively. These experiments demonstrate that the performance of multilingual pre-trained models falls behind on MSA and is significantly lower on dialectical Arabic, which underscores the need for MSA and dialectical Arabic data annotation.

Pre-Trained Transformers Benchmark
As we observed in the previous section, multilingual pre-trained transformers did not perform well on MSA and PAL. In this section, we evaluate various Arabic pre-trained transformer models, in addition to mBERT, on the ArBanking77 dataset. We benchmark the following models:
• AraBERT (Antoun et al., 2020): trained on two major datasets, Abu El-Khair, a 1.5B-word Arabic corpus (El-Khair, 2016), and the Open Source International Arabic News Corpus (OSIAN), which consists of 3.5 million articles (1B tokens) from 31 news sources in 24 Arab countries (Zeroual et al., 2019). The final size of the AraBERT dataset is 70M sentences, corresponding to about 24GB of text.
• ARBERT (Abdul-Mageed et al., 2021): trained on 61GB (6.5B tokens) of MSA text from books, news articles, Gigaword (Parker et al., 2011), the Open Super-large Crawled Almanach coRpus (OSCAR) (Ortiz Suárez et al., 2019), OSIAN, and the Arabic Wikipedia (Attardi, 2015).
• MARBERT (Abdul-Mageed et al., 2021): trained on dialectical Arabic collected from Twitter.
• MARBERTv2 (Abdul-Mageed et al., 2021): trained on the ARBERT MSA data in addition to dialectical Arabic, with a longer sequence length, more training epochs, and a total of 29B tokens.
• QARiB (Abdelali et al., 2021): the Qatar Computing Research Institute (QCRI) Arabic and Dialectal BERT, trained on Arabic Gigaword Fourth Edition (1B words), the Abu El-Khair corpus (1.5B words), and Open Subtitles (0.5B words).
• CAMeLBERT-Mix (Inoue et al., 2021): trained on a mix of MSA data that includes Gigaword Fifth Edition, the Abu El-Khair corpus, OSIAN, the Arabic Wikipedia, and OSCAR; dialectical Arabic covering the Levantine and Gulf regions; and a subset of the OpenITI corpus (Nigst et al., 2020).
Results for these models are presented in Table 6, sorted by the PAL test F1-score. AraBERTv2 gives the best F1-scores on both MSA and PAL, 0.9209 and 0.8995, respectively. In the remaining experiments, we use AraBERTv2, given that it achieved the best performance.
Those results are based on fine-tuning the models on the manually reviewed translations. To see whether the manual review of the translations improves model performance, we fine-tuned two additional AraBERTv2 models: one on the original machine-translated data and one on the manually reviewed data. Note that both training datasets contain MSA-only data, since Google Translate produces MSA translations. Fine-tuning on the original translations results in F1-scores of 0.9099 and 0.7945 for MSA and PAL, respectively; with the manually reviewed data, the F1-scores are 0.9117 and 0.7918. The difference is very small, yet it was important to review the translations to adapt them to the banking domain.

Low-Resource Simulation
This section investigates the impact of the training set size on model performance. Since data labeling is typically expensive, it is important to estimate the number of samples needed to achieve acceptable accuracy. We conducted several experiments with different training data sizes: 20% (of the training queries per intent, randomly sampled), 50%, and 100% (the entire training set). Throughout all experiments, we evaluated our model on the same test set of 7,381 queries.
Results in the different low-resource settings are presented in Table 7. The average increase in F1-score as we increase the training data size is about 2.26% on the MSA test set and 3.16% on the PAL test set, which indicates that the impact of training set size is more noticeable on dialectical Arabic. We also notice that the performance on the PAL test set is consistently lower than on the MSA test set. The performance gap between MSA and PAL is 2.14%, 2%, and 3.95% F1-score when training with 100%, 50%, and 20% of the data, respectively. The largest gap is at the lowest setting (20%); after that, the gap stabilizes. The lower performance on dialectal data could be due to AraBERT (Antoun et al., 2020) not being sufficiently exposed to the Palestinian dialect during the pretraining phase. In general, dialectical Arabic is typically noisier and does not follow a consistent orthography the way MSA does. Surprisingly, the performance on the MSA and PAL test sets using only 20% of the training data is impressive, at 0.8758 and 0.8363 F1-scores, respectively. This suggests that acceptable performance can also be expected on other low-resource Arabic dialects for the intent detection task.

Noise and Error Simulation
Colloquial words, misspellings, and word variations present a challenge to chatbots. Therefore, in this section, we measure the robustness of our dataset and model. We experimented with three types of error and noise simulation: (1) common spelling errors (sim_c), (2) simulated errors (sim_s), and (3) keyboard-related errors (sim_k); see Appendix A for details.
We performed experiments with and without training data augmentation. In the augmented case, the train and test sets were augmented in slightly different fashions. For training, about 50% of the queries were augmented with sim_s and the other 50% with sim_k. The original data was combined with the augmented data, resulting in 43,118 queries in the training set. We evaluated the model on three versions of the test set: one with sim_c errors injected into each query, a second with sim_s, and a third with sim_k.
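Keyboard-related noise of the sim_k kind can be sketched as character substitution from a key-adjacency map. This is an illustrative reconstruction only: the adjacency map below covers just three Arabic keys and the `rate` parameter is our own; the actual error model is specified in Appendix A.

```python
import random

# Illustrative adjacency map for a few Arabic keyboard keys
# (hypothetical; not the full layout used in Appendix A).
ADJACENT = {
    "ا": ["ل", "ت"],
    "ل": ["ا", "ب"],
    "ب": ["ل", "ي"],
}

def inject_keyboard_noise(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Replace each mapped character with a neighboring key with
    probability `rate`, simulating keyboard-related typos (sim_k-style)."""
    rng = random.Random(seed)
    out = []
    for ch in query:
        if ch in ADJACENT and rng.random() < rate:
            out.append(rng.choice(ADJACENT[ch]))
        else:
            out.append(ch)
    return "".join(out)
```

Seeding the generator makes the injected noise reproducible, so the same noisy test set can be reused across all model configurations.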
Results of the combined low-resource and error simulations are summarized in Table 8. Due to the number of experiments, we report only the macro F1-score. We see a trend similar to the results presented in Section 5.3: the model performance on the PAL test set is consistently lower than on the MSA test set across all experiments. We also notice that the model is more sensitive to some of the errors introduced into the test set.
We performed the experiments using two trained models, with and without training augmentation. Both models show similar behaviour: the average drop in performance when reducing the training set size is about 3.38% on PAL-sim_c across all data settings, compared to 2.37% on MSA-sim_c. A similar pattern is observed on PAL-sim_k and MSA-sim_k, with average performance drops of 3.39% and 2.16%, respectively. However, we see lower performance on PAL-sim_s, with an average drop in F1-score of 4.2%, compared to 2.19% on MSA-sim_s. From this, we learn that the model performance is stable on MSA regardless of the type of injected errors, whereas on PAL we see more volatility and sensitivity when injecting sim_s errors. These findings reveal that BERT is more susceptible to the removal of spaces in dialectical Arabic, since that results in combining two or three tokens into one. This issue is exacerbated further in dialectical Arabic because it lacks the consistent orthography of MSA.
Despite those results, we see that augmenting the training data did help close the performance gap between PAL and MSA. Figure 2 zooms in on the performance on MSA-sim_s and PAL-sim_s with and without training augmentation. Three observations from Figure 2: (1) MSA performance is better than PAL regardless of data augmentation; (2) augmenting the training data closes the performance gap between PAL-sim_s (augmented) and MSA-sim_s (without augmentation); (3) the average F1-score gain after training with augmented data is larger on PAL-sim_s (4.12%) than on MSA-sim_s (2.2%), and the improvements are less noticeable on sim_c and sim_k. Figure 3 shows that training data augmentation does not hurt performance on the clean MSA and PAL test sets. On the contrary, at the lowest-resource setting the augmented model outperformed the non-augmented one on MSA and PAL by 0.43% and 0.58%, respectively. At the 50% and 100% settings, the augmented and non-augmented models' performance converges on both MSA and PAL.

Conclusion
In this paper, we presented the ArBanking77 dataset, consisting of banking-domain queries in both MSA and the Palestinian dialect. As far as we know, ArBanking77 is the first Arabic intent detection dataset in the banking domain. The dataset contains 31,404 queries and 77 intents. The data was used to fine-tune a BERT-based model for the intent detection task, resulting in F1-scores of 0.9209 on MSA and 0.8995 on PAL. We also simulated low-resource settings and found that the model is robust: with only 20% of the data, model performance on PAL and MSA dropped by only 6.32% and 4.51%, respectively. We noted that training data augmentation does not negatively affect model performance on the clean MSA and PAL test sets; in fact, at the lowest-resource setting (20%), the augmented model outperformed the non-augmented model on both MSA and PAL.
We performed additional data augmentation to simulate the errors, misspellings, and other mistakes common in real NLP systems. We observed that the accuracy on PAL-sim_s suffers greatly when the model is trained on 20% of the non-augmented data; augmenting the training data closes the performance gap on PAL-sim_s by about 5%. This indicates that BERT is susceptible to some errors, especially in dialectal Arabic, which has a less consistent orthography than MSA. It is also noticeable that the relative drop in accuracy between the 20% and 50% training sets is much larger than between 50% and 100%. This implies that the negative effect of the introduced errors on dialectical Arabic is inversely proportional to the amount of training data. Finally, based on the low zero-shot performance on MSA and PAL and the slight lexical overlap between them, we conclude that there is an urgent need to annotate MSA and dialectical Arabic data.

Limitations
Our dataset is limited to MSA and the Palestinian dialect and covers only 77 intents. Applying our models and data to dialects other than MSA and PAL may not yield accurate intents. Furthermore, our data covers intents commonly found in traditional banking; additional intents may need to be studied for non-traditional banking, such as Islamic banking. We plan to extend our dataset to cover more Arabic dialects and to obtain data from non-traditional banking institutions in the Arab region to better understand how their intents differ from those of traditional banking. Moreover, we want to explore natural language understanding in the banking domain by combining named entity recognition with intent detection.
We can further improve model performance by adding auxiliary loss functions such as a contrastive loss, which would help align the token representations of the MSA and PAL queries. Furthermore, due to data limitations, the models trained on this data, including Banking77, perform intent classification on a single utterance. In practice, a query has context, the preceding utterances, that can provide an important signal to the model and may lead to better performance.

Figure 1: Example queries and their intents.

Figure 2: MSA-sim_s vs. PAL-sim_s F1-scores in low-resource settings; (Augmented) indicates that the training data was augmented.

Figure 3: MSA vs. PAL clean-set F1-scores in low-resource settings with data augmentation; (Augmented) indicates that the training data was augmented.

Table 1: Statistics of the Banking77 English dataset.

Table 3: Size of ArBanking77.

Table 4: Statistics of the ArBanking77 dataset.

Table 5: Performance of zero-shot learning.

Table 6: Performance of various pre-trained transformers on ArBanking77.

Table 7: Results on the ArBanking77 MSA and PAL test sets in low-resource settings.

Table 8: F1-scores of models trained on the combined MSA and PAL datasets when simulating low-resource settings and different types of noise; "None" refers to the clean dataset, and the percentages in the header indicate the proportion of training data used.