TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten Arabic Varieties

Despite the purported multilingual proficiency of instruction-finetuned large language models (LLMs) such as ChatGPT and Bard, the linguistic inclusivity of these models remains insufficiently explored. Considering this constraint, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) regarding their machine translation proficiencies across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties such as Classical Arabic (CA), Modern Standard Arabic (MSA), and several country-level dialectal variants. Our analysis indicates that LLMs may encounter challenges with dialects for which minimal public datasets exist, but on average are better translators of dialects than existing commercial systems. On CA and MSA, instruction-tuned LLMs, however, trail behind commercial systems such as Google Translate. Finally, we undertake a human-centric study to scrutinize the efficacy of the relatively recent model, Bard, in following human instructions during translation tasks. Our analysis reveals a circumscribed capability of Bard in aligning with human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities.


Introduction
Large language models (LLMs) finetuned to follow instructions (Wei et al., 2021; Wang et al., 2022; Ouyang et al., 2022) have recently emerged as powerful systems for handling a wide range of NLP tasks.

Figure 1: Experimental setup for our evaluation. We evaluate multiple language models on different Arabic varieties.
In spite of drawbacks such as their closed nature, computational costs (Dasgupta et al., 2023), and the biases they exhibit (Ferrara, 2023), closed LLMs remain attractive primarily due to their remarkable performance (Bang et al., 2023a; Laskar et al., 2023a). It is thus important to fully understand the capabilities of these closed models. Although there has been a recent flurry of work attempting to evaluate the ability of LLMs to carry out NLP tasks, many of these models' capabilities remain opaque. This is especially the case when it comes to understanding how LLMs fare on different varieties and dialects of several popular languages and on vital tasks such as machine translation (MT). For example, the extent to which LLMs can handle MT from Arabic varieties into other languages is unknown.
Another challenge is that more recent models, such as Google's Bard, are yet to be evaluated and understood. Bard was released in 41 different languages, which makes it a particularly attractive target for MT evaluation. This is also the case given Google's strong history of investment in MT (Wu et al., 2016a). In this work, we offer a thorough evaluation of LLMs on MT from major Arabic varieties into English (Figure 1). Namely, we evaluate ChatGPT, GPT-4, and Bard on MT of ten Arabic varieties into English. Since there are usually concerns about downstream evaluation data leaking into LLM pretraining corpora, which are collected from the web, we benchmark the models on new test sets that we manually prepare for this work. Our evaluation targets ten diverse varieties of Arabic: Classical Arabic (CA), Modern Standard Arabic (MSA), and several country-level Arabic dialects such as Algerian and Egyptian Arabic.
Bard provides three different drafts for each text input we ask it to translate. The contents of the three drafts are diverse, providing us with excellent contexts for analyzing the degree to which the model adheres to our prompts. We leverage these contexts to carry out a human evaluation study investigating the usefulness of the model, allowing us to reveal a number of limitations of Bard. We carefully analyze these limitations across the different Arabic varieties we target, affording an even better understanding of the model's ability to translate from Arabic.
Overall, our work offers the following contributions: (i) We offer a detailed MT evaluation of instruction-finetuned LLMs on ten diverse varieties of Arabic.
(ii) To the best of our knowledge, our work is the first to assess performance of Bard on NLP tasks in any language, and on Arabic MT in particular.
(iii) We introduce a new manually created multi-Arabic dataset for MT evaluation that has never been exposed to any existing LLM.
(iv) We extensively evaluate Bard through a human study to analyze its behavior in terms of usefulness. We examine how well the model follows human instructions when tasked with translating across ten different Arabic varieties.
The rest of the paper is organized as follows: In Section 2, we review previous research evaluating LLMs on NLP tasks in general and MT in particular. In Section 3, we introduce our newly developed multi-Arabic MT dataset. In Section 4, we describe our evaluation methods. In Section 5, we present our results and the main findings from comparing ChatGPT and Bard to various commercial MT products. In Section 6, we present our human study analyzing Bard's helpfulness, particularly its ability to follow human instructions in MT. We conclude in Section 7.

Related Work
Evaluation of LLMs on NLP tasks. A growing number of works have focused on evaluating ChatGPT and other LLMs on a wide range of NLP tasks. Notably, Laskar et al. (2023a) evaluate ChatGPT on 140 diverse NLP tasks spanning multiple categories. The authors show that although ChatGPT is effective on various NLP tasks, its ability to solve challenging tasks such as low-resource machine translation with standard prompting is very limited. Ziems et al. (2023) evaluate 13 different LLMs, including ChatGPT, on 24 computational social science tasks and find that for many classification tasks ChatGPT is on par with supervised models, while it excels at generation tasks. Qin et al. (2023) evaluate ChatGPT on 20 different datasets spanning seven task categories. They find that ChatGPT is better at solving tasks that require reasoning capabilities but falls behind supervised models on tasks such as sequence tagging.

Evaluating the MT ability of ChatGPT. Both Jiao et al. (2023) and Ogundare and Araya (2023) find that GPT-4 is on par with commercial translation tools for high-resource languages. However, they find the model lags behind for low-resource languages. To fix this issue, the authors propose pivot prompting, where a low-resource source language is first translated into a high-resource pivot language and then from the pivot language into the low-resource target language. Evaluation by Peng et al. (2023) shows that ChatGPT can surpass commercial systems such as Google Translate on many translation pairs. Additionally, Peng et al. (2023) find that adding task- and domain-specific information to the prompt can improve the robustness of the MT system. This observation corroborates the findings of Gao et al. (2023). Zhu et al. (2023) argue that despite being on par with commercial systems, ChatGPT still falls behind fully supervised methods such as NLLB (NLLB et al., 2022) on at least 83% of translation pairs across 202 English-centric translation directions. Guerreiro et al. (2023) study the hallucination phenomenon in MT systems and find that low-resource languages and complex translation scenarios, such as low-resource translation directions, are prone to hallucination. Wang et al. (2023) and Karpinska and Iyyer (2023) show that ChatGPT can match the performance of fully supervised models for document-level translation. Bang et al. (2023b) find that when it comes to translation from high-resource languages into English, ChatGPT is comparable with fully supervised models, but that performance degrades by almost 50% when translating from low-resource languages into English. Huang et al. (2023) propose a prompting technique called cross-lingual-thought prompting (XLT) to improve cross-lingual performance on a wide range of tasks, including MT. Similarly, Lu et al. (2023b) ask ChatGPT to correct its mistakes as a way to improve the model's translation quality. To accurately translate attributive clauses from Japanese to Chinese, a pre-edit scheme is proposed by Gu (2023), which improves translation accuracy by ∼35%. Lu et al. (2023a) propose Chain-of-Dictionary (CoD) prompting to solve rare-word translation issues; prompting with CoD improves the performance of ChatGPT in both X-En and En-X language directions.

Arabic MT. Arabic MT to date has primarily focused on two main themes: translating MSA and translating Arabic dialects.
MSA MT. The development of MSA MT systems has gone through various stages, including rule-based systems (Bakr et al., 2008; Mohamed et al., 2012; Salloum and Habash, 2013) and statistical MT (Habash and Hu, 2009; Salloum and Habash, 2011; Ghoneim and Diab, 2013). There have also been efforts to employ neural machine translation (NMT) (Bahdanau et al., 2014) methods for MSA. For instance, several sentence-based Arabic-to-English NMT systems, trained on different datasets, have been presented in Akeel and Mishra (2014), Junczys-Dowmunt et al. (2016), Almahairi et al. (2016), Durrani et al. (2017), and Alrajeh (2018). Furthermore, researchers have explored Arabic-related NMT systems for translating from languages other than English into MSA, including Chinese (Aqlan et al., 2019), Turkish (El-Kahlout et al., 2019), Japanese (Noll et al., 2019), four foreign languages (Nagoudi et al., 2022a), and 20 foreign languages (Nagoudi et al., 2022b).

Dialectal Arabic MT. A number of works focus on translating between MSA and various Arabic dialects. For instance, both Zbib et al. (2012) and Salloum et al. (2014) combine MSA and dialectal data to build MSA/dialect-to-English MT systems. Sajjad et al. (2013) use MSA as a pivot language for translating Arabic dialects into English. Guellil et al. (2017) target the translation of Algerian Arabic.

ChatGPT for Arabic MT. Khondaker et al. (2023) evaluate ChatGPT against BLOOMZ (Muennighoff et al., 2022) in few-shot settings (0, 1, 3, 5, and 10 shots) on four X-Arabic and two code-mixed Arabic-X pairs. They show that providing in-context examples to ChatGPT achieves comparable results to a supervised baseline. Alyafeai et al. (2023) evaluate ChatGPT and GPT-4 on 4,000 Arabic-English pairs from Ziemski et al. (2016) and find that a SoTA model outperforms ChatGPT and GPT-4 by a significant margin. These works, however, only consider a limited number of Arabic varieties, nor do they conduct a thorough analysis of these systems on MT tasks. Additionally, none of these works evaluates Bard. Our work bridges these gaps by performing a comprehensive evaluation of these systems on a wide range of Arabic varieties. We also conduct our study on novel in-house data that, to the best of our knowledge, is not present in the training data of LLMs such as ChatGPT and Bard.
We provide a summary of the related work, organized in a table, in Appendix B.

Datasets
Our goal is to provide a comprehensive evaluation of MT on ChatGPT (3.5 and 4) and Google Bard, focusing on their performance across ten different varieties of Arabic. These varieties differ across time (i.e., old vs. modern-day usage) and space (e.g., country-level geography), as well as in their sociopragmatic functions (e.g., standard use in government communication vs. everyday street language). Before introducing our dataset, we provide brief background about Arabic and its varieties.

Arabic, the collection of languages spoken by approximately 450 million people across the Arab world, encompasses a broad spectrum of varieties. Classical Arabic (CA), also known as Quranic Arabic, is the language of the Quran (Rabin, 1955). It emerged from the medieval dialects of the Arab tribes and was spoken in and around Mecca roughly 1,500 years ago (i.e., the sixth and seventh centuries AD). CA is considered the most eloquent form of Arabic and is preserved notably in the Holy Quran and pre-Islamic epic poems (Versteegh, 2014). It is often described as exhibiting archaic words, figurative speech, and rhyming sentences that are no longer (or less frequently) used in MSA and dialectal Arabic varieties. Modern Standard Arabic (MSA) (Holes, 2004), by contrast, is deeply rooted in CA but has been simplified to a great extent to accommodate modern uses in literature, poetry, and official statements. MSA additionally serves as the standardized language for formal events, news broadcasts, sermons, and formal communication. We now explain how we acquire our dataset for each Arabic variety.

CA. We manually curate 200 sentences from the Open Islamic Texts Initiative (OpenITI) (Nigst et al., 2020) dataset, namely from the latest 2022.16 version, which includes a collection of premodern Arabic works. This version features a comprehensive library of 10,342 books. The sentences are chosen based on a set of specified criteria: First, we identify books originating from the first and second centuries Anno Hegirae (i.e., in the year of the Hijra), excluding those written after this period; the authors' names and the dates of the books confirm that these works fall within this period. Then, we compile a collection of fifteen distinctive books, including notable works like Abdullah Ibn AlMuqfaa's "Al-Adab Al-Kabir" and "Al-Adab Al-Saghir" and Mohamed Idris Al-Shafi's "Al-Umm", "Al-Risala", and "Al-Adab Wal-Muraa", among others. Subsequently, we extract sentences of a minimum of ten words that convey a comprehensive meaning.

MSA. We collect a total of 200 sentences covering current events from two online news websites: Aljazeera and BBC Arabic. The curated sentences showcase various news genres, including political, social, and sports news.

Various Dialects. We manually select a dataset of dialectal Arabic from an in-house project in which we transcribe TV series collected from YouTube videos belonging to various Arabic dialects (Table 1). Again, we use 200 sentences from each dialect, resulting in a total of 1,600 sentences across eight dialects. The dialects cover North African countries such as Algeria, Morocco, and Mauritania; the Gulf area, namely Emirati and Yemeni Arabic; Levantine Arabic (focusing on Palestinian and Jordanian); and Egyptian Arabic. For all varieties, we collect sentences that are at least ten words long. We present one sample from each dataset in Table 1. Statistics of the datasets across the Arabic varieties are presented in Appendix C.
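To make the length criterion concrete, the following is a minimal sketch of the sentence-length filter described above, under stated assumptions: the file name is hypothetical, and the "conveys a comprehensive meaning" requirement is a manual check that is not automated here.

```python
# Hedged sketch of the length criterion used during sentence selection.
# "openiti_candidates.txt" is a hypothetical file of candidate sentences.

def long_enough(sentence: str, min_words: int = 10) -> bool:
    """Keep sentences with at least `min_words` whitespace-delimited words."""
    return len(sentence.split()) >= min_words

with open("openiti_candidates.txt", encoding="utf-8") as f:
    candidates = [line.strip() for line in f if line.strip()]

# Whether a sentence conveys a comprehensive meaning is judged manually.
selected = [s for s in candidates if long_enough(s)]
```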

Prompt Design
The term prompt refers to the set of instructions used to program an LLM with the goal of steering and enhancing its purpose and capabilities (White et al., 2023). Prompts can influence subsequent interactions with the model as well as its generated outputs. Therefore, it is important to identify the right prompts to obtain the desired outcome for a particular task. To determine the right prompt for our translation task, we set up a pilot experiment that we now describe.

Pilot experiment. In our pilot experiment, we investigate three prompt candidates. To limit the search space, we perform this experiment only with ChatGPT. We experiment with both Arabic and English prompts that concisely instruct ChatGPT to translate from an Arabic variety into English, again restricting our search space to MSA as a variety known to overlap with other varieties at all linguistic levels (Abdul-Mageed et al., 2020; Habash, 2022). We also experiment with an elaborate English prompt that clearly defines the role and objective of ChatGPT before asking the model to carry out the translation task. We then evaluate the performance of ChatGPT on 100 MSA→English samples. We present the prompt templates and the corresponding performance in Table 2.
Evaluation. As is evident, the concise English prompt outperforms the other two prompts, including its Arabic counterpart (by 1∼2 BLEU points). This result substantiates findings in prior work (Khondaker et al., 2023; Lai et al., 2023) regarding the superiority of English prompts over non-English prompts for ChatGPT. Therefore, in the rest of the paper we employ the concise and direct English prompt to conduct our experiments.

N-Shot Experiments
We run ChatGPT MT generation under 0-shot, 1-shot, 3-shot, and 5-shot settings. For a particular translation task, we always select the few-shot examples from the same set of training examples. This means that for a k-shot setting, we make sure that if a training sample is selected, it will also be selected for n-shot settings where n > k. We generate translations with ChatGPT (gpt-3.5-turbo, an optimized version of the GPT-3.5 series; we use the June 13, 2023 snapshot of the model), setting the temperature to 0.0 to ensure deterministic and reproducible results. In addition, we restrict the maximum token length to 512 for all generation tasks. We also manually evaluate ChatGPT with the GPT-4 engine, as well as the Bard model, under the 0-shot condition only, using the web interface of each model.
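As an illustration, the sketch below shows how such a generation call can be issued with the legacy `openai` Python SDK (<1.0). The prompt template follows the one reported in Table 3; the helper function and the exact message layout for few-shot examples are our assumptions, not a released implementation.

```python
# Hedged sketch of our generation settings (temperature 0.0, 512 max tokens)
# using the legacy openai SDK. Assumes openai.api_key is already set.
import openai

PROMPT = "Translate the following text from {variety} Arabic into English: {sentence}"

def translate(sentence: str, variety: str, shots=()):
    """`shots` is a (possibly empty) sequence of (source, translation) pairs;
    the k-shot examples are always a prefix of the n-shot ones (n > k)."""
    messages = []
    for src, tgt in shots:  # one way to supply in-context examples
        messages.append({"role": "user", "content": PROMPT.format(variety=variety, sentence=src)})
        messages.append({"role": "assistant", "content": tgt})
    messages.append({"role": "user", "content": PROMPT.format(variety=variety, sentence=sentence)})
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",  # June 13, 2023 snapshot
        messages=messages,
        temperature=0.0,             # deterministic, reproducible outputs
        max_tokens=512,              # cap on generated tokens
    )
    return response["choices"][0]["message"]["content"]
```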

Evaluation and Baselines
Evaluation metrics. Different evaluation metrics are usually employed to automatically evaluate MT systems. These metrics are often based on word overlap and/or contextual similarity between references and model outputs. In our work, we employ both types of metrics to evaluate the quality of the various translation systems we consider in our study. Namely, we use BLEU (Papineni et al., 2002), ChrF (Popović, 2015), ChrF++, and TER (Snover et al., 2006). We provide a detailed description of each metric in Appendix 4.1.
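As a concrete illustration, all four metrics can be computed with Hugging Face's `evaluate` package, which we use (see the appendix). The prediction and reference strings below are toy data, not from our test sets.

```python
# Hedged sketch: computing BLEU, ChrF, ChrF++, and TER with the
# Hugging Face `evaluate` package. Example strings are hypothetical.
import evaluate

bleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
ter = evaluate.load("ter")

predictions = ["Najm, shouldn't we know our enemy first?"]
references = [["Najm shouldn't we know our enemy first to know how to act?"]]

print(bleu.compute(predictions=predictions, references=references)["score"])
print(chrf.compute(predictions=predictions, references=references)["score"])                 # ChrF
print(chrf.compute(predictions=predictions, references=references, word_order=2)["score"])   # ChrF++
print(ter.compute(predictions=predictions, references=references)["score"])                  # lower is better
```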
Baselines. We compare the instruction-tuned LLMs to a number of MT systems, including commercial services (Amazon, Google, and Microsoft) as well as the supervised NLLB-200 system (NLLB et al., 2022). We provide more details about each of these systems in Appendix 4.2.
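As one example of how a commercial baseline can be queried programmatically, here is a hedged sketch using Amazon Translate via `boto3`; the region and credential setup are illustrative assumptions, and the other systems are queried through their own APIs or interfaces.

```python
# Hedged sketch: querying the Amazon Translate commercial baseline.
# Region name and credentials setup are illustrative assumptions.
import boto3

client = boto3.client("translate", region_name="us-east-1")

response = client.translate_text(
    Text="...",               # an Arabic source sentence goes here
    SourceLanguageCode="ar",  # Arabic (no dialect-level source codes)
    TargetLanguageCode="en",
)
print(response["TranslatedText"])
```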

Results and Discussion
We evaluate all models on the X→English translation direction, where X is Arabic (MSA and CA) or one of its varieties. As mentioned earlier, we evaluate the LLMs (ChatGPT, GPT-4, and Bard) in n-shot settings. We report BLEU and ChrF++ in Tables 4 and 5, and we report additional metrics in Appendix A. We summarize our main findings here.
Is GPT-4 better than ChatGPT? In most cases, yes. GPT-4 consistently outperforms ChatGPT on many dialects and varieties. However, for ALG, EGY, JOR, UAE, and YEM, ChatGPT performs better than GPT-4 in the 0-shot setting. Overall, GPT-4 0-shot outperforms ChatGPT 0-shot by almost 3 and 4 points in terms of average BLEU and ChrF++ scores, respectively. Additionally, GPT-4 in the 0-shot setting is on par with ChatGPT in the 5-shot setting.
Is ChatGPT/GPT-4 better than Bard? In most cases, yes. For fairness, we compare Bard, ChatGPT, and GPT-4 only under the 0-shot condition.

Wrong Target Language
Ref: Najm shouldn't we know our enemy first to know how to act?

No Translation
Ref: I'd kill one of you, then go turn myself in, and defend myself.

Content Filtering
Ref: And even those men for whom we'll become thin, we have no use for them.

Degeneration
Ref: No, no sir, no sir, no, Burhan should not be executed.
Output: "No, no, my lord, no, my lord, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no"

In the majority of the varieties, either ChatGPT or GPT-4 outperforms the best Bard draft (i.e., Draft 1). Our results show that Bard is better than both of these models in only two cases (i.e., CA and JOR).
When prompted in-context with any number of shots n > 0, however, ChatGPT outperforms both GPT-4 and Bard (again, the latter two are evaluated only under 0-shot in this work).
Is ChatGPT/GPT-4 better than commercial systems? Yes, but only on dialects. We evaluate three commercial translation systems, namely Amazon, Microsoft, and Google Translate. Among commercial systems, we find Google Translate to outperform the others across all varieties except YEM; the average score for Google Translate is 20.58/42.30, compared to 17.42/40.80 and 16.06/38.83 for Microsoft and Amazon, respectively, in terms of BLEU/ChrF++. From our evaluation results in Tables 4 and 5, we observe that commercial systems are better at translating CA and MSA but fail to produce high-quality translations when it comes to dialectal Arabic. ChatGPT (in the few-shot setting) and GPT-4 (in 0-shot) are on par with or better than the best-performing commercial system (i.e., Google Translate) for all Arabic dialects we evaluate. The average BLEU scores of ChatGPT (3-shot) and GPT-4 (0-shot) are 20.70 and 19.76, respectively, compared to 20.58 for Google Translate. However, we notice that Google Translate outperforms ChatGPT and GPT-4 on MSA by a significant margin (while staying behind on the dialects). Hence, we conclude that ChatGPT and GPT-4 are better translators of Arabic dialects than the commercial Google Translate system. We find similar patterns with the other metrics.
Is ChatGPT/GPT-4 better than supervised baselines? Yes. In our study, we use NLLB (NLLB et al., 2022) as the supervised baseline and find that both ChatGPT and GPT-4 outperform it in the 0-shot setting. The average BLEU score for NLLB is 12.02, compared to 17.18 and 19.76 for ChatGPT and GPT-4, respectively. Similar to the commercial systems, the supervised baseline (NLLB) does well on CA and MSA, where it is on par with ChatGPT and GPT-4. However, both ChatGPT and GPT-4 outperform the supervised baseline on dialectal translation by a significant margin.
Is NLLB with dialects as source better than vanilla NLLB? Yes, mostly, when the dialects match. Our supervised baseline, NLLB, can take the dialect of the source into consideration. For example, both the JOR and PAL dialects can be mapped to NLLB's South Levantine source language. Similarly, source dialects like EGY and MOR can be specified in their actual forms, while YEM can be specified as Taizzi. The column NLLB (Dia) in Table 4 (and likewise in Table 5) provides BLEU scores where the NLLB model treats the input as a particular dialect. We find that when the actual dialect matches the appropriate NLLB source dialect, we acquire better performance. One exception is MOR, where the dialect-specific NLLB does poorly compared to treating the input as MSA.
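For illustration, the sketch below shows dialect-aware decoding with NLLB-200 via Hugging Face `transformers`. The checkpoint size and the dialect-to-code mapping are our assumptions based on the FLORES-200 language codes and the mapping described above, not necessarily the paper's exact setup.

```python
# Hedged sketch: dialect-aware decoding with NLLB-200. The checkpoint
# and the dialect-to-code mapping below are assumptions (FLORES-200 codes).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CKPT = "facebook/nllb-200-distilled-600M"  # assumed checkpoint size
DIALECT_TO_CODE = {
    "MSA": "arb_Arab",  # Modern Standard Arabic (vanilla NLLB setting)
    "EGY": "arz_Arab",  # Egyptian Arabic
    "MOR": "ary_Arab",  # Moroccan Arabic
    "JOR": "ajp_Arab",  # South Levantine Arabic
    "PAL": "ajp_Arab",  # South Levantine Arabic
    "YEM": "acq_Arab",  # Ta'izzi-Adeni Arabic
}

model = AutoModelForSeq2SeqLM.from_pretrained(CKPT)

def translate(sentence: str, dialect: str = "MSA") -> str:
    # src_lang tells NLLB which source variety to assume.
    tokenizer = AutoTokenizer.from_pretrained(CKPT, src_lang=DIALECT_TO_CODE[dialect])
    inputs = tokenizer(sentence, return_tensors="pt")
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```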
Is Bard a good instruction-following model? Not always. We evaluate Bard for our translation task using the web interface. We find that Bard can fail to follow the instructions we prompt it with; we further discuss and describe this in Section 6. Bard often provides the main translation output within double quotes (""), which we extract semi-automatically. Additionally, Bard provides three different drafts.
We report results for each draft independently, as well as the average of all three drafts in our results.
6 https://bard.google.com/
7 In order to keep sufficient information to study model behavior, we collect and save all output from Bard (including explanations of translations). Even when we try to prompt Bard to restrict its output to the target translation, it does not follow our instructions.
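The semi-automatic extraction of quoted translations mentioned above can be approximated with a simple pattern match; this sketch (the function name is ours) handles only the straightforward case, with the remainder resolved manually.

```python
# Hedged sketch of the semi-automatic step that pulls the translation out
# of Bard's verbose response when it appears within double quotes.
import re

def extract_quoted_translation(bard_response: str) -> str | None:
    """Return the first double-quoted span, or None if Bard gave none."""
    match = re.search(r'"([^"]+)"', bard_response)
    return match.group(1) if match else None
```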

Human Analysis of Bard Helpfulness
Our experience working with Bard reveals that the model does not always follow human instructions. For this reason, we carry out a human study to assess Bard's helpfulness. We define helpfulness simply as the model's ability to follow human instructions. For each variety of Arabic, we task one native speaker with assigning one of the tags in the set {wrong_lang, no_translation, degeneration, content_filtering} to the model responses. We develop this tagset following a bottom-up approach in which we let the categories emerge from the data. Although this tagset may not be exhaustive, we find that it captures the errors we identify in the model's responsiveness to instructions. We manually label each draft with one or more of the tags in our set of usefulness error tags. Table 3 shows one example of each of these issues.
The most frequent issue with model usefulness is translating into the wrong target language (wrong_lang), followed by not providing any translation at all (no_translation) (Figure 2). The former is predominantly due to translation into MSA instead of English, oftentimes prefacing the output with a canned introductory sentence in Arabic. Interestingly, Bard does not seem to struggle with wrong_lang errors when translating from MSA (and nearly the same holds when translating from CA). Instead, Bard tends to mistake the translation task for a text generation task, generating a couple of paragraphs that start with the input sentence. From Figure 3, it seems that the error rate may be inversely related to the resource availability of a given variety (i.e., varieties for which not much data is publicly available tend to suffer from higher error rates). This observation should be couched with caution, since the LLMs we evaluate remain closed, with little known about their pretraining and finetuning datasets and processes. When we look at each of Bard's drafts separately, we find that the first draft shows a higher number of wrong_lang and content_filtering errors. Meanwhile, draft 2 is the most prone to no_translation errors, with these accounting for 57% of the wrong generations it produces (Figure 4).
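For transparency about how the percentages in Figures 2-4 can be derived, the sketch below computes the three normalizations described in the figure captions; the annotation file and its column names are hypothetical.

```python
# Hedged sketch of the error-distribution computations behind Figures 2-4.
# "annotations.csv" and its columns (variety, draft, error_tag) are hypothetical.
import pandas as pd

df = pd.read_csv("annotations.csv")

# Share of each error type within a variety (rows sum to 100%).
per_variety = pd.crosstab(df["variety"], df["error_tag"], normalize="index") * 100

# Share of each variety within an error type (columns sum to 100%).
per_error = pd.crosstab(df["variety"], df["error_tag"], normalize="columns") * 100

# Share of each (variety, error type) cell among all errors.
overall = pd.crosstab(df["variety"], df["error_tag"], normalize="all") * 100

print(per_variety.round(1))
```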
Other behavior. While Bard has a feature whereby it occasionally adds sources to support the information it provides, these sources can be unrelated. For example, it can cite links to GitHub repositories when translating political news.
It also has a tendency to respond to input sentences that are questions the way it would in a Question Answering (QA) task. Sometimes it also produces an opinion about a sentence it translates, e.g., "This piece of news shocked me; and I am bothered by this tragic accident" (translated from Bard's Arabic aside). Additionally, we find instances where Bard adds details not included in the input sentence, such as rendering an Arabic mention of Musk and Zuckerberg as "Elon Musk and Mark Zuckerberg" (adding the first names).
Bard output format. Bard often provides a detailed breakdown when it performs a translation, either as a list or as a paragraph detailing the meaning of each word or phrase. For sentences that are part of a conversation, Bard also explains the message the speaker is trying to convey and the emotions they are expressing. For sentences from the news domain, Bard provides more context and information about the topic after the translation. We provide examples in Figure 5.

Conclusion
In this work, we evaluate Bard, ChatGPT, and GPT-4 on MT of ten diverse varieties of Arabic (with Arabic as the source language). We extend this evaluation to encompass three commercial systems and a supervised model (NLLB) to juxtapose the performance of LLMs under varying conditions. To assess the LLMs' capacity on truly unseen data, we manually create a multi-dialectal Arabic dataset for MT evaluation. We find that although the LLMs do well on some of the varieties we consider, they struggle especially on those varieties for which little public data exists. As such, these LLMs remain far from inclusive of the different varieties of even languages, such as Arabic, on which they are claimed to perform well. A rigorous human investigation also underscores palpable room for improvement in Bard's adherence to instructions in the context of MT. Our future work includes evaluating the performance of Bard and other LLMs on more Arabic varieties.

Limitations
We can identify a number of limitations for our work, which we list here.
Coverage. We strive to cover as many varieties of Arabic as possible and ensure treating both CA and MSA. However, our dialectal varieties do not cover all Arab countries. Although this is somewhat alleviated by the fact that we include dialects from both the eastern and western parts of the Arab world (i.e., Asia and Africa), future work can consider evaluating LLMs on other Arabic dialects.

Single-reference translations. Due to the laborious nature of manually translating data from the various dialects and the challenge of finding qualified native speakers to carry out these translations, our evaluation dataset involves only a single reference for each source sentence. It remains desirable to create evaluation datasets with 3-5 references for each source sentence.
We alleviate this challenge by reporting results with different metrics, such that the results are based not only on surface-level matching but also on the similarity of the translation pairs. More references would still be better, since different human translators would collectively provide data less prone to human subjectivity or errors.
Evaluation of multiword expressions. While we provide translations of full sentences that may involve multiword expressions, including idioms and proverbs, it would be useful to develop evaluation datasets that focus on these types of expressions, as such data could uncover particular types of model capabilities. For example, a model that is able to translate and explain a proverb can be thought of as somewhat knowledgeable about cultural and pragmatic phenomena.
Evaluation by different lengths. We report results on our data regardless of sentence length. In the future, it would be useful to report results in various sentence-length bins, as longer sentences are usually more challenging for MT models. Again, this is partly alleviated by the fact that we design our datasets to contain sentences at least ten words long from the outset.

Ethics Statement
Intended use. We anticipate that our work will inspire further research exploring the multilingual capabilities of LLMs, especially newly released ones such as Bard. Our findings both highlight some of the strengths of these models and expose some of their weaknesses and limitations. For example, available LLMs still struggle to translate from dialects of even major language collections such as Arabic. Our work also further showcases the limited capability of Bard to follow simple instructions such as those typical of an MT context. Consequently, we believe our work can provide useful feedback for improving both the coverage and usefulness of LLMs.
Potential misuse and bias. Since there exists little-to-no information about the data involved in pretraining and finetuning the LLMs we consider, we cannot safely generalize our findings to varieties of Arabic we have not investigated. We conjecture, however, that the models will perform equally poorly on dialects with no or limited amounts of public data. Although our work does not focus on studying biases in the models or how they handle harmful content (Laskar et al., 2023b), we observe that Bard in particular puts so much emphasis on filtering harmful and potentially offensive language that its instruction tuning interacts negatively with the model's usefulness as an MT system. Overall, our recommendation is not to use the models in applications without careful prior consideration of potential misuse and bias.

Evaluation Metrics

BLEU (Papineni et al., 2002) measures the overlap of n-grams in the machine translation compared to the reference translations, with higher scores denoting better quality. ChrF (Popović, 2015) does the same at the level of character n-grams, and ChrF++ is an extension of ChrF where the word n-gram order is 2. TER (Snover et al., 2006), Translation Error Rate, measures translation quality by counting edit operations between the machine and reference translations, providing a lower score for better quality. We use Hugging Face's implementation of these metrics in the evaluate package. We use all the default parameters unless otherwise specified above.

Baselines

Figure 2: Distribution of usefulness error types Google Bard suffers from when it fails to follow our prompts. Percentages are computed three ways: relative to all error types within each Arabic variety; relative to each error type across all varieties; and relative to all error types across all varieties.

Figure 3: Distribution of Google Bard's error rates across usefulness error categories on Arabic varieties.

Figure 4: Percentage of Google Bard's failure to follow the prompt for each draft, relative to all errors across all drafts.
(a) Google Bard's translation, explanation, and breakdown of one dialectal sentence (from MOR). (b) Google Bard's translation and context of an MSA sentence from the news domain.

Figure 5: Examples of Google Bard's translation output. The bottom parts are cropped for readability.

Table 2: Performance of ChatGPT on the MSA→English translation task. Our concise English prompt outperforms other prompts in BLEU score.
Table 1: Example sentences from each of the Arabic varieties in our new translation evaluation dataset.

Table 3: Examples of errors in Google Bard's ability to follow prompts. For each of the sentences, we use the prompt "Translate the following text from [Variety] Arabic [dialect] into English: <s>".

Table 4: Results in BLEU scores. Average is the simple mean over all our varieties. We get three drafts (D1, D2, D3) from Bard, reporting each independently as well as the average score across the three drafts. NLLB is our supervised baseline with MSA as the source dialect. NLLB (Dia) is where the NLLB model is dialect-specific rather than being an MSA model. SB - supervised baseline, Dia - dialect, Var - varieties, M - model, MST - Microsoft Translation, GT - Google Translate.

Table 5: Results in ChrF++ scores. Average is the simple mean over all our varieties. We get three drafts (D1, D2, D3) from Bard, reporting each independently as well as the average score across the three drafts. NLLB is our supervised baseline with MSA as the source dialect. NLLB (Dia) is where the NLLB model is dialect-specific rather than being an MSA model. SB - supervised baseline, Dia - dialect, Var - varieties, M - model, MST - Microsoft Translation, GT - Google Translate.

Table 6: Results in TER scores. Average is the simple mean over all our varieties. We get three drafts (D1, D2, D3) from Bard, reporting each independently as well as the average score across the three drafts. NLLB is our supervised baseline with MSA as the source dialect. NLLB (Dia) is where the NLLB model is dialect-specific rather than being an MSA model. SB - supervised baseline, Dia - dialect, Var - varieties, M - model, MST - Microsoft Translation, GT - Google Translate.

Table 7: Results in ChrF scores. Average is the simple mean over all our varieties. We get three drafts (D1, D2, D3) from Bard, reporting each independently as well as the average score across the three drafts. NLLB is our supervised baseline with MSA as the source dialect. NLLB (Dia) is where the NLLB model is dialect-specific rather than being an MSA model. SB - supervised baseline, Dia - dialect, Var - varieties, M - model, MST - Microsoft Translation, GT - Google Translate.

Table 8: A summary of related works. We provide a brief description of recent studies aimed at evaluating LLMs on MT tasks. MT - machine translation, TD - translation direction, ZS - zero-shot, FS - few-shot.

Table 9: Length statistics of the dataset across the different Arabic varieties.
Google Translate. Google replaced their Statistical Machine Translation (SMT) system with Google Neural Machine Translation (GNMT) (Wu et al., 2016b), featuring an LSTM with 8 encoder layers and 8 decoder layers, with attention and residual connections. GNMT was trained on Google's internal datasets and supports 133 languages; it is currently powered by Transformers.
Microsoft Translator. Microsoft's translation service uses an NMT model that supports 111 different languages.
Amazon Translation. Amazon Web Services (AWS) offers batch translation with NMT models that can translate to and from 75 languages.
NLLB-200. No Language Left Behind (NLLB et al., 2022) is an open-source Transformer model developed by Meta. It was trained on FLORES-200 (NLLB et al., 2022), NLLB-MD (NLLB et al., 2022), and NLLB-Seed (NLLB et al., 2022) for a total of 202 languages.