TICO-19: the Translation Initiative for Covid-19

The COVID-19 pandemic is the worst pandemic to strike the world in over a century. Crucial to stemming the tide of the SARS-CoV-2 virus is communicating to vulnerable populations the means by which they can protect themselves. To this end, the collaborators forming the Translation Initiative for COvid-19 (TICO-19) have made test and development data available to AI and MT researchers in 35 different languages in order to foster the development of tools and resources for improving access to information about COVID-19 in these languages. In addition to 9 high-resource "pivot" languages, the team is targeting 26 lesser-resourced languages, in particular languages of Africa, South Asia, and South-East Asia, whose populations may be the most vulnerable to the spread of the virus. The same data is translated into all of the languages represented, meaning that testing or development can be done for any pairing of languages in the set. Further, the team is converting the test and development data into translation memories (TMXs) that can be used by localizers from and to any of the languages.


Introduction
The COVID-19 pandemic marks the worst pandemic to strike the world since 1918. At the time of this writing (July 1st, 2020), the SARS-CoV-2 coronavirus responsible for COVID-19 has infected over ten million people worldwide, with over half a million deaths. While these numbers are likely underreported, they are growing at an alarming rate, and many millions of people could become infected or perish without proper prevention measures. Effective communication from health authorities is essential to protect at-risk populations, slow down the spread of the disease, and decrease its morbidity and mortality (UNOCHA, 2020). Yet, preventive measures such as stay-at-home orders, social distancing, and requirements to wear personal protective equipment (e.g. masks, gloves, etc.) have proven difficult to relay. This is before accounting for the difficulty of disseminating correct technical information about the disease, such as symptoms (e.g., fever, chills, etc.), specifics about testing (e.g., viral vs. antibody testing), and treatments (e.g., intubation, plasma transfusion).
Collaborators in the initiative include Translators without Borders, Carnegie Mellon University, Johns Hopkins University, Amazon Web Services, Appen, Facebook, Google, Microsoft, and Translated. The dataset, translation memories, and additional resources are freely available online at http://tico-19.github.io/; as the project continues and we create data for more languages, we will keep updating this paper as well as the project's website.
While official communications from the World Health Organization (WHO) are constantly published and revised, they are mostly limited to major languages. This has resulted in a vacuum in many languages that has been filled by an "infodemic" of misinformation, as described by the WHO. Non-governmental organizations (NGOs) such as Translators without Borders (TWB) play an important role in delivering multilingual communication in emergencies such as the COVID-19 pandemic, but their reach and capacity have been outstripped by the needs presented by the pandemic. To date, TWB has translated over 3.5 million words with over 80 non-profit organizations for more than 100 language pairs as part of their COVID-19 response.
Translation technologies such as automatic Machine Translation (MT) and Computer Assisted Translation (CAT) present unique opportunities to scale the throughput of human translators. However, given the sensitivity of the content, it is critical that the translations produced automatically are of the highest possible quality.
The Translation Initiative for COvid-19 (TICO-19) effort marks a unique collaboration between public and private entities that came together shortly after the beginning of the pandemic. The focus of TICO-19 is to enable the translation of content related to COVID-19 into a wide range of languages. First, we make available a collection of translation memories and technical glossaries so that language service providers (LSPs), translators, and volunteers can make use of them to expedite their work and ensure consistency and accuracy. Second, we provide an open-source, multilingual benchmark set (which includes data for very low-resource languages) specialized in the medical domain, which is intended to track the quality of current machine translation systems, thus enabling future research in the area. Lastly, we provide monolingual and bilingual resources for MT practitioners to use in order to advance the state of the art in medical and humanitarian Machine Translation, as well as other natural language processing (NLP) applications.
Our hope is that our work will, in the short term, enable the translation of important communications into multiple languages, and that in the long term it will serve to foster research on MT for specialized content into low-resource languages. Through these resources, we hope that our society will be better prepared to quickly respond to translation needs in the midst of future crises (à la Lewis et al. (2011)).

The Value of Translation Technologies in Crisis Scenarios
During a crisis, whether it is local to one region or is a worldwide pandemic, communicating effectively in the languages and formats people understand is central to effective programs on the ground. For example, as part of the effort to control the spread of COVID-19, the Global Humanitarian Response Plan recognizes community engagement in relevant languages as a key strategy (UNOCHA, 2020). In some countries, this will be all the more vital because information will be the main defense against the disease, and particular effort will be needed to make it accessible and grounded in local culture and context. Among these are countries where large sections of the population do not speak the dominant language.
The World Health Organization (WHO) declared COVID-19 a pandemic on March 11th, 2020. The TICO-19 collaborators came together in the days following and first met as a group (over Zoom) on March 20th. It cannot be overstated how rapidly this collaboration came together and how seamlessly the participants, many erstwhile competitors, have worked in harmony and without animosity. It is truly a testament to the needs of the greater good outweighing personal differences or potentially conflicting objectives.
Historically, MT, NLP, and translation technologies have played a crucial role in crisis scenarios. The response to the Haitian earthquake in 2010 was notable for the broad use of technology in the humanitarian response, relying largely on crowdsourced translations and geolocation, but notably, translation technology was also used. In the days following the earthquake, Haitian citizens were encouraged to text messages requesting assistance to "4636", and as many as 5,000 messages were texted to this number per hour. Unfortunately for the aid agencies, whose dominant languages were English and French (aid agencies included the US Navy, the Red Cross, and Doctors without Borders), most of the SMS messages were in Haitian Kreyòl. Quickly, the Haitian Kreyòl-speaking diaspora around the world was activated by the Mission 4636 consortium to translate and geolocate the SMS messages (Munro, 2010), and the translated messages were handed off to aid agencies for triage and action. The Mission 4636 infrastructure included a high-precision rule-based MT system (Lewis et al., 2011), and within days to weeks after the earthquake, statistical MT engines were brought online by Microsoft (Lewis, 2010) and Google.
Translation technology continues to be used in a variety of crisis and relief scenarios. Notable among these is Translators without Borders' (TWB) use of translation memories for translating to a number of under-resourced languages in relief settings. Likewise, the Standby Task Force, who are activated in a variety of relief settings, note the use of MT in various deployments around the world, e.g., for Urdu in the Pakistan earthquake of 2011 and for Spanish in the Ecuador earthquake in 2016.
The EU-funded INTERnAtional network on Crisis Translation (INTERACT) project, started a couple of years before the COVID-19 pandemic, focused on crisis translation, specifically in health crises such as pandemics, aiming to improve resilience in times of crisis through communication, ultimately with the goal of reducing loss of life. Likewise, during the current pandemic, several community-driven efforts have sprung up to fulfil the need for information communication. The Endangered Languages Project, for example, has collected community-produced translations of public health information in hundreds of languages in various formats.
What is not tracked is the degree to which publicly available MT tools and resources are used in crisis and relief settings, e.g., translation apps and tools from Amazon, Google, and Microsoft, or the translation feature built into Facebook (e.g., automatically translating posts). The authors suspect use may be broad, but there are no published accounts documenting just how broadly and how much these tools are used in crises. Tantalizing evidence of the use of publicly available tools was noted by Lewis et al. (2011), who documented traffic in the Microsoft Translator apps in the weeks following the Haitian earthquake: they noted that at least 5 percent of the Haitian Kreyòl traffic was relief-related. It is likely that Google's and Microsoft's apps are used even when cell phone infrastructure is unavailable or destroyed, since the tools permit users to download models to their devices so they can perform offline translations.
In crises, it is clear that organizations need the capacity to communicate critical information and key messages in the languages people understand, at speed and at scale. Crisis-affected communities could access content in local languages through various channels such as SMS, online chatbots, or more traditional printed materials. Their questions and feedback can be used to refine content to better meet their needs. Likewise, relief agencies need access to SMS and other communiqués in local languages in order to more effectively and equitably distribute aid.
MT can help the various actors translate and disseminate essential communications in a timely manner without the need to wait on human translators. This is particularly important for low-resource languages, where professional translators are not readily available. Domain-specific MT can also assist translators with the right terminology to convey the correct response and standardize concepts. Furthermore, people who are unable to understand major languages could get access to vital information (such as news sources, websites, etc.) first-hand via an MT-driven tool set.
However, to be useful for translating specialized content such as medical texts, we require that automatic translations be of the highest possible accuracy. To advance the research in Machine Translation, we require both high-quality benchmark sets and access to basic training resources, both monolingual and parallel. Likewise, translation memories in a broader set of languages can help localizers around the world translate into these languages. In the remainder of this paper we describe the resources created by the TICO-19 initiative, and some evaluations against them.

The TICO-19 Translation Benchmark
We created the TICO-19 benchmark with three criteria in mind: diversity, relevance and quality. First, we sampled from a variety of public sources containing COVID-19 related content, representing different domains. Second, to make our content relevant for relief organizations, we chose the languages to translate into based on the requests from relief organizations on-the-ground. Third, we established a stringent quality assurance process, to ensure that the content is translated according to the highest industry standard.

COVID-19 source data
The translation benchmark was created by combining English open-source data from various sources, listed in Table 1. We took special care to diversify the domains and sources of the data. We provide a concise summary here and detailed statistics for every source in Appendix A:
• PubMed: we selected 6 COVID-19-related scientific articles from PubMed, for a total of 939 sentences.
• Wikipedia: we selected 15 COVID-19-related articles from the English Wikipedia, on topics ranging from responses to the pandemic, drug development, and testing, to coronaviruses in general.
• Wikinews, Wikivoyage, Wikisource: 6 COVID-19-related entries from Wikinews, one article from Wikivoyage summarizing travel restrictions, and two entries from Wikisource (an executive order and an internal Wikipedia communiqué). These data respectively cover the domains of news, travel advisories, and government/organization announcements.

Languages
We translated the above English data into 35 languages. In some cases, this was achieved through pivot languages, i.e., the content was translated into the pivot language first (e.g., French, Farsi) and then translated into the target language (e.g., Congolese Swahili, Dari). The languages were selected according to various criteria, with the main consideration being the potential impact of our collected translations and the humanitarian priorities of TWB. The translation languages include:

Wikivoyage
Due to the spread of the disease, you are advised not to travel unless necessary, to avoid being infected, quarantined, or stranded by changing restrictions and cancelled flights.

Wikipedia
Drug development is the process of bringing a new infectious disease vaccine or therapeutic drug to the market once a lead compound has been identified through the process of drug discovery.

Wikisource
The federal government has identified 16 critical infrastructure sectors whose assets, systems, and networks, whether physical or virtual, are considered so vital to the United States that their incapacitation or destruction would have a debilitating effect on security, economic security, public health or safety, or any combination thereof.

The latter two sets are primarily languages of Africa, and South and South-East Asia, whose communities, according to on-the-ground organizations, may be most susceptible to the spread of the virus and its potentially disastrous ramifications, mostly due to lack of access to information and communication in the community languages. They are also overwhelmingly under-resourced languages; in fact, some of the languages have remained untouched by the AI and MT communities, and have no known tools or resources developed for them.
All of the test and development documents are sentence aligned across all of the languages, which allows for any pairing of languages for testing or development purposes. This was done by design, in order to facilitate tool and resource development in and across any of the targeted languages. For example, an MT developer could develop translation systems for French to/from Congolese Swahili, Arabic to/from Kurdish, Urdu to/from Pashto, Hindi to/from Marathi, Amharic to/from Oromo, or Chinese to/from Malay, among the 1296 possible pairings. Note that as the project continues and as we create data for more languages we will keep updating this paper as well as the project's website.

Quality Assurance
It has been observed that translation from and into low-resource languages requires additional automatic and manual quality checks (Guzmán et al., 2019). To obtain the highest possible quality, we implemented a two-step human quality control process. First, each document is sent for translation to language service providers (LSPs), where the translation is performed. After translation, the dataset goes through a process of editing, in which each sentence is thoroughly vetted by qualified professionals familiar with the medical domain, whenever available. In case of discrepancies, a process of arbitration is followed to resolve disagreements between translators and editors.
After editing, a selected fraction of the data (18%, 558 sentences) undergoes a second independent quality assurance process. To ensure quality in the hardest-to-translate data, the scientific medical content from PubMed was upsampled so that it comprises 329 of the 558 doubly-checked sentences (almost 59%). The exact documents that comprise our second quality assurance set are listed in Appendix B.
The quality of the translations was checked, and reworks were made until every translation set was rated above 95% across all languages, before any additional subsequent edits. Some low-resource languages like Somali, Dari, Khmer, Amharic, Tamil, Farsi, and Marathi required several rounds of translation to reach acceptable performance. The hardest part, unsurprisingly, proved to be the PubMed portion of the benchmark. Our QA process revealed that in most cases the problems arose when the translators did not have any medical expertise, which led them to misunderstand the English source sentence and often opt for sub-par literal or word-for-word translations. We provide additional details with the estimated quality per language in Appendix C. We note that all mistakes identified in this subset have been corrected in the final released dataset, and that all sentences that underwent the QA process are part of the test portion of our benchmark.
We additionally release the sampled dataset along with detailed error annotations and corrections. Whenever an error was noted in the validation sample, it was classified as one of the following categories: Addition/Omission, Grammar, Punctuation, Spelling, Capitalization, Mistranslation, Unnatural translation, and Untranslated text. The severity of the error was also classified as minor, major, or critical. Although small in size (at most 558 sentences in each translation direction), we hope that releasing these annotations will also invite automatic quality estimation and post-editing research for diverse under-resourced languages.
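As a rough illustration of how the released error annotations might be consumed, the sketch below tallies errors by category and severity. The record format (a list of dicts with `category` and `severity` keys) is our assumption for illustration, not the actual release schema.

```python
# Tally error annotations by category and severity.
# The annotation record format here is an illustrative assumption.
from collections import Counter

CATEGORIES = {"Addition/Omission", "Grammar", "Punctuation", "Spelling",
              "Capitalization", "Mistranslation", "Unnatural translation",
              "Untranslated text"}
SEVERITIES = {"minor", "major", "critical"}

def tally(annotations):
    """Count annotated errors per category and per severity level."""
    by_cat, by_sev = Counter(), Counter()
    for ann in annotations:
        # Guard against labels outside the paper's taxonomy.
        assert ann["category"] in CATEGORIES and ann["severity"] in SEVERITIES
        by_cat[ann["category"]] += 1
        by_sev[ann["severity"]] += 1
    return by_cat, by_sev
```

Such tallies could, for example, feed a per-language breakdown of critical vs. minor errors when comparing translation directions.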

Translator Resources
Translation Memories Because of the breadth of languages covered by TICO-19, and the fact that so many are under-resourced, the translations themselves can be of significant value to localizers. As part of the effort, the TICO-19 collaborators have converted the translated data into translation memories, cast as TMX files, for all English-X pairings, as well as some other pairings of languages focusing on potential local needs (e.g. French-Congolese Swahili, Farsi-Dari, and Kurdish Kurmanji-Sorani). These TMX files, in addition to the test and development data, have been made available to the public through the project's website.
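Casting aligned segment pairs as TMX is mechanical; a minimal sketch using only the Python standard library is shown below. The header attribute values (tool name, version) are placeholders, and this is not the script the collaborators used.

```python
# Sketch: emit a minimal TMX 1.4 document from aligned (src, tgt) pairs.
# Header attribute values such as the creation tool are placeholders.
import xml.etree.ElementTree as ET

def to_tmx(pairs, src_lang="en", tgt_lang="fr"):
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "srclang": src_lang, "adminlang": "en",
        "datatype": "plaintext", "segtype": "sentence",
        "o-tmf": "plain",
        "creationtool": "tico19-sketch", "creationtoolversion": "0.1",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

The resulting file can be loaded directly by most CAT tools that accept TMX.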
Terminologies Two sets of translation terminologies were provided by Facebook and Google (the complete set of the English source terms and of the translated languages are listed in Appendix D):
• the Facebook terminology includes 364 COVID-19-related terms translated into 92 languages/locales.
• the Google terminology includes 300 COVID-19-related terms translated from English into 100 languages, and a total of 1300 terms from 27 languages translated into English (for a total of approximately 30k terms).
Additional Translations Translators without Borders (TWB) worked with its network of translators to provide translations in hard-to-source languages (e.g. Congolese Swahili and Kanuri). It also provided COVID-19-specific sources from its diverse humanitarian partners to augment the dataset. This augmented dataset will be available under license on TWB's Gamayun portal.

MT Developer Resources
As part of our project, the CMU team also collected COVID-19-related monolingual data in multiple languages. These data are available online, but we note that some of them might not be available under the same license as our datasets (and hence might not be appropriate for commercial system development). They are detailed in the next sections.

Monolingual
Wikipedia Data COVID-19-related data from Wikipedia were scraped in 37 languages. COVID-19 terms were used as queries (language-specific, in most cases), retrieving the textual data from the returned articles (i.e. stripping out any Wikipedia markup, metadata, images, etc.). The data range from less than 1K sentences (around 10K tokens) for languages like Hindi, Bengali, or Afrikaans, to as many as 8K sentences for Spanish (160K tokens) or Hungarian (120K tokens).
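The markup-stripping step can be sketched as below. This handles only a few common MediaWiki constructs (templates, internal links, bold/italics, headings) and is purely illustrative of what the scraping involves, not the actual pipeline used.

```python
# Sketch: strip basic MediaWiki markup from wikitext before sentence
# and token counting. Only a few common constructs are handled.
import re

def strip_wikitext(text):
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                      # {{templates}}
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # [[links|label]]
    text = re.sub(r"'{2,}", "", text)                               # ''italics'', '''bold'''
    text = re.sub(r"==+ *([^=]+?) *==+", r"\1", text)               # == headings ==
    return re.sub(r"[ \t]+", " ", text).strip()                     # collapse spacing
```

A production scraper would additionally drop tables, references, and file/image links, typically with a dedicated wikitext parser rather than regexes.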
News Data COVID-19-related news articles (as identified by keyword search) were scraped from three news organizations that publish multilingually through their world services. Specifically, the collected data include articles from the BBC World Service (22 languages), the Voice of America (31 languages), and Deutsche Welle (29 languages).

Parallel
We have also scraped a very small amount of available parallel data, mostly from public service announcements from NGOs and national/state government sources. Specifically, we scraped Public Service Announcements by the Canadian government in 21 languages (English, French, and First Nations languages), a fact sheet provided by King County (Washington, USA) in 12 languages, a COVID-19 advice sheet from the Doctors of the World in 47 languages, data from the COVID-19 Myth Busters in World Languages project in 28 languages, and a medical prevention and treatment handbook from Zhejiang University School of Medicine in 10 languages. Unfortunately, the total amount of data from these sources does not exceed a few hundred sentences in each direction, so it is not enough for system development; it could, though, be potentially useful as an additional smaller evaluation set or for terminology extraction.

Baseline Results and Discussion
We present baseline results in some language directions, using the following systems: For translation between English and Chinese we also use a system trained on WMT'18 data (Bojar et al., 2018).
3. We train systems between English and ar, fa, mr, om, zu on publicly available corpora from OPUS (referred to as "our OPUS models").
4. We train multilingual systems between English and hi, ms, and ur on a TED talks dataset (Qi et al., 2018) ("our Multilingual TED models").
Results Table 3 presents results in translation from English to all languages whose MT systems we were able to train or use, while Table 4 includes results in the opposite direction. Similarly, Tables 5 and 6 show the quality of MT systems, as measured on our test set, from and to French for a few languages. All tables also include a breakdown of the quality for each test domain. Note, however, that each sub-domain constitutes a smaller test set than the complete set, and hence any result should properly take into account statistical significance measures.

Discussion First and foremost, the main takeaway from these baseline results lies not in the above-mentioned tables, but in the languages that are not present in them. We were unable to find either pre-trained MT systems or publicly available parallel data with which to train our own baselines for Dari, Pashto, Tigrinya, Nigerian Fulfulde, Kurdish Sorani, Myanmar, Oromo, Dinka, Nuer, and isiZulu. (Although a small amount of parallel data exists for English-isiZulu, Abbott and Martinus (2019) report very low results on general benchmarks, as the parallel data requires cleaning.) This highlights the need for serious data collection efforts to expand the availability of data for large swathes of under-represented communities and languages. Beyond this obvious limitation, the existing systems' results highlight the divide between high-resource language pairs and low-resource ones. For all European languages (Spanish, French, Portuguese, Russian), as well as for Chinese and Indonesian, MT produces very competitive results, with BLEU scores between 25 and 49 (note that BLEU scores on test sets in different languages, e.g. in Tables 3 and 5, are not directly comparable). In contrast, the output translations for languages like Lingala, Luganda, Marathi, or Urdu are quite disappointing, with extremely low BLEU scores under 10.
The existence of pre-trained systems or of parallel data, hence, is not enough; this level of quality is basically unusable for any real-world deployment either for translators or for end-users.
A comparison of the results across different domains is also revealing. BLEU scores are generally higher on Wikipedia and news articles; this is unsurprising, as most MT systems rely on such domains for training, since they naturally produce parallel or quasi-parallel data. Our PubMed data pose a more challenging setting, but perhaps not as challenging as we initially expected, although the results vary across languages. In translating from English to French, for instance, the difference between Wikipedia and PubMed is more than 14 BLEU points (we note again that these scores are not directly comparable, as the underlying test data are different), while the differences are smaller for e.g. Indonesian-English (6 BLEU points) or Russian-English (4 BLEU points).
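When reproducing such per-domain breakdowns, the test set first has to be split by domain before each subset is scored with an off-the-shelf metric such as sacreBLEU. A minimal sketch of the splitting step (the per-sentence domain labels are an assumed input format):

```python
# Sketch: partition system outputs and references into per-domain
# subsets (PubMed, Wikipedia, news, ...) for separate scoring.
from collections import defaultdict

def split_by_domain(hyps, refs, domains):
    """Group parallel hypothesis/reference lists by domain label."""
    buckets = defaultdict(lambda: ([], []))  # domain -> (hyps, refs)
    for hyp, ref, dom in zip(hyps, refs, domains):
        buckets[dom][0].append(hyp)
        buckets[dom][1].append(ref)
    return dict(buckets)
```

Each bucket can then be passed to a corpus-level BLEU implementation; scoring each domain separately is what surfaces the Wikipedia-vs-PubMed gaps discussed above.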
Future Work Several concrete steps have the potential to improve MT for all languages in our benchmark. All results we report are with MT systems trained on general-domain data or on particularly out-of-domain data (such as TED talks); domain adaptation techniques using small in-domain parallel resources or monolingual source- or target-side data should be able to increase performance. Incorporating the terminologies as part of the training and inference schemes of the models could also ensure faithful and consistent translations of the COVID-19-specific scientific terms that might not naturally appear in other training data or might appear in different contexts.
Another direction for improvement involves multilingual NMT models trained on massive web-based corpora (Aharoni et al., 2019), which have improved translation accuracy particularly for languages at the lower end of data availability. Also viable are methods relying on multilingual model transfer, which can target languages with extremely small amounts of data, as in Chen et al.; however, such approaches do not yet cover enough low-resource languages. We hope that the availability of multilingual representations such as Multilingual BERT (Devlin et al., 2018) and XLM-R (Conneau et al., 2019) will empower the creation of parallel corpora for low-resource languages through low-resource corpus filtering (Koehn et al., 2019) or other approaches.
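Similarity-based corpus filtering of the kind studied in the shared task cited above can be sketched as follows: keep only sentence pairs whose cross-lingual embeddings are sufficiently close. The embedding function is left abstract and the threshold is an arbitrary illustrative value.

```python
# Sketch: filter a noisy mined corpus by cosine similarity of
# cross-lingual sentence embeddings. `embed` is any function mapping
# a sentence to a vector; threshold and setup are illustrative only.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def filter_pairs(pairs, embed, threshold=0.8):
    """Keep (src, tgt) pairs whose embeddings exceed the threshold."""
    return [(s, t) for s, t in pairs
            if cosine(embed(s), embed(t)) >= threshold]
```

In practice the embeddings would come from a cross-lingual encoder, and margin-based scoring is often preferred over a raw cosine cutoff.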

Conclusion
Enabling efficient and accurate communication through translation still has a long way to go for the majority of the world's languages, and particularly the most vulnerable ones. With this effort we only address a fraction of the needs for a fraction of the world's languages. Nevertheless, we hope that the MT resources that we release will have an immediate impact for the languages we cover. More importantly, the benchmark we release will allow the MT research community, both academic and industrial, to be better prepared for the next crisis in which translation technologies will be needed.

B Quality Assurance Documents
The 558 sentences of our quality assurance set comprise examples from almost all sub-domains of the corpus. Specifically, the set includes 40 sentences from the conversational data, one PubMed document (PubMed 8), two of the Wikinews documents (Wikinews 1, Wikinews 3), and one complete Wikipedia article (Wikipedia handpicked 4).

C Expected Quality per Language
For all translation directions the quality was quite good, with average quality scores above 95%. The detailed list of the quality evaluations on our sampled documents, which can be considered a proxy for the overall translation quality of our whole dataset, is available in Table 8.

Table 8: QA score across languages. We report the initial QA score and whether re-work was required on the produced translations (beyond the QA sample, due to serious translation errors). In cases where the initial QA yielded very poor results, the translations were corrected in their entirety and a new QA process was performed; in those cases we report both the initial and the final QA results. Note: an "N/A" final QA score indicates that the final score is temporarily not available and we will report it in an updated version of the paper.