Creating Lexical Resources for Endangered Languages

This paper examines approaches to generate lexical resources for endangered languages. Our algorithms construct bilingual dictionaries and multilingual thesauruses using public Wordnets and a machine translator (MT). Since our work relies on only one bilingual dictionary between an endangered language and an "intermediate helper" language, it is applicable to languages that lack many existing resources.


Introduction
Languages around the world are becoming extinct at a record rate. The Ethnologue organization reports 424 languages as nearly extinct and 203 languages as dormant, out of a total of 7,106 recorded languages. Many other languages are becoming endangered, a state that is likely to lead to their extinction without determined intervention. According to UNESCO, "a language is endangered when its speakers cease to use it, use it in fewer and fewer domains, use fewer of its registers and speaking styles, and/or stop passing it on to the next generation...". In America, UNESCO reports 134 endangered languages, e.g., Arapaho, Cherokee, Cheyenne, Potawatomi and Ute.
One of the hallmarks of a living and thriving language is the existence and continued production of "printed" (now extended to online presence) resources such as books, magazines and educational materials, in addition to oral traditions. There is some effort afoot to document, record and archive endangered languages. Documentation may involve the creation of dictionaries, thesauruses, and text and speech corpora. One possible way to resuscitate these languages is to make them more easily learnable for the younger generation. To learn languages and use them well, tools such as dictionaries and thesauruses are essential. Dictionaries are resources that empower the users and learners of a language. Dictionaries play a more substantial role than usual for endangered languages and are "an instrument of language maintenance" (Gippert et al., 2006). Thesauruses are resources that group words according to similarity (Kilgarriff, 2003). For speakers and students of an endangered language, multilingual thesauruses are also likely to be very helpful.
This study focuses on examining techniques that leverage existing resources for "resource-rich" languages to build lexical resources for low-resource languages, especially endangered languages. The only resource we need is a single available bilingual dictionary translating the given endangered language to English. First, we create a reverse dictionary from the input dictionary using the approach in (Lam and Kalita, 2013). Then, we generate additional bilingual dictionaries translating from the given endangered language to several additional languages. Finally, we discuss the first steps toward constructing multilingual thesauruses encompassing endangered and resource-rich languages. To handle word sense ambiguity, we exploit Wordnets in several languages. We experiment with two endangered languages, Cherokee and Cheyenne, and some resource-rich languages such as English, Finnish, French and Japanese. Cherokee is an Iroquoian language spoken by 16,000 Cherokee people in Oklahoma and North Carolina. Cheyenne is a Native American language spoken by 2,100 Cheyenne people in Montana and Oklahoma.
The remainder of this paper is organized as follows. Dictionaries and thesauruses are introduced in Section 2. Section 3 discusses related work. In Section 4 and Section 5, we present approaches for creating new bilingual dictionaries and multilingual thesauruses, respectively. Experiments are described in Section 6. Section 7 concludes the paper.

Dictionaries vs. Thesauruses
A dictionary or a lexicon is a book (now, in electronic database formats as well) that consists of a list of entries sorted by the lexical unit. A lexical unit is a word or phrase being defined, also called the definiendum. A dictionary entry or a lexical entry simply contains a lexical unit and a definition (Landau, 1984). Given a lexical unit, the definition associated with it usually contains parts of speech (POS), pronunciations, meanings, example sentences showing the use of the source words and possibly additional information. A monolingual dictionary covers only one language, such as The Oxford English Dictionary, while a bilingual dictionary covers two languages, such as the English-Cheyenne dictionary. A lexical entry in a bilingual dictionary contains a lexical unit in a source language and equivalent words or multiword expressions in the target language, along with optional additional information. A bilingual dictionary may be unidirectional or bidirectional.
Thesauruses are specialized dictionaries that store synonyms and antonyms of selected words in a language. Thus, a thesaurus is a resource that groups words according to similarity (Kilgarriff, 2003). However, a thesaurus is different from a dictionary. (Roget, 1911) describes the organization of words in a thesaurus as "... not in alphabetical order as they are in a dictionary, but according to the ideas which they express.... The idea being given, to find the word, or words, by which that idea may be most fitly and aptly expressed. For this purpose, the words and phrases of the language are here classed, not according to their sound or their orthography, but strictly according to their signification". In particular, a thesaurus contains a set of descriptors, an indexing language, a classification scheme or a system vocabulary (Soergel, 1974). A thesaurus also consists of relationships among descriptors. Each descriptor is a term, a notation or another string of symbols used to designate the concept. Examples of thesauruses are Roget's International Thesaurus (Roget, 2008), the Open Thesaurus and the one at thesaurus.com.
We believe that the lexical resources we create are likely to help endangered languages in several ways. They can serve as educational tools for language learning within and outside the community of speakers of the language. The dictionaries and thesauruses we create can help in developing parsers for these languages, in addition to assisting machine or human translators in translating the rich oral or possibly limited written traditions of these languages into other languages. We may also be able to construct mini pocket dictionaries for travelers and students.

Related work
Previous approaches to create new bilingual dictionaries use intermediate dictionaries to find chains of words with the same meaning. Then, several approaches are used to mitigate the effect of ambiguity. These include consulting the dictionary in the reverse direction (Tanaka and Umemura, 1994) and computing ranking scores, variously called a semantic score (Bond and Ogura, 2008), an overlapping constraint score, a similarity score (Paik et al., 2004) and a converse mapping score (Shaw et al., 2013). Other techniques to handle the ambiguity problem merge results from several approaches: merging candidates from lexical triangulation (Gollins and Sanderson, 2001), creating a link structure among words (Ahn and Frampton, 2006) and building graphs connecting translations of words in several languages (Mausam et al., 2010). Researchers also merge information from several sources such as bilingual dictionaries and corpora (Otero and Campos, 2010) or a Wordnet (István and Shoichi, 2009) and (Lam and Kalita, 2013). Some researchers also extract bilingual dictionaries from corpora (Ljubešić and Fišer, 2011) and (Bouamor et al., 2013). The primary similarity among these methods is that they either work with languages that already possess several lexical resources or take advantage of related languages (that have some lexical resources) by using such languages as intermediaries. The accuracies of bilingual dictionaries created from several available dictionaries and Wordnets are usually high. However, it is expensive to create such original lexical resources, and they do not exist for many languages. For instance, we cannot find any Wordnet for Cherokee (chr) or Cheyenne (chy). In addition, these existing approaches can only generate one or just a few new bilingual dictionaries from at least two existing bilingual dictionaries.
(Crouch, 1990) clusters documents first using a complete link clustering algorithm and generates thesaurus classes or synonym lists based on user-supplied parameters such as a threshold similarity value, number of documents in a cluster, minimum document frequency and specification of a class formation method. (Curran and Moens, 2002a) and (Curran and Moens, 2002b) evaluate the performance and efficiency of thesaurus extraction methods and also propose an approximation method that provides better time complexity with little loss in accuracy. (Ramírez et al., 2013) develop a multilingual Japanese-English-Spanish thesaurus using freely available resources: Wikipedia and Wordnet. They extract translation tuples from Wikipedia articles in these languages, disambiguate them by mapping to Wordnet senses, and extract a multilingual thesaurus with a total of 25,375 entries.
One thing to note about all these approaches is that they are resource hungry. For example, (Lin, 1998) works with a 64-million word English corpus to produce a high quality thesaurus with about 10,000 entries. (Ramírez et al., 2013) have the entire Wikipedia at their disposal, with millions of articles in three languages, although for experiments they use only about 13,000 articles in total. When we work with endangered or low-resource languages, we do not have the luxury of collecting such big corpora or accessing even a few thousand articles from Wikipedia or the entire Web. Many such languages have no or very limited Web presence. As a result, we have to work with whatever limited resources are available.

Creating new bilingual dictionaries
A dictionary Dict(S,T) between a source language S and a target language T has a list of entries. Each entry contains a word s in the source language S, its part-of-speech (POS) and one or more translations in the target language T. We call such a translation t. Thus, a dictionary entry is of the form $\langle s_i, \mathrm{POS}, t_{i1} \rangle$, $\langle s_i, \mathrm{POS}, t_{i2} \rangle$, .... This section examines approaches to create new bilingual dictionaries for endangered languages from just one dictionary Dict(S,I), where S is the endangered source language and I is an "intermediate helper" language. We require that the language I has an available Wordnet linked to the Princeton Wordnet (PWN) (Fellbaum, 1998). Many endangered languages have a bilingual dictionary, usually to or from a resource-rich language like French or English, which is the intermediate helper language in our experiments. We assume that we can find only one unidirectional bilingual dictionary translating from a given endangered language to English.
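As a minimal sketch of the entry format described above, a dictionary can be held as a list of (s, POS, t) triples. The class name `Entry`, the helper `group_by_source` and the toy entries are illustrative assumptions, not part of the paper's implementation; the Cherokee words are taken from the figures, but their pairing and POS tags here are only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    """One dictionary entry <s, POS, t> (hypothetical representation)."""
    source: str  # word s in the source language S
    pos: str     # part of speech
    target: str  # one translation t in the target language T


# Toy fragment of Dict(chr,eng); illustrative data only.
dict_chr_eng = [
    Entry("amequohi", "n", "ocean"),
    Entry("ustalanali", "n", "sea"),
    Entry("ayvtseni", "n", "throat"),
]


def group_by_source(entries):
    """Map each (source word, POS) pair to the list of its translations."""
    grouped = defaultdict(list)
    for e in entries:
        grouped[(e.source, e.pos)].append(e.target)
    return dict(grouped)


grouped = group_by_source(dict_chr_eng)
print(grouped[("amequohi", "n")])
```

Keeping one triple per translation mirrors the entry form in the text and makes reversal a simple swap of the first and last fields.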

Generating a reverse bilingual dictionary
Given a unidirectional dictionary Dict(S,I) or Dict(I,S), we reverse the direction of the entries to produce Dict(I,S) or Dict(S,I), respectively. We apply an approach called Direct Reversal with Similarity (DRwS), proposed in (Lam and Kalita, 2013) to create a reverse bilingual dictionary from an input dictionary.
The DRwS approach computes the distance between translations of entries by measuring their semantic similarity, the so-called simValue. The simValue between two phrases is calculated by comparing the ExpansionSet of every word in one phrase with the ExpansionSet of every word in the other phrase. The ExpansionSet of a phrase is the union of the synset, synonym set, hyponym set, and/or hypernym set of every word in it. The synset, synonym, hyponym and hypernym sets of a word are obtained from PWN. The greater the simValue between two phrases, the more semantically similar the phrases are. According to (Lam and Kalita, 2013), the DRwS approach produces the "best" reverse dictionary when the simValue threshold is set to 0.9, i.e., when only pairs with a simValue equal to or greater than 0.9 are treated as similar.
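The following sketch illustrates the idea of comparing ExpansionSets. It rests on two loud assumptions: the table `TOY_RELATIONS` is a hand-made stand-in for PWN lookups, and the overlap is scored with a Jaccard ratio, which is only one plausible choice because the exact simValue formula of (Lam and Kalita, 2013) is not reproduced here. The values it prints are therefore not the simValues reported in the paper.

```python
# Hand-made stand-in for PWN: word -> related words
# (synonyms, hypernyms, hyponyms). Illustrative data only.
TOY_RELATIONS = {
    "ocean": {"sea", "body_of_water"},
    "sea": {"ocean", "body_of_water"},
    "throat": {"pharynx", "passage"},
}


def expansion_set(phrase):
    """Union of each word in the phrase with its related words."""
    words = phrase.lower().split()
    expanded = set(words)
    for word in words:
        expanded |= TOY_RELATIONS.get(word, set())
    return expanded


def sim_value(phrase_a, phrase_b):
    """Assumed similarity: Jaccard overlap of the two ExpansionSets."""
    set_a, set_b = expansion_set(phrase_a), expansion_set(phrase_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


print(sim_value("ocean", "sea"))
print(sim_value("ocean", "throat"))
```

With a real Wordnet, `TOY_RELATIONS` would be replaced by synset, hypernym and hyponym lookups, and the score would be compared against the 0.9 threshold mentioned above.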
When creating a reverse dictionary, we skip entries with multiword expressions in the translation. Based on our experiments, we have found that this approach is successful and hence may be an effective way to automatically create a new bilingual dictionary from an existing one. Figure 1 presents an example of generating entries for the reverse dictionary.
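One possible reading of the reversal step can be sketched as follows: entries are swapped directly, multiword translations are skipped, and translations whose similarity reaches the threshold share their source words. The function name `reverse_with_similarity`, the stub `toy_sim` and the multiword gloss are hypothetical; the stub simply returns the 0.98 value quoted for "ocean" and "sea" in Figure 1 and is not a real similarity measure.

```python
from collections import defaultdict


def toy_sim(a, b):
    """Stub similarity: 0.98 for the ocean/sea pair, 0.0 otherwise."""
    return 0.98 if {a, b} == {"ocean", "sea"} else 0.0


def reverse_with_similarity(entries, sim, threshold=0.9):
    """Reverse (s, POS, t) entries into sorted (t, POS, s) entries."""
    # Step 1: direct reversal, skipping multiword translations.
    reverse = defaultdict(set)  # (t, POS) -> set of source words
    for s, pos, t in entries:
        if len(t.split()) > 1:
            continue
        reverse[(t, pos)].add(s)

    # Step 2: similar translations with the same POS share source words.
    merged = {key: set(words) for key, words in reverse.items()}
    keys = list(reverse)
    for i, (t1, pos1) in enumerate(keys):
        for t2, pos2 in keys[i + 1:]:
            if pos1 == pos2 and sim(t1, t2) >= threshold:
                merged[(t1, pos1)] |= reverse[(t2, pos2)]
                merged[(t2, pos2)] |= reverse[(t1, pos1)]

    return sorted((t, pos, s) for (t, pos), words in merged.items() for s in words)


dict_chr_eng = [
    ("amequohi", "n", "ocean"),
    ("ustalanali", "n", "sea"),
    ("ayvtseni", "n", "throat"),
    ("ayvtseni", "n", "front of neck"),  # hypothetical multiword gloss, skipped
]

dict_eng_chr = reverse_with_similarity(dict_chr_eng, toy_sim)
for entry in dict_eng_chr:
    print(entry)
```

In this toy run, "ocean" and "sea" each receive both Cherokee words, which matches the outcome described for Figure 1.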

Building bilingual dictionaries to/from additional languages
We propose an approach using public Wordnets and MT to create new bilingual dictionaries Dict(S,T) from an input dictionary Dict(S,I). As previously mentioned, I is English in our experiments. Dict(S,T) translates a word in an endangered language S to a word or multiword expression in a target language T. In particular, we create bilingual dictionaries for an endangered language S from a given dictionary Dict(S,eng). Figure 2 presents the approach to create new bilingual dictionaries. For each entry pair (s,e) in a given dictionary Dict(S,eng), we find all synonyms of the word e to create a list of English synonyms, $SYN_{eng}$, which is obtained from the PWN. Then, we find all synonyms of the words belonging to $SYN_{eng}$ in several non-English languages to generate $SYN_L$, $L \in \{fin, fra, jpn\}$. $SYN_L$ in the language L is extracted from the publicly available Wordnet in language L linked to the PWN. Next, translation candidates are generated by translating all words in $SYN_L$, $L \in \{eng, fin, fra, jpn\}$, to the target language T using an MT. For each word s, we may have many candidates. The rank of a candidate is computed by dividing its occurrence count by the total number of candidates. A translation candidate with a higher rank is more likely to be a correct translation in the target language, and a candidate is considered a correct translation of the source word if its rank is greater than a threshold. Figure 3 shows an example of creating entries for Dict(chr,vie), where vie is Vietnamese, from Dict(chr,eng).

Figure 1: Example of creating entries for a reverse dictionary Dict(eng,chr) from Dict(chr,eng). The simValue between the words "ocean" and "sea" is 0.98, which is greater than the threshold of 0.90. Therefore, the words "ocean" and "sea" in English are hypothesized to have both meanings "amequohi" and "ustalanali" in Cherokee. We add these entries to Dict(eng,chr).

Figure 3: Example of generating new entries for Dict(chr,vie) from Dict(chr,eng). The word "ayvtseni" in chr is translated to "throat" in eng. We find all synonyms for "throat" in English to generate $SYN_{eng}$ and all synonyms in fin, fra and jpn for all words in $SYN_{eng}$. Then, we translate all words in all $SYN_L$ to vie and rank them. According to the rank calculations, the best translations of "ayvtseni" in chr are the words "cổ họng" and "họng" in vie.
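The ranking of translation candidates described in this section can be sketched as below. The candidate list is made up for illustration (it is not real MT output), the threshold value is arbitrary, and the sketch assumes that "the total number of candidates" counts every occurrence rather than distinct candidates; the function names are hypothetical.

```python
from collections import Counter


def rank_candidates(candidates):
    """Rank = occurrence count of a candidate / total number of candidates."""
    total = len(candidates)
    counts = Counter(candidates)
    return {cand: count / total for cand, count in counts.items()}


def select_translations(candidates, threshold):
    """Keep the candidates whose rank is greater than the threshold."""
    ranks = rank_candidates(candidates)
    return sorted(cand for cand, rank in ranks.items() if rank > threshold)


# Made-up MT outputs for the synonyms of "throat" in eng, fin, fra and jpn.
candidates = ["cổ họng", "họng", "cổ họng", "họng", "cổ họng", "yết hầu"]

ranks = rank_candidates(candidates)
best = select_translations(candidates, threshold=0.3)
print(ranks)
print(best)
```

Here the ranks are 3/6, 2/6 and 1/6, so a threshold of 0.3 keeps the two most frequent candidates.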

Creating multilingual thesauruses
As previously mentioned, we want to generate a multilingual thesaurus THS composed of endangered and resource-rich languages. For example, we build a thesaurus encompassing an endangered language S and eng, fin, fra and jpn. Our thesaurus contains a list of entries. Every entry has a unique ID. Each entry is a 7-tuple: ID, $SYN_S$, $SYN_{eng}$, $SYN_{fin}$, $SYN_{fra}$, $SYN_{jpn}$ and POS. Each $SYN_L$ contains words that have the same sense in language L. All $SYN_L$, $L \in \{S, eng, fin, fra, jpn\}$, with the same ID have the same sense.
This section presents the initial steps in constructing multilingual thesauruses using Wordnets and the bilingual dictionaries we create. The approach to create a multilingual thesaurus encompassing an endangered language and several resource-rich languages is presented in Figure 4 and Algorithm 1. First, we extract $SYN_L$ in resource-rich languages from Wordnets. To extract $SYN_{eng}$, $SYN_{fin}$, $SYN_{fra}$ and $SYN_{jpn}$, we use PWN and Wordnets linked to the PWN provided by the Open Multilingual Wordnet project (http://compling.hss.ntu.edu.sg/omw/) (Bond and Foster, 2013): FinnWordnet (FWN) (Lindén, 2010), WOLF (WWN) (Sagot and Fišer, 2008) and JapaneseWordnet (JWN) (Isahara et al., 2008). For each offset-POS, we extract its corresponding synsets from PWN, FWN, WWN and JWN to generate $SYN_{eng}$, $SYN_{fin}$, $SYN_{fra}$ and $SYN_{jpn}$ (lines 7-10). The POS of the entry is the POS extracted from the offset-POS (line 5). Since these Wordnets are aligned, a specific offset-POS retrieves synsets that are equivalent sense-wise. Then, we translate all $SYN_L$ to the given endangered language S using the bilingual dictionaries we created in the previous section (lines 11-14). Finally, we rank translation candidates and add the correct translations to $SYN_S$ (lines 15-19). The rank of a candidate is computed by dividing its occurrence count by the total number of candidates. If a candidate has a rank value greater than a threshold, we accept it as a correct translation and add it to $SYN_S$.
For example, we extract the words belonging to offset-POS "09426788-n" in PWN, FWN, WWN and JWN and add them to the corresponding $SYN_L$. The POS of this entry is "n", which denotes a noun. Next, we use the bilingual dictionaries we created to translate all words in $SYN_{eng}$, $SYN_{fin}$, $SYN_{fra}$ and $SYN_{jpn}$ to the given endangered language, Cherokee, and rank them. According to the rank calculations, the best Cherokee translation is the word "ustalanali".
The new entry added to the multilingual thesaurus is presented in Figure 6.
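The steps above can be sketched as follows. This is not a reproduction of Algorithm 1: the names `ThesaurusEntry` and `build_thesaurus`, the in-memory toy Wordnets and the toy translation tables are all assumptions made for illustration, and the words listed under each language are illustrative rather than actual extractions from PWN, FWN, WWN or JWN.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    entry_id: str                            # here, the offset-POS key
    pos: str                                 # POS taken from the offset-POS
    syn: dict = field(default_factory=dict)  # language code -> list of words


def build_thesaurus(wordnets, to_endangered, endangered, threshold):
    """wordnets: lang -> {offset_pos: [words]};
    to_endangered: lang -> {word: [translations into the endangered language]}."""
    offsets = sorted(set().union(*(wn.keys() for wn in wordnets.values())))
    thesaurus = []
    for offset_pos in offsets:
        entry = ThesaurusEntry(offset_pos, offset_pos.rsplit("-", 1)[1])
        candidates = []
        for lang, wordnet in wordnets.items():
            words = wordnet.get(offset_pos, [])
            entry.syn[lang] = list(words)
            for word in words:
                candidates.extend(to_endangered.get(lang, {}).get(word, []))
        # Rank = occurrence count / total number of candidates.
        total = len(candidates)
        counts = Counter(candidates)
        entry.syn[endangered] = sorted(
            cand for cand, count in counts.items() if count / total > threshold
        )
        thesaurus.append(entry)
    return thesaurus


# Toy aligned Wordnets and toy dictionaries into Cherokee (illustrative only).
wordnets = {
    "eng": {"09426788-n": ["sea"]},
    "fin": {"09426788-n": ["meri"]},
    "fra": {"09426788-n": ["mer"]},
    "jpn": {"09426788-n": ["海"]},
}
to_chr = {
    "eng": {"sea": ["ustalanali", "amequohi"]},
    "fin": {"meri": ["ustalanali"]},
    "fra": {"mer": ["ustalanali"]},
    "jpn": {"海": ["ustalanali"]},
}

thesaurus = build_thesaurus(wordnets, to_chr, endangered="chr", threshold=0.3)
print(thesaurus[0])
```

In this toy run "ustalanali" occurs four times among five candidates (rank 0.8) and "amequohi" once (rank 0.2), so only the former passes the assumed threshold of 0.3.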

Experimental results
Ideally, evaluation should be performed by volunteers who are fluent in both the source and the destination languages. However, for evaluating the dictionaries and thesauruses we created, we could not recruit any individuals who are experts in both corresponding languages. We are in the process of finding volunteers who are fluent in both languages for some selected resources we create.

Datasets used
We start with two bilingual dictionaries, Dict(chr,eng) and Dict(chy,eng), that we obtain from Web pages. These are unidirectional bilingual dictionaries. The numbers of entries in Dict(chr,eng) and Dict(chy,eng) are 3,199 and 28,097, respectively. For entries in these input dictionaries without POS information, our algorithm chooses the best POS of the English word, which may lead to wrong translations. The Microsoft Translator Java API is used as another main resource. We were given free access to this API. We could not obtain free access to the API for the Google Translator.
The synonym lexicons are the synsets of PWN, FWN, JWN and WWN.

Creating reverse bilingual dictionaries
From Dict(chr,eng) and Dict(chy,eng), we create two reverse bilingual dictionaries: Dict(eng,chr) with 3,538 entries and Dict(eng,chy) with 28,072 entries. Next, we reverse the reverse dictionaries we produce to generate new reverse-of-the-reverse (RR) dictionaries, then integrate the RR dictionaries with the input dictionaries to increase the sizes of the dictionaries. During the process of generating new reverse dictionaries, we already computed the semantic similarity values among words to find words with the same meanings. We therefore use a simple approach called the Direct Reversal (DR) approach in (Lam and Kalita, 2013) to create these RR dictionaries. To create a reverse dictionary Dict(T,S), the DR approach takes each entry <s,POS,t> in the input dictionary Dict(S,T) and simply swaps the positions of s and t. The new entry <t,POS,s> is added to Dict(T,S). Figure 7 presents an example. The numbers of entries in the integrated dictionaries Dict(chr,eng) and Dict(chy,eng) are 3,618 and 47,529, respectively. Thus, the numbers of entries in the original dictionaries have "magically" increased by 13.1% and 69.21%, respectively.

Figure 7: Given a dictionary Dict(chy,eng), we create a new Dict(eng,chy) using the DRwS approach of (Lam and Kalita, 2013). Then, we create a new Dict(chy,eng) using the DR approach from the created dictionary Dict(eng,chy). Finally, we integrate the generated dictionary Dict(chy,eng) with the input dictionary Dict(chy,eng) to create a new dictionary Dict(chy,eng) with a greater number of entries.
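The DR step and the integration with the input dictionary can be sketched as below. The function names and the toy entries are illustrative assumptions; the reverse dictionary is written out by hand to stand in for the output of the DRwS step. The last line recomputes the growth for Cherokee from the entry counts reported above.

```python
def direct_reversal(entries):
    """DR: swap s and t in every entry <s, POS, t> to get <t, POS, s>."""
    return {(t, pos, s) for (s, pos, t) in entries}


def integrate(original, rr_entries):
    """Union of the input dictionary and the reverse-of-the-reverse entries."""
    return set(original) | set(rr_entries)


def growth_percent(before, after):
    """Relative increase in the number of entries, in percent."""
    return 100.0 * (after - before) / before


# Toy input Dict(chr,eng) and a hand-written stand-in for the DRwS output.
dict_chr_eng = {("amequohi", "n", "ocean"), ("ustalanali", "n", "sea")}
dict_eng_chr = {
    ("ocean", "n", "amequohi"),
    ("ocean", "n", "ustalanali"),
    ("sea", "n", "amequohi"),
    ("sea", "n", "ustalanali"),
}

rr_chr_eng = direct_reversal(dict_eng_chr)
integrated = integrate(dict_chr_eng, rr_chr_eng)
print(len(dict_chr_eng), len(integrated))
print(round(growth_percent(3199, 3618), 1))
```

The toy dictionary grows from two to four entries because the similarity-based reverse dictionary already contains the cross entries that the plain swap brings back.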

Creating additional bilingual dictionaries
We can create dictionaries from chr or chy to any non-eng language supported by the Microsoft Translator, e.g., Arabic (arb), Chinese (cht), Catalan (cat), Danish (dan), German (deu), Hmong Daw (mww), Indonesian (ind), Malay (zlm), Thai (tha), Spanish (spa) and vie. Table 2 presents the number of entries in the dictionaries we create. These dictionaries contain only the translations with the highest rank for each word.
Although we have not evaluated the entries in the particular dictionaries in Table 1, evaluations of dictionaries for non-endangered languages created with the same approach give us confidence that these dictionaries are of acceptable, if not very good, quality.

Creating multilingual thesauruses
We construct two multilingual thesauruses: $THS_1$ (chr, eng, fin, fra, jpn) and $THS_2$ (chy, eng, fin, fra, jpn). The numbers of entries in $THS_1$ and $THS_2$ are 5,073 and 10,046, respectively. The thesauruses we construct contain words with rank values above the average. A similar approach used to create Wordnet synsets (Lam et al., 2014) has produced excellent results. We believe that the thesauruses reported in this paper are of acceptable quality.

How to evaluate
Currently, we are not able to evaluate the dictionaries and thesauruses we create. In the future, we expect to evaluate our work using two methods. First, we will use the standard approach, human evaluation, to evaluate the resources as previously mentioned. Second, we will try to find an additional bilingual dictionary translating from an endangered language S (viz., chr or chy) to another "resource-rich" non-English language (viz., fin or fra), and then create a new dictionary translating from S to English using the approaches we have introduced. We plan to evaluate the new dictionary we create, say Dict(chr,eng), against the existing dictionary Dict(chr,eng).
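The second evaluation method could, for instance, be scored by entry overlap between the generated and the existing dictionary. The paper does not name a metric, so the precision and recall computed below, the function name `compare_dictionaries` and the toy entries are all assumptions made for illustration.

```python
def compare_dictionaries(generated, reference):
    """Return (precision, recall) of generated entries against a reference."""
    gen, ref = set(generated), set(reference)
    overlap = gen & ref
    precision = len(overlap) / len(gen) if gen else 0.0
    recall = len(overlap) / len(ref) if ref else 0.0
    return precision, recall


# Toy data: a generated Dict(chr,eng) versus an existing Dict(chr,eng).
generated = {
    ("amequohi", "ocean"),
    ("amequohi", "sea"),
    ("ustalanali", "sea"),
    ("ayvtseni", "neck"),
}
reference = {
    ("amequohi", "ocean"),
    ("ustalanali", "sea"),
    ("ayvtseni", "throat"),
}

precision, recall = compare_dictionaries(generated, reference)
print(precision, recall)
```

Exact-match overlap is a strict measure: a generated entry that is a valid synonym of a reference translation still counts as a miss, so such scores would be a lower bound on quality.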

Conclusion and future work
We examine approaches to create bilingual dictionaries and thesauruses for endangered languages from only one input dictionary, publicly available Wordnets and an MT. Taking advantage of available Wordnets linked to the PWN helps reduce ambiguities in the dictionaries we create. We run experiments with two endangered languages: Cherokee and Cheyenne. We have also experimented with two additional endangered languages from Northeast India, Dimasa and Karbi, spoken by about 115,000 and 492,000 people, respectively. We believe that our research has the potential to increase the number of lexical resources for languages which do not have many existing resources to begin with. We are in the process of creating reverse dictionaries from the bilingual dictionaries we have already created. We are also in the process of creating a Website where all the dictionaries and thesauruses we create will be available, along with a user-friendly interface to disseminate these resources to the wider public as well as to obtain feedback on individual entries. We will solicit feedback from communities that use the languages as mother tongues. Our goal will be to use this feedback to improve the quality of the dictionaries and thesauruses. Some of the resources we created can be downloaded from http://cs.uccs.edu/~linclab/projects.html