XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment

Cross-lingual named-entity lexica are an important resource for multilingual NLP tasks such as machine translation and cross-lingual wikification. While knowledge bases contain a large number of entities in high-resource languages such as English and French, corresponding entities for lower-resource languages are often missing. To address this, we propose Lexical-Semantic-Phonetic Align (LSP-Align), a technique to automatically mine cross-lingual entity lexica from web data. We demonstrate that LSP-Align outperforms baselines at extracting cross-lingual entity pairs and use it to mine 164 million entity pairs across 120 different languages aligned with English. We release these cross-lingual entity pairs along with the massively multilingual tagged named entity corpus as a resource to the NLP community.


Introduction
Named entities are references in natural text to real-world objects such as persons, locations, or organizations that can be denoted with a proper name. Recognizing and handling these named entities in many languages is a difficult, yet crucial, step toward language-agnostic text understanding and multilingual natural language processing (NLP) (Sekine and Ranchhod, 2009). As such, cross-lingual named entity lexica can be invaluable resources for making tasks such as entity linking, named entity recognition (Ren et al., 2016b,a), and information and knowledge base construction (Tao et al., 2014) inherently multilingual. However, the coverage of many such multilingual entity lexica (e.g., Wikipedia titles) is less complete for lower-resource languages, and approaches to automatically generate them under-perform due to the poor performance of low-resource taggers (Feng et al., 2018; Cotterell and Duh, 2017).
To perform low-resource NER, previous efforts have applied word alignment techniques to project available labels to other languages. Kim et al. (2010) apply heuristic approaches with alignment correction using an alignment dictionary of entity mentions. Das and Petrov (2011) introduced a label propagation technique that creates a tag lexicon for the target language, while Wang and Manning (2014) instead projected model expectations rather than labels, thus transferring word-boundary uncertainty. Additional work jointly performs word alignment while training bilingual name tagging (Wang et al., 2013); however, this method assumes the availability of named entity taggers in both languages. Other methods have leveraged bilingual embeddings for projection (Ni et al., 2017; Xie et al., 2018).
In this work, we propose using named-entity projection to automatically curate a large cross-lingual entity lexicon for many language pairs. As shown in Figure 1, we construct this resource by performing NER in a higher-resource language, then projecting the entities onto text in a lower-resource language using word-alignment models.
Our main contribution is the construction and release of a large web-mined cross-lingual entity dataset that will be beneficial to the NLP community. Our proposed alignment model, LSP-Align, combines lexical, semantic, and phonetic signals to extract higher-quality cross-lingual entity pairs, as verified on a ground-truth entity pair set. With LSP-Align, we mined over 164M distinct cross-lingual entity pairs spanning 120 language pairs, and we freely release the XLEnt dataset in the hope that it spurs further work in cross-lingual NLP.

Preliminaries
We formally define an entity collection as a collection of extracted text spans tied to named entity mentions. We denote these named entity mentions as $E = \{e_i\}_{i=1}^{N}$, where $e_i$ is the $i$-th named entity in the mention collection and $N$ is the size of $E$.
Cross-lingual entity lexicon creation seeks to create two entity collections $E_1$ and $E_2$ in a source and target language respectively. These two collections should be generated such that for each entity mention $e \in E_1$ in the source language, there is a corresponding named entity $e' \in E_2$ in the target language such that $e$ and $e'$ refer to the same named entity in their respective languages.

Mining Cross-lingual Entities
We introduce our approach to automatically extract cross-lingual entity pairs from large mined corpora.

High-Resource NER
We begin with large collections of comparable bitexts mined from large multilingual web corpora (El-Kishky et al., 2020b). In particular, we select three mined web corpora: 1) CCAligned (El-Kishky et al., 2020a), 2) WikiMatrix (Schwenk et al., 2019a), and 3) CCMatrix (Schwenk et al., 2019b), due to the wide diversity of language pairs available in these mined corpora. We select language pairs of the form English-Target and tag each English sentence with named entity tags (Ramshaw and Marcus, 1999) using a pretrained NER tagger provided in the Stanza NLP toolkit (Qi et al., 2020). This NER model adopts the contextualized string representation-based tagger proposed by Akbik et al. (2018) and utilizes forward and backward character-level LSTM language models. At tagging time, the representations from both language models at the end of each word position are combined with word embeddings and fed into a standard Bi-LSTM sequence tagger with a conditional-random-field decoder.
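The BIO tagging scheme of Ramshaw and Marcus (1999) mentioned above marks each token as beginning (B-), inside (I-), or outside (O) an entity. As an illustrative sketch (not the Stanza implementation), a minimal helper that collapses token-level BIO tags into (mention, type) spans might look like:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens (e.g. B-PER, I-PER, O) into entity spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = {"type": tag[2:], "tokens": [tok]}
        elif tag.startswith("I-") and current and tag[2:] == current["type"]:
            current["tokens"].append(tok)
        else:  # O tag, or an I- tag with no compatible open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(" ".join(s["tokens"]), s["type"]) for s in spans]
```

Each returned span then serves as a candidate English entity mention to be projected onto the target-language sentence.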

Entity Projection via Word Alignment
We introduce three approaches for projecting entities and LSP-Align which combines all three.
Lexical Alignment
FastAlign (Dyer et al., 2013) performs unsupervised word alignment over the full collection of mined bitexts using an expectation-maximization-based algorithm. While FastAlign achieves state-of-the-art word-alignment quality, its reliance on lexical co-occurrences means it may misalign low-frequency entities.
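FastAlign itself implements a fast reparameterization of IBM Model 2; as a rough illustration of the underlying expectation-maximization idea, a toy IBM Model 1 sketch (not FastAlign's actual algorithm) that estimates lexical translation probabilities from sentence pairs could look like:

```python
from collections import defaultdict

def ibm1_em(bitext, iters=10):
    """Toy IBM Model 1 EM: estimate p(t|s) from a list of
    (source_tokens, target_tokens) sentence pairs."""
    tgt_vocab = {w for _, t_sent in bitext for w in t_sent}
    # uniform initialization of translation probabilities
    p = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for s_sent, t_sent in bitext:
            for t in t_sent:
                # E-step: distribute each target word's mass over source words
                norm = sum(p[(s, t)] for s in s_sent)
                for s in s_sent:
                    frac = p[(s, t)] / norm
                    count[(s, t)] += frac
                    total[s] += frac
        # M-step: renormalize expected counts into probabilities
        for (s, t), c in count.items():
            p[(s, t)] = c / total[s]
    return p
```

On a toy bitext such as ("la maison" / "the house", "la fleur" / "the flower"), EM gradually concentrates probability mass so that frequently co-occurring pairs like ("la", "the") dominate, which also illustrates why rare entity words receive unreliable estimates.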

Semantic Alignment
We leverage multilingual representations (embeddings) from the LASER toolkit (Artetxe and Schwenk, 2019) to align words that are semantically close. We propose a simple greedy word alignment algorithm guided by a distance function between words:

$d_{\text{sem}}(w_s, w_t) = 1 - \cos(v_{w_s}, v_{w_t})$ (1)

Algorithm 1: Distance Word Alignment
where Equation 1 shows that the semantic distance between a source word $w_s$ and a target word $w_t$ is simply one minus the cosine similarity between $v_{w_s}$ and $v_{w_t}$, the LASER vector representations of $w_s$ and $w_t$ respectively. As shown in Algorithm 1, we take each source-target sentence pair and perform alignment between their tokens guided by the semantic distances between words. Since source and target sentences may be of different lengths, tokens in the shorter sentence may be aligned with multiple tokens in the longer one. Unlike lexical alignment with FastAlign, our distance-based alignment is deterministic and requires only a single pass through the bitexts.
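One simple variant of such a distance-guided greedy aligner (an illustrative sketch, not the exact Algorithm 1) aligns each target token to its nearest source token under a caller-supplied distance function, naturally allowing source tokens to be reused when sentence lengths differ:

```python
def greedy_align(src, tgt, dist):
    """Greedily align each target token to its nearest source token
    under the supplied distance function dist(src_word, tgt_word).
    Returns a list of (source_index, target_index) pairs."""
    alignment = []
    for j, t in enumerate(tgt):
        i = min(range(len(src)), key=lambda i: dist(src[i], t))
        alignment.append((i, j))
    return alignment
```

Plugging in $d_{\text{sem}}$ over LASER embeddings yields the semantic alignment; the same routine is reused below with the phonetic distance.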

Phonetic Alignment
Recognizing that in many cases, phonetic transliterations are the avenue by which proper names travel between languages, we propose using phonetic signals to perform alignment and match named entities.
To align words based on their phonetic similarity, we leverage the distances between their transliterations and align words between the source and target that are "close" in this phonetic space. We adopt an unsupervised transliteration system developed by Chen and Skiena (2016) to transliterate between source and target languages and utilize Levenshtein distance (i.e., edit distance) (Wagner and Fischer, 1974) to calculate distances between transliterated words:

$d_{\text{phon}}(w_s, w_t) = \min\left(\frac{\text{lev}(\tau(w_s), w_t)}{\max(|\tau(w_s)|, |w_t|)}, \frac{\text{lev}(w_s, \tau(w_t))}{\max(|w_s|, |\tau(w_t)|)}, \frac{\text{lev}(w_s, w_t)}{\max(|w_s|, |w_t|)}\right)$ (2)

where $\text{lev}(\cdot, \cdot)$ is the Levenshtein distance between two strings and $\tau(w)$ is the transliteration of word $w$ into the other word's language. Equation 2 selects the minimum normalized distance among the source transliteration, the target transliteration, and the untransliterated strings to guide Algorithm 1 in greedy word alignment. Once again, only a single pass over the data is required for alignment.
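A minimal sketch of this normalized phonetic distance, assuming the transliteration functions are supplied by the caller (here plain Python callables standing in for the Chen and Skiena (2016) system):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def phonetic_dist(ws, wt, translit_s2t, translit_t2s):
    """Minimum normalized edit distance over the source transliteration,
    the target transliteration, and the untransliterated strings."""
    def norm(a, b):
        return levenshtein(a, b) / max(len(a), len(b), 1)
    return min(norm(translit_s2t(ws), wt),
               norm(ws, translit_t2s(wt)),
               norm(ws, wt))
```

This distance can be passed directly to the greedy aligner in place of the semantic distance.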

Estimating Translation Probabilities
Leveraging lexical alignment (i.e., FastAlign) alongside semantic and phonetic alignment yields three potential word alignments for a bitext collection. For alignment method $m$, we can iterate through the alignments and compute the counts of source-to-target $(s, t)$ word pairings; we denote this count $c_m(s, t)$. We can estimate the maximum-likelihood translation probability from $s$ to $t$ given by alignment method $m$ as follows:

$p_m(t \mid s) = \frac{c_m(s, t)}{\sum_{t'} c_m(s, t')}$ (3)

Using Equation 3, we can compute the translation probabilities for the lexical, semantic, and phonetic alignments, which we use in our LSP-Align model.
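The count-normalization step can be sketched as a small helper that turns a list of aligned word pairs from any one alignment method into conditional translation probabilities:

```python
from collections import Counter, defaultdict

def translation_probs(aligned_pairs):
    """MLE translation probabilities p(t|s) from aligned (s, t) word
    pairs: count(s, t) / sum over t' of count(s, t')."""
    count = Counter(aligned_pairs)
    total = defaultdict(int)
    for (s, t), c in count.items():
        total[s] += c
    return {(s, t): c / total[s] for (s, t), c in count.items()}
```

Running this once per alignment method yields the three probability tables consumed by the combined model.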

LSP Named-entity Projection
We describe LSP-Align, which combines the three alignment signals for better entity-pair mining. As described in Algorithm 2, the generative process takes a source sentence and translates it into the target sentence by, for each position in the target sentence, drawing an alignment variable and a translation mechanism (lexical, semantic, or phonetic), then drawing a translated word from the corresponding translation distribution.
The graphical model for LSP-Align, depicted in Figure 2, is similar to IBM-1 (Brown et al., 1993). The main difference is that, in addition to latent alignment variables $a_j$, we introduce latent translation mechanisms $m_j$. The translation distribution $p_m(t \mid s)$ is chosen based on the latent alignment and mechanism variables. As we demonstrate in Equation 3, we can leverage the alignments for each alignment signal to estimate $p_m(t \mid s)$ for each translation distribution. Using these estimated distributions in our model, we can infer the alignment variables as follows:

$\hat{a}_j = \arg\max_i \sum_{m} p(m)\, p_m(t_j \mid s_i)$

where we assign the most probable alignment variable to each target word after marginalizing over the latent translation mechanisms (lexical, semantic, phonetic), which, for simplicity, we give equal probability.
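The inference step, marginalizing over equally weighted mechanisms, can be sketched as follows; for illustration, each mechanism's translation table is assumed to be a plain dictionary mapping (s, t) pairs to probabilities:

```python
def lsp_align(src, tgt, mechanisms):
    """For each target word, pick the source position with the highest
    translation probability after marginalizing over the alignment
    mechanisms, which are weighted uniformly (the equal-prior assumption).
    mechanisms: list of dicts mapping (s, t) -> p_m(t|s)."""
    prior = 1.0 / len(mechanisms)
    alignment = []
    for j, t in enumerate(tgt):
        def score(i):
            return sum(prior * m.get((src[i], t), 0.0) for m in mechanisms)
        best_i = max(range(len(src)), key=score)
        alignment.append((best_i, j))
    return alignment
```

Because each mechanism votes through its own probability table, an entity pair missed by the lexical table (e.g. a rare name) can still be recovered by the semantic or phonetic one.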

Experiments & Results
Datasets We utilize a gold-standard evaluation lexicon created by Pan et al. (2017) that leverages eight named parallel entity corpora. We select nine languages spanning a diverse range of resource availability, language families, and scripts for evaluation.

Evaluation Protocol
We evaluate the performance of the methods using the commonly used fuzzy-F1 score (Tsai and Roth, 2018), defined as the harmonic mean of the fuzzy precision and fuzzy recall scores. This metric is based on the longest common subsequence (LCS) between a gold and a mined entity, and has been used for several years in the NEWS transliteration workshops (Li et al., 2009; Banchs et al., 2015). The fuzzy precision and recall between a predicted string $p$ and the correct string $g$ are computed as follows:

$\text{fuzzy-precision}(p, g) = \frac{|\text{LCS}(p, g)|}{|p|}, \qquad \text{fuzzy-recall}(p, g) = \frac{|\text{LCS}(p, g)|}{|g|}$

where $\text{LCS}(\cdot, \cdot)$ is the longest common subsequence between two strings.
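The LCS-based metric can be sketched directly from these definitions:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def fuzzy_f1(pred, gold):
    """Harmonic mean of fuzzy precision (LCS / |pred|) and
    fuzzy recall (LCS / |gold|)."""
    lcs = lcs_len(pred, gold)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(pred), lcs / len(gold)
    return 2 * p * r / (p + r)
```

An exact match scores 1.0, while partial overlaps degrade gracefully, which is why the metric tolerates minor transliteration differences.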

Cross-lingual Entity Extraction
We take a small sample of parallel sentences for each language, mine entity pairs using each projection technique, and compute fuzzy-F1 using the gold standard as a reference. As seen in Table 1, lexical alignment outperforms semantic alignment and performs similarly to phonetic alignment, with phonetic performing better on low-resource languages and lexical performing better on high-resource ones. However, LSP-Align consistently outperforms or matches lexical alignment, showing that using all signals yields superior named-entity projection. Separating the evaluated entities by their frequency in the web-data bitexts (low = 0-3, mid = 4-10, high = 11+) shows LSP-Align outperforming FastAlign when the entity is infrequent in the corpus. Since entity frequency follows a long-tailed distribution, most entity mentions are infrequent.
In Table 2, we evaluate the quality of our full XLEnt dataset. As a general trend, the quality of extracted entities is high-resource > mid-resource > low-resource. This is intuitive as there are more parallel sentences that are likely better aligned on a sentence-level yielding better word alignments.
In Figure 4, we show that filtering on a higher mined frequency improves the overall quality of the entity pairs (albeit yielding a smaller lexicon). This is also intuitive, as the redundancy of an entity pair being mined multiple times in different sentence pairs signals that it is likely a true translation. This suggests that tuning the frequency threshold can be a useful tool to control the quality of the resultant entity lexicon.
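The frequency-threshold filtering described here can be sketched as a simple pass over the mined pairs:

```python
from collections import Counter

def filter_by_frequency(mined_pairs, min_count=2):
    """Keep only entity pairs mined at least min_count times across
    sentence pairs; redundancy across sentence pairs is treated as
    evidence that the pair is a true translation."""
    counts = Counter(mined_pairs)
    return {pair: c for pair, c in counts.items() if c >= min_count}
```

Raising `min_count` trades lexicon size for precision, matching the trend described above.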

Conclusion
We propose a technique that combines lexical alignment, semantic alignment, and phonetic alignment into a unified alignment model. We demonstrate that this unified model extracts cross-lingual entity pairs better than any single alignment signal. Leveraging this model, we automatically curate a large cross-lingual entity lexicon covering 120 languages paired with English, which we freely release to the community. Accompanying this lexicon, we release a large multilingual collection of sentences tagged via named-entity projection. We hope these resources facilitate future multilingual NLP work such as multilingual NER, multilingual entity linking, and multilingual knowledge base construction.