Is “moby dick” a Whale or a Bird? Named Entities and Terminology in Speech Translation

Automatic translation systems are known to struggle with rare words. Among these, named entities (NEs) and domain-specific terms are crucial, since errors in their translation can lead to severe meaning distortions. Despite their importance, previous speech translation (ST) studies have neglected them, also due to the dearth of publicly available resources tailored to their specific evaluation. To fill this gap, we i) present the first systematic analysis of the behavior of state-of-the-art ST systems in translating NEs and terminology, and ii) release NEuRoparl-ST, a novel benchmark built from European Parliament speeches annotated with NEs and terminology. Our experiments on the three language directions covered by our benchmark (en→es/fr/it) show that ST systems correctly translate 75–80% of terms and 65–70% of NEs, with very low performance (37–40%) on person names.


Introduction
The translation of rare words is one of the main challenges for neural machine translation (NMT) models (Sennrich et al., 2016; Koehn and Knowles, 2017). Among rare words, named entities (NEs) and terminology are particularly critical: not only are they important to understand the meaning of a sentence (Li et al., 2013), but they are also difficult to handle due to the small number of valid translation options. While common words can be rendered in the target language with synonyms or paraphrases, NEs and terminology offer less expressive freedom, which is typically limited to one valid option. Under these conditions, translation errors often result in blatant (meaningless, hilarious, or even offensive) errors, which jeopardize users' trust in the translation system. One example is "moby dick" (in lower case, as in the typical output of a speech recognition system): Google Translate (accessed on 27 April 2021) returns mazikó poulí (massive bird) for Greek, while the translation contains profanities for other languages like Hungarian or Italian.
The problem is even more challenging in automatic speech recognition (ASR) and speech-to-text translation (ST), where a lookup into knowledge graphs (KGs) or dictionaries is not feasible due to the different modality of the input (an audio signal rather than text). As regards NEs, the few existing studies (Ghannay et al., 2018; Caubrière et al., 2020) are all limited to ASR, for which two benchmarks are available (Galibert et al., 2014; Yadav et al., 2020), while suitable benchmarks do not even exist for ST. The situation is similar for terminology: a few annotated test sets exist for MT (Dinu et al., 2019; Scansani et al., 2019; Bergmanis and Pinnis, 2021), but none for ST, which so far has remained unexplored.
In light of the above, the contribution of this work is twofold: (1) we present the first investigation on the behavior of state-of-the-art ST systems in translating NEs and terms, discussing their weaknesses and providing baseline results for future comparisons; (2) we release the annotated data that made our study possible. Our test set, NEuRoparl-ST, is derived from Europarl-ST (Iranzo-Sánchez et al., 2020) and covers three language pairs: en→es/fr/it. It relies on the Europarl-ST (audio, transcript, translation) triplets and enriches their textual portions with NE and terminology annotations. Besides being the first benchmark of this type for ST, it can also be used for the evaluation of NE/terminology recognition (ASR) and translation (MT). The dataset is available at: ict.fbk.eu/neuroparl-st/.

Speech Translation Models
Our goal is to assess the capability of state-of-the-art ST systems to properly translate NEs and terminology present in an utterance. To this aim, we compare instances of the two main approaches. One is the traditional cascade approach (Stentiford and Steer, 1988; Waibel et al., 1991), which consists of a pipeline where an ASR model produces a transcript of the input audio and an MT model generates its translation. The other is the so-called direct approach (Bérard et al., 2016; Weiss et al., 2017), which relies on a single neural network that maps the audio into target-language text, bypassing any intermediate symbolic representation. The two approaches have inherent strengths and weaknesses (Sperber and Paulik, 2020). Cascade solutions can exploit sizeable datasets for the ASR and MT sub-components, but rely on a complex architecture prone to error propagation. Direct models suffer from the paucity of training data, but avoid error propagation and can take advantage of unmediated access to audio information (e.g. prosody) during the translation phase. In recent years, after a long dominance of the cascade paradigm, the initially huge performance gap between the two approaches has gradually closed (Ansari et al., 2020).
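To make the contrast concrete, the two paradigms can be sketched as follows. This is a minimal illustration in which the model objects are hypothetical placeholder callables, not the actual systems described below:

```python
# Minimal sketch contrasting the two ST paradigms; asr_model, punctuator,
# mt_model and st_model are hypothetical placeholders (callables), not
# the paper's actual code.

def cascade_st(audio_features, asr_model, punctuator, mt_model):
    """Pipeline: transcribe, restore casing/punctuation, then translate.
    Errors in the transcript propagate to the MT step."""
    transcript = asr_model(audio_features)   # lowercase, unpunctuated text
    return mt_model(punctuator(transcript))

def direct_st(audio_features, st_model):
    """A single network maps audio to target text: there is no
    intermediate symbolic representation, so prosodic cues in the
    audio remain accessible during translation."""
    return st_model(audio_features)
```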
Our cascade system integrates competitive Transformer-based (Vaswani et al., 2017) ASR and MT components built from large training corpora. Specifically, the ASR model is trained on LibriSpeech (Panayotov et al., 2015), TEDLIUM v3 (Hernandez et al., 2018) and Mozilla Common Voice, together with (utterance, transcript) pairs extracted from the MuST-C (Cattoni et al., 2021), Europarl-ST (Iranzo-Sánchez et al., 2020) and CoVoST 2 ST corpora. ASR outputs are post-processed to add true-casing and punctuation. The MT model is trained on data collected from the OPUS repository, amounting to about 19M, 28M, and 45M parallel sentence pairs respectively for en-es, en-fr, and en-it. Our direct model has the same Transformer-based architecture as the ASR component used in the cascade system. It exploits data augmentation and knowledge transfer techniques successfully applied by participants in the IWSLT-2020 evaluation campaign (Ansari et al., 2020; Potapczyk and Przybysz, 2020; Gaido et al., 2020a) and is trained on MuST-C, Europarl-ST and synthetic data (∼1.5M pairs for each language direction). Systems' performance is shown in Table 2 and discussed in Section 4. Complete details about their implementation and training procedures are provided in the Appendix. All the related code is available at https://github.com/mgaido91/FBK-fairseq-ST/tree/emnlp2021.

Evaluation Data: NEuRoparl-ST
To the best of our knowledge, freely available NE/term-labelled ST benchmarks suitable for our analysis do not exist. The required resource should contain i) the audio corresponding to an utterance, ii) its transcript, iii) its translation in multiple target languages (three in our case), and iv) NE/term annotations in both transcripts and target texts. Currently available MT, ST, ASR, NE and terminology datasets lack at least one of these key components. For example, most MT corpora (e.g. Europarl) lack both the audio sources and NE/terminology annotations. The very few available MT corpora annotated with NE/terminology still lack the audio portion, and extending them to ST would require generating synthetic audio, which is known to be problematic for models' performance. For these reasons, we preferred to create a new resource by annotating the en→es/fr/it transcripts and translations of the Europarl-ST test sets, which are mainly derived from the same original speeches. The result is a multilingual benchmark featuring very high content overlap, thus enabling cross-lingual comparisons.

NE annotation. We used the 18 tags and the annotation scheme defined by the guidelines ("OntoNotes Named Entity Guidelines, Version 14.0") used to annotate the OntoNotes5 corpus (Weischedel et al., 2012). The annotation was carried out manually by a professional interpreter with multi-year experience in translating the verbatim reports of the European Parliament plenary meetings from English, French and Italian into Spanish. This guarantees the high level of language knowledge and domain expertise required to ensure maximum quality and precision. To ease the task, the annotator was provided with transcripts and translations automatically pre-annotated with the BERT-based NER model available in DeepPavlov (Burtsev et al., 2018), as sketched below. Human annotation was then conducted in parallel on the three test sets by labelling, for each audio segment, the English transcript and the three corresponding translations.
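For illustration, the pre-annotation step can be reproduced along these lines. This sketch assumes DeepPavlov's ner_ontonotes_bert configuration, which matches the OntoNotes tagset; the exact configuration used for pre-annotation is an assumption:

```python
# Sketch of the NER pre-annotation step, assuming DeepPavlov's
# ner_ontonotes_bert config (OntoNotes 18-tag scheme); the exact
# configuration used in the paper may differ.
from deeppavlov import build_model, configs

ner_model = build_model(configs.ner.ner_ontonotes_bert, download=True)

# the model returns token and BIO-tag batches for the input sentences
tokens_batch, tags_batch = ner_model(["The European Parliament met in Strasbourg."])
for token, tag in zip(tokens_batch[0], tags_batch[0]):
    print(token, tag)   # e.g. "European B-ORG" ... "Strasbourg B-GPE"
```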
To check annotations' reliability, all the English transcripts were also independently labelled by a second annotator with a background in linguistics and excellent English knowledge. Inter-annotator agreement was calculated in terms of complete agreement, i.e. the exact match of the whole NE in the two annotations. The resulting Dice coefficient (Dice, 1945), i.e. twice the number of matching NEs divided by the total number of NEs in the two annotations, amounts to 93.87% and can be considered highly satisfactory. For the subset of NEs for which complete agreement was found (1,409 in total), we also computed the agreement on labels' assignment with the kappa coefficient (in Scott's π formulation) (Scott, 1955; Artstein and Poesio, 2008). The resulting value is 0.94, which corresponds to "almost perfect" agreement according to its standard interpretation (Landis and Koch, 1977).

Terminology annotation. Similar to Dinu et al. (2019), terminology was automatically extracted by exploiting the IATE term base. Each entry in IATE has an identifier and a language code. Entries with the same identifier and different language codes represent the translations of a term in the corresponding languages. To annotate our parallel texts, we first removed stop-words and lemmatized the remaining words and IATE entries. Then, for each parallel sentence, we marked as terms only those words on the source and the target side that were present in IATE with the same identifier; this matching logic is sketched at the end of this section. The source/target match is essential to avoid the annotation of words that are used with a generic, common meaning but, being polysemous, can be technical terms in different contexts (e.g. the word "board" can refer to a tool or to a committee). Checking the presence of the corresponding translation in the target language disambiguates these cases, leading to a more accurate annotation.

NE and term annotations were merged into a single test set using BIO (Ramshaw and Marcus, 1995) as the span labeling format. When a word was tagged both as a term and as a NE, the latter was chosen, favoring the more reliable manual annotation. Table 1 presents the total number of NEs and terms for the three language pairs, together with their corresponding number of tokens. These numbers differ between source and target texts and across pairs due to the peculiarities of the Europarl-ST data. Specifically, i) sometimes translations are not literal and NEs are omitted in the translation (e.g. when a NE is repeated in the source, one of the occurrences may be replaced by a pronoun in the target text), ii) the professional interpreters and translators "localize" the target translations, i.e. adapt them to the target culture (e.g. while the English source simply contains the name and surname of mentioned European Parliament members, in Italian the first name is omitted and the surname is preceded by "onorevole", honorable), and iii) the number of words a NE is made of can vary across languages (e.g. "European Timeshare Owners Organisation" becomes "Organización Europea de Socios de Tiempo Compartido" in Spanish).
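Returning to the terminology annotation, a simplified sketch of the source/target IATE matching follows. The lemmatize and stopwords helpers are hypothetical placeholders, and multi-word entries and exact span offsets are omitted for brevity:

```python
# Simplified sketch of the IATE-based term annotation described above.
# `lemmatize` and `stopwords` are hypothetical placeholders; multi-word
# entries and span offsets are omitted for brevity.

def content_lemmas(sentence, lemmatize, stopwords):
    """Lemmatize a sentence and drop stop-words."""
    return {lemmatize(w) for w in sentence.lower().split()} - stopwords

def annotate_terms(src_sentence, tgt_sentence, iate_src, iate_tgt,
                   lemmatize, stopwords):
    """iate_src / iate_tgt map a lemmatized entry to the set of IATE
    identifiers it appears under, in the source / target language."""
    src = content_lemmas(src_sentence, lemmatize, stopwords)
    tgt = content_lemmas(tgt_sentence, lemmatize, stopwords)
    term_pairs = []
    for s in src:
        for t in tgt:
            # a word pair is annotated as a term only if the source and
            # target sides share an IATE identifier; this rules out
            # polysemous words used with their generic meaning
            if iate_src.get(s, set()) & iate_tgt.get(t, set()):
                term_pairs.append((s, t))
    return term_pairs
```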

Results
We use our benchmark to measure systems' ability to handle NEs and terminology. Besides comparing the two ST models described in Section 2, we extend our evaluation to the ASR and MT sub-components (the latter being fed with human transcripts) of the cascade system. As shown in Table 2, all models are evaluated in terms of overall output quality and accuracy in rendering the two categories of rare words that are the subject of our study. Transcription and translation quality are respectively measured with WER and SacreBLEU (Post, 2018). Similarly to the Named Entity Weak Accuracy proposed in Hermjakob et al. (2008), we compute NE/term accuracy as the ratio of entities that are present in the systems' output in the correct form, as sketched below (scores are computed with the script available at https://github.com/mgaido91/FBK-fairseq-ST/blob/emnlp2021/scripts/eval/ne_terms_accuracy.py). We present case-insensitive accuracy scores to fairly compare the different models, as the ASR produces lowercase text. For the sake of completeness, case-sensitive NE/term accuracy is also given in Table 3 for the ST and MT models (we do not include the ASR since it generates lowercase text). Comparing these results with those reported in Table 2, for all language pairs we see that the drop in NE accuracy with respect to case-insensitive scores is higher for the cascade model (around 5 points) than for the direct one (around 2 points); e.g. for en-es, from 70.9 to 65.8 for the cascade model and from 71.4 to 69.4 for the direct model. We posit the reason is the propagation of errors made by the module in charge of restoring casing on the ASR output in the cascade architecture.
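In essence, the metric checks whether each annotated entity appears verbatim in the corresponding output. A minimal sketch follows; the released ne_terms_accuracy.py script is the reference implementation, and this simplified version ignores, e.g., entities repeated within a sentence:

```python
# Minimal sketch of case-insensitive NE/term accuracy: the fraction of
# annotated entities whose surface form appears verbatim in the output.
# The released ne_terms_accuracy.py script is the reference
# implementation; this version ignores repeated entities.

def ne_term_accuracy(entities, hypotheses):
    """entities: list of (sentence_id, entity_string) pairs;
    hypotheses: dict mapping sentence_id to the system output."""
    if not entities:
        return 0.0
    correct = sum(
        1 for sent_id, entity in entities
        if entity.lower() in hypotheses[sent_id].lower()
    )
    return correct / len(entities)
```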

ASR and MT results
The WER of the ASR is similar across the three language directions. This is not surprising because the three test sets differ only in very few debates. In terms of accuracy, it is evident that transcribing NEs is more difficult than transcribing terms (84.5 vs 92.4 on average). Besides their lower frequency, the higher difficulty in transcribing NEs can be ascribed to the variety of different pronunciations by non-native speakers (in particular for person, product and organization names). Concerning the MT performance, the BLEU differences between language directions (en-es > en-fr > en-it) reflect the results reported in the Europarl-ST paper (Iranzo-Sánchez et al., 2020). The main reason is that the translations are less literal for some language directions. For instance, the French references are 20% longer than the human source transcripts. Analyzing NE and term translation quality, we notice that NEs are, again, harder to handle than terminology (average accuracy: 80.8 vs 87.1). It is worth noting that accuracy does not strictly depend on translation quality. For instance, en-fr has higher translation quality than en-it (+2.4 BLEU points), but its NE and term accuracy scores are lower.

ST results
Unsurprisingly, when it comes to combining transcription and translation in a single task, performance decreases significantly. In particular, the results of the cascade model are a direct consequence of cumulative ASR and MT errors. As for its sub-components, NEs are harder to handle than terms. Compared to MT results computed on manual transcripts, we see large drops in all languages in both translation quality (-13.2 BLEU on average) and NE/term accuracy (-12.8/-6.0).
Comparing cascade and direct models, the BLEU scores are on par for en-es and en-it (differences are not statistically significant), while the direct model is significantly better for en-fr. This is explained by the aforementioned peculiarity of the French reference translations in Europarl-ST which, unlike in common training corpora (Europarl included), are on average 20% longer than the source transcripts. The MT model of the cascade, trained on massive corpora including Europarl, tends to produce translations that are similar in length to the transcripts and shorter than the Europarl-ST references, and is thus penalized. Having Europarl-ST among its training corpora, the direct model produces outputs more similar in length to the references, resulting in a 2.8 BLEU gain. In terms of NE and term translation quality, the trend is clear and coherent across all languages: the cascade outperforms the direct model on terminology (+3.5 on average), while the direct model has an edge (+0.5) in handling NEs. The advantage of the cascade on terminology can be explained by the higher reliability of its MT component in selecting domain-specific target words compared to direct models built on much smaller ST training corpora. One example is the English term "plastic explosive", which is correctly translated into Italian by the cascade ("esplosivo plastico") and wrongly by the direct model ("esplosivo di plastica", En: "explosive made of plastic"). Concerning NEs, instead, unmediated access to the audio helps the direct model avoid both i) error propagation (e.g. the NE "Lamfalussy" is correctly translated by the direct model, while the MT component of the cascade is not able to recover the wrong ASR output "blunt Hallucy"), and ii) the translation of NEs that are homographs of common nouns in the source language but should be copied as is (e.g. the English surname "Parish" is translated into Italian as "Parrocchia" by the cascade, but correctly preserved in the direct model's output).
Looking at NE types (complete results in the Appendix), the two ST systems are always close to each other, reflecting the global accuracy scores in Table 2. For both approaches, the differences across the NE types depend on their capability to recognize entities in the audio and properly translate them. Two types are paradigmatic (see Figure 1). PERSON names (the worst category, with 37-40% ST accuracy) are difficult to recognize in the audio, as shown by the poor performance of the ASR and both ST systems, while their translation from manual transcripts (MT) is trivial as it only requires copying them from the source. Conversely, ST and MT results are very close on the more frequent and normally easier-to-pronounce LOCATION names, for which the problem lies more in translation than in recognition.

Conclusions
While previous ST research has focused on improving systems' overall performance, little has been done to evaluate the existing paradigms in relation to well-known specific problems in automatic translation at large. Translating rare words is no exception, also due to the dearth of suitable labelled benchmarks. To fill this gap, we focused on named entities and terminology, which combine the problems inherent to low frequency in the training data with the difficulty of recognizing them in the audio and mapping their meaning into few valid options. We created NEuRoparl-ST, an annotated benchmark covering three language directions, and used it for the first comparison of state-of-the-art cascade and direct ST systems on NE and term translation. Our results show that NEs, especially person names, are in general more difficult to handle than terminology.

A.1 Cascade ST Model
The ASR component of our cascade is a Transformer-based (Vaswani et al., 2017) model consisting of 11 encoder layers, 4 decoder layers, 8 attention heads, 512 features for the attention layers and 2,048 hidden units in the feed-forward layers. Its encoder has been adapted for processing speech by means of two initial 2D convolutional layers that reduce the input sequence length by a factor of 4 (see the sketch below). Also, the encoder self-attentions are biased using a logarithmic distance penalty that favors the local context (Di Gangi et al., 2019). Similar to Gaido et al. (2020a), the model is trained with an additional Connectionist Temporal Classification (CTC) loss (Graves et al., 2006), which is added as a linear layer on top of the 8th encoder layer. As training data, we used LibriSpeech (Panayotov et al., 2015), TEDLIUM v3 (Hernandez et al., 2018) and Mozilla Common Voice, together with (utterance, transcript) pairs extracted from three ST corpora: MuST-C (Cattoni et al., 2021), Europarl-ST, and CoVoST 2. We augment the data with SpecAugment (Park et al., 2019) and, after lowercasing and punctuation removal, text is split into sub-words with 8,000 BPE (Sennrich et al., 2016) merge rules. We set the dropout to 0.1. We optimize label smoothed cross entropy with smoothing factor 0.1 using Adam (Kingma and Ba, 2015). The learning rate is increased for 5,000 steps from 0.0003 up to 0.0005 and then decays with an inverse square root policy. Our mini-batches are composed of up to 12K tokens or 8 samples and we delay parameter updates for 8 mini-batches. We train on 8 K80 GPUs (11GB RAM each). Before feeding the MT with the ASR outputs, the transcripts are post-processed by an additional model that restores casing and punctuation. This model is a Transformer-based system trained on data from the OPUS repository, where the source text is lowercased and stripped of punctuation and the target text is a normally formatted sentence.
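For illustration, the convolutional front-end that compresses the input sequence by a factor of 4 can be sketched as follows. This is an illustrative PyTorch sketch of the idea, not the repository code; kernel and channel sizes are assumptions:

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Illustrative sketch of the 2D-convolutional front-end: two
    stride-2 convolutions reduce the time axis by a factor of 4 before
    the Transformer encoder layers. Kernel/channel sizes are
    assumptions, not the repository's actual values."""
    def __init__(self, n_mels=80, d_model=512):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(64 * (n_mels // 4), d_model)

    def forward(self, x):              # x: (batch, time, n_mels)
        x = x.unsqueeze(1)             # (batch, 1, time, n_mels)
        x = torch.relu(self.conv1(x))  # time and freq axes halved
        x = torch.relu(self.conv2(x))  # halved again (4x total)
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)
        return self.proj(x)            # (batch, time / 4, d_model)
```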
The MT component is a Transformer model with 6 layers for both the encoder and the decoder, 16 attention heads, 1,024 features for the attention layers and 4,096 hidden units in the feed-forward layers. Training data were collected from the OPUS repository (see Section 2). We optimize label smoothed cross entropy (Szegedy et al., 2016) with Adam, with a learning rate that linearly increases for 8,000 updates up to 0.0005, after which it decays with an inverse square root policy. Each batch is composed of 4 mini-batches of 3,072 tokens. Dropout is set to 0.3. We train for 200,000 updates and average the last 10 checkpoints. Source and target languages share a BPE (Sennrich et al., 2016) vocabulary of 32k sub-words.
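The learning-rate policy (linear warm-up followed by inverse-square-root decay) corresponds to the following schedule, sketched here with the MT settings quoted above:

```python
# Sketch of the inverse-square-root schedule with linear warm-up used
# for the MT component (peak_lr = 5e-4 after 8,000 warm-up updates).
def inverse_sqrt_lr(step, warmup_steps=8000, peak_lr=5e-4):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps        # linear warm-up
    return peak_lr * (warmup_steps / step) ** 0.5   # inverse sqrt decay
```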

A.2 Direct ST Model
Our direct model has the same architecture as the ASR component described above, which is also used to initialise its encoder weights (Bansal et al., 2019). In addition, we exploit data augmentation and knowledge transfer techniques successfully applied by participants in the IWSLT-2020 evaluation campaign (Ansari et al., 2020; Potapczyk and Przybysz, 2020; Gaido et al., 2020a). For data augmentation, we use SpecAugment and time stretch, together with synthetically generated data obtained by translating the transcripts contained in the ASR training corpora with our NMT model. Besides encoder pre-training, for knowledge transfer we also apply knowledge distillation (KD): as in (Liu et al., 2019; Gaido et al., 2020b), our student ST model is trained by computing the KL divergence (Kullback and Leibler, 1951) with the output probability distribution of the NMT model used as teacher.
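Word-level KD of this kind amounts to minimizing, at each target position, the KL divergence between the student's output distribution and the teacher's. A generic PyTorch sketch of the technique follows, not the repository code; the temperature parameter is an assumption:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Generic sketch of word-level knowledge distillation: KL
    divergence between the student ST model's output distribution and
    that of the frozen MT teacher, at each target position.
    Expected shapes: (batch, tgt_len, vocab_size)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities
    # as target; the t**2 factor keeps gradient magnitudes comparable
    # across temperatures
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t ** 2)
```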
The whole training procedure is carried out in three phases: starting from the synthetically generated data (with the KD loss function), continuing with MuST-C and Europarl-ST (still with KD), and concluding with fine-tuning on the same ST data, but switching to the label-smoothed cross entropy loss.

B Statistics for the Annotated Test Sets

Table 4 presents the number of named entities (NEs) and terms annotated in the test sets, divided by category. Since both NEs and terms can be composed of more than one word (e.g. for a person it is common to have both the name and surname), the total number of tokens per category is also given.

Table 5: Case-insensitive accuracy scores for all the NE types on the three language pairs. We report the results for the ASR, MT, Cascade (Casc.) and Direct (Dir.) systems.