Towards Speech Dialogue Translation Mediating Speakers of Different Languages

We present a new task, speech dialogue translation mediating speakers of different languages. We construct the SpeechBSD dataset for the task and conduct baseline experiments. Furthermore, we consider context to be an important aspect that needs to be addressed in this task and propose two ways of utilizing context, namely monolingual context and bilingual context. We conduct cascaded speech translation experiments using Whisper and mBART, and show that bilingual context performs better in our settings.


Introduction
In this global era, it is becoming increasingly important for people from different countries/regions to interact with each other and have a mutual understanding. Recent advancements in machine translation (MT) technologies have enabled us to communicate with people worldwide, especially in text. Chat translation or dialogue machine translation (Liu et al., 2021) supports such communications, enabling people who use different languages to have cross-language chats. Speech translation (ST) has also recently shown success (e.g., Chen et al., 2022), especially in monologue translation (e.g., Di Gangi et al., 2019). However, to the best of our knowledge, no study has focused on ST of dialogues, which is an important aspect of language usage.
In this study, we propose a new task: speech dialogue translation (SDT), aiming to mediate speakers of different languages. We consider bilingual dialogues where several people who speak different languages talk with each other, mediated by an ST system.
It is important to consider context in SDT because the context can appear in different languages, which cannot be readily handled by current ST systems that mainly focus on one translation direction. Figure 1 shows an example of an ST-mediated dialogue between an English speaker and a Japanese speaker. They are discussing some ideas, and the English speaker asks, "What do you think about it?" The Japanese speaker responds by saying the idea is naive, but without context the response can be mistranslated as "I think it's a bit sweet" because "甘い" has two meanings, sweet and naive. By utilizing dialogue context, the meaning of "甘い" becomes clear, and the utterance can be translated properly.
For the proposed task, we construct the SpeechBSD dataset based on an existing text dialogue corpus, the BSD (Business Scene Dialogue) corpus (Rikters et al., 2019). We collect audio of the BSD corpus through crowdsourcing along with speaker attributes.
We conduct speech-to-text cascaded ST experiments on the dataset. There are two mainstream methods for ST: the cascade method (Stentiford and Steer, 1988), where automatic speech recognition (ASR) and MT are chained together, and the end-to-end method (Duong et al., 2016; Berard et al., 2016), where translations are directly predicted from speech. Recent studies (Bentivogli et al., 2021; Tran et al., 2022) suggest that the two methods are on par. We conduct cascade ST experiments using Whisper (Radford et al., 2022) for ASR and mBART (Liu et al., 2020) for MT.
We consider three settings for translation: without context, with monolingual context, and with bilingual context. The monolingual context is composed in the language in which the utterance to be translated is spoken, whereas the bilingual context is composed in the original languages of the spoken utterances (see examples in Figure 1). We show that translation with bilingual context performs better than translation without context by up to 1.9 BLEU points in MT and 1.7 BLEU points in cascade ST in our settings. We also conduct a manual evaluation focusing on zero anaphora, a grammatical phenomenon in Japanese where arguments of verbs are omitted when they are apparent from the context. We show that with bilingual context, the MT models can often predict zero pronouns correctly.

Related Work
Although neural MT has greatly improved over the past few years, the translation of dialogues remains a challenging task because of its characteristics. Liu et al. (2021) summarize the recent progress of dialogue MT and categorize its issues into four categories: coherence, consistency, cohesion, and personality. The main approaches to address these problems include document MT (e.g., Liu et al., 2021), usage of pretrained models (e.g., Wang et al., 2020), and auxiliary task learning utilizing speaker information (e.g., Liang et al., 2021).
Considering context in ST has recently been studied for the end-to-end approach (Zhang et al., 2021). We point out that, although not addressed in this work, considering context for ASR is also an active research area (e.g., Inaguma and Kawahara, 2021).
In this work, we focus on the translation of speech dialogue. We use mBART, which performed best in a previous work on chat translation (Liu et al., 2021), and also consider utilizing context.

Speech Dialogue Translation (SDT)
In SDT, several speakers converse in different languages with the help of a translation system. In this work, we consider M speakers and two languages, L^1 and L^2. A dialogue is a sequence of utterances, where the t-th utterance is U_t = (S_t^m, L_t^n, X_t). Here, S_t^m is the speaker, L_t^n is the language spoken, and X_t is the speech signal of the t-th utterance. Let Y_t^n (n = 1, 2) be text that has the same meaning as X_t in language L^n. The task of SDT is to generate, for every utterance U_t, the translation Y_t^2 from the speech signal X_t when the source language is L^1 (or the translation Y_t^1 from X_t when the source language is L^2).
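As a concrete illustration, the tuple U_t = (S_t, L_t, X_t) can be sketched as a small data structure. The class, field, and function names below are illustrative only and are not part of the released dataset:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Utterance:
    """One dialogue turn U_t = (S_t, L_t, X_t), with optional gold text Y_t."""
    speaker: str                      # S_t: speaker identifier
    language: str                     # L_t: "ja" or "en"
    audio_path: str                   # X_t: path to the recorded speech
    gold_text: Optional[str] = None   # Y_t in the spoken language, if available

def translation_direction(utt: Utterance):
    """Source and target language for translating U_t in a Ja-En dialogue."""
    return utt.language, "en" if utt.language == "ja" else "ja"

# The translation direction flips per utterance, depending on who spoke.
u3 = Utterance(speaker="S1", language="ja", audio_path="scenario1_utt3.wav")
assert translation_direction(u3) == ("ja", "en")
```

Note that, unlike conventional one-directional ST, the system must decide the translation direction anew for each utterance from its spoken language.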

SpeechBSD Dataset
We construct the SpeechBSD dataset to study SDT. It is based on an existing text dialogue dataset, the BSD corpus (Rikters et al., 2019, 2021). We collect audio of all the sentences in the dataset, along with speaker attributes (gender and homeplace), through crowdsourcing.

BSD Corpus
The BSD corpus is a parallel corpus of English and Japanese composed of manually designed business scene dialogues. Each dialogue, called a scenario, contains 30 sentences on average, spoken by 2-5 speakers. Half of the scenarios were originally written in English and half in Japanese so that the expressions are not biased toward one language.

Dataset Construction
First, we divided each scenario by speaker. For example, in Figure 1, the original BSD corpus contains parallel text of a Japanese dialogue (Y_1^1, Y_2^1, Y_3^1) and an English dialogue (Y_1^2, Y_2^2, Y_3^2). By assigning one language to each speaker, we obtain a cross-language dialogue (e.g., Y_1^1, Y_2^2, and Y_3^1); swapping the language assignment yields another. In this way, we can compose two cross-language dialogues from one scenario of the BSD corpus. We collected audio through crowdsourcing so that each speaker part is spoken by a different worker. We designed a web application to record audio and collected English speech from the US using Amazon Mechanical Turk and Japanese speech from Japan using Yahoo! Crowdsourcing. We also collected the gender and homeplace (US state or Japanese prefecture) of the speakers, as these may affect translation performance. The instructions given to the workers are shown in Appendix A.1.
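The composition of the two cross-language dialogues from one parallel scenario can be sketched as follows. We assume a parallel scenario with per-utterance speaker labels; the dictionary keys and speaker labels are illustrative, not the released data format:

```python
def compose_cross_language_dialogues(scenario):
    """Compose the two cross-language dialogues from one parallel scenario.

    scenario: list of dicts with keys "speaker" ("A" or "B"), "ja", "en".
    In the first dialogue speaker A speaks Japanese and B English;
    in the second the language assignment is swapped.
    """
    first = [u["ja"] if u["speaker"] == "A" else u["en"] for u in scenario]
    second = [u["en"] if u["speaker"] == "A" else u["ja"] for u in scenario]
    return first, second

scenario = [
    {"speaker": "A", "ja": "彼は良い考えだと言ってました。", "en": "He said it's a good idea."},
    {"speaker": "B", "ja": "あなたはどう思いますか?", "en": "What do you think about it?"},
]
d1, d2 = compose_cross_language_dialogues(scenario)
assert d1 == ["彼は良い考えだと言ってました。", "What do you think about it?"]
assert d2 == ["He said it's a good idea.", "あなたはどう思いますか?"]
```

Because both assignments are used, each parallel scenario contributes two bilingual dialogues worth of speech.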

Statistics of the SpeechBSD Dataset
In total, we collected 24.3 hours of English speech and 30.7 hours of Japanese speech. Details are provided in Appendix B, Table 2. Regarding speaker gender, English speech was balanced, whereas there were more male speakers in Japanese. As for homeplace, the Japanese speakers were distributed roughly according to the population distribution, while the English speakers were less diverse (Appendix B, Figure 3).

Considering Context for SDT
We propose two ways to consider context in SDT: monolingual context and bilingual context.
First, for every utterance U_t, an ASR system is used to obtain the transcript Y_t^n. The monolingual context is composed in the source language of the utterance to be translated. For example, in Figure 1, when translating the third utterance U_3 from Japanese to English, as the source language of the utterance is Japanese (L^1), the context (Y_1^1 and Y_2^1) is also composed in Japanese. Let the context composed in this way be Y_{<t}^n.
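The monolingual context composition above can be sketched in a few lines, assuming each previous utterance is available in both languages (ASR transcript in its spoken language, MT output or gold text in the other). The function and variable names are illustrative:

```python
def monolingual_context(dialogue, t, src_lang):
    """Monolingual context for utterance t.

    dialogue[tau]: dict mapping "ja"/"en" to the text of utterance tau in
    that language. All previous utterances are returned in src_lang, the
    source language of the utterance being translated.
    """
    return [dialogue[tau][src_lang] for tau in range(t)]

dialogue = [
    {"ja": "彼は良い考えだと言ってました。", "en": "He said it's a good idea."},
    {"ja": "あなたはどう思いますか?", "en": "What do you think about it?"},
]
# Translating U_3 (index t = 2) from Japanese: the context is all Japanese.
assert monolingual_context(dialogue, 2, "ja") == [
    "彼は良い考えだと言ってました。",
    "あなたはどう思いますか?",
]
```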
For monolingual context experiments, we use two translation models, one for each translation direction. The training objective of the MT model that translates from L^1 to L^2 is to maximize the following log likelihood:

L_{1→2} = log P(Y_{<t}^2, Y_t^2 | Y_{<t}^1, Y_t^1).  (1)

A similar objective L_{2→1} can be derived when L^2 is the source language and L^1 is the target language.
Postprocessing is applied to extract Y_t^2 from the output, which contains both Y_{<t}^2 and Y_t^2.

The bilingual context is composed in the original languages of the spoken utterances. For example, in Figure 1, when translating the third utterance U_3 from Japanese to English, the bilingual context on the source side is Y_1^1 and Y_2^2, which involves both languages. The bilingual context on the target side is Y_1^2 and Y_2^1. Because there is no concept of source or target language in this case, let the source-side utterance be Ȳ_t, the source-side context Ȳ_{<t}, the target-side utterance Ỹ_t, and the target-side context Ỹ_{<t}. The MT model is trained with the following objective:

L = log P(Ỹ_{<t}, Ỹ_t | Ȳ_{<t}, Ȳ_t).  (2)

Postprocessing is applied to extract Ỹ_t from the output.
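The bilingual context composition and the postprocessing step can be sketched as follows; the separator handling assumes the model output joins segments with the same token used at training time, and all names are illustrative:

```python
def bilingual_context(history, spoken_langs, t, source_side=True):
    """Bilingual context for utterance t.

    history[tau]: dict mapping "ja"/"en" to the text of utterance tau in
    that language. spoken_langs[tau]: language utterance tau was actually
    spoken in. The source side keeps each utterance in its spoken
    language; the target side uses the other language.
    """
    other = {"ja": "en", "en": "ja"}
    ctx = []
    for tau in range(t):
        lang = spoken_langs[tau] if source_side else other[spoken_langs[tau]]
        ctx.append(history[tau][lang])
    return ctx

def extract_current(output):
    """Postprocessing: the model output contains the context and the
    current utterance joined by </s>; keep only the final segment."""
    return output.split("</s>")[-1].strip()

history = [
    {"ja": "彼は良い考えだと言ってました。", "en": "He said it's a good idea."},
    {"ja": "あなたはどう思いますか?", "en": "What do you think about it?"},
]
# Figure 1: utterance 1 spoken in Japanese, utterance 2 in English.
src_ctx = bilingual_context(history, ["ja", "en"], t=2, source_side=True)
assert src_ctx == ["彼は良い考えだと言ってました。", "What do you think about it?"]
assert extract_current("a</s>b</s>c") == "c"
```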
In practice, we use a constrained context of size c, which specifies the number of previous utterances used for translation in addition to the utterance to be translated. More formal definitions of the monolingual, bilingual, and constrained contexts are provided in Appendix C.

Automatic Speech Recognition
In SDT, ASR has to handle bilingual inputs. We used the multilingual ASR model Whisper (Radford et al., 2022). The medium model with 12 encoder and decoder layers was used without finetuning. Further details are provided in Appendix D.1. We evaluated its performance on the SpeechBSD test set. For English, the word error rate was 8.3%, and for Japanese, the character error rate was 13.2%.
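For reference, the two reported metrics can be computed from (hypothesis, reference) pairs with a short Levenshtein-distance sketch; word error rate operates on whitespace tokens (English), character error rate on characters (Japanese). This is a generic implementation, not the exact scoring script used in the experiments:

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i] + [0] * len(ref)
        for j, r in enumerate(ref, 1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (h != r))  # substitution or match
        prev = cur
    return prev[-1]

def wer(hyp, ref):
    """Word error rate: token-level edit distance / reference length."""
    h, r = hyp.split(), ref.split()
    return edit_distance(h, r) / len(r)

def cer(hyp, ref):
    """Character error rate, used for Japanese (no word boundaries)."""
    return edit_distance(list(hyp), list(ref)) / len(ref)

assert edit_distance(list("kitten"), list("sitting")) == 3
assert wer("he said its a good idea", "he said it's a good idea") == 1 / 6
```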

Machine Translation
The MT model also needs to handle bilingual inputs in SDT. We used mBART (Liu et al., 2020) and finetuned it on SpeechBSD for MT. The large model with 12 encoder and decoder layers was used. Although the dialogues are treated as bilingual in this study, the predictions were recomposed into the monolingual dialogue form for evaluation, because the performance of MT models is usually evaluated on a single language pair. SacreBLEU (Post, 2018) was used for calculating BLEU scores. Further details are provided in Appendix D.2.
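The recomposition step can be sketched as regrouping each bilingual dialogue's predictions by translation direction, so that each group is scored as an ordinary single-pair test set. The function name is illustrative, and the sacreBLEU call is shown only as a comment to keep the snippet dependency-free:

```python
def recompose_by_direction(utterances):
    """Regroup one bilingual dialogue into per-direction test sets.

    utterances: list of (source_lang, hypothesis, reference) triples.
    Returns hypotheses and references grouped by translation direction.
    """
    buckets = {"ja-en": ([], []), "en-ja": ([], [])}
    for src, hyp, ref in utterances:
        key = "ja-en" if src == "ja" else "en-ja"
        buckets[key][0].append(hyp)
        buckets[key][1].append(ref)
    return buckets

utts = [
    ("ja", "He said it's a good idea.", "He said it is a good idea."),
    ("en", "あなたはどう思いますか?", "あなたはどう思いますか?"),
]
buckets = recompose_by_direction(utts)
assert buckets["ja-en"][0] == ["He said it's a good idea."]
assert buckets["en-ja"][1] == ["あなたはどう思いますか?"]
# Each group can then be scored separately, e.g. with sacreBLEU:
#   sacrebleu.corpus_bleu(hyps, [refs])  # one call per direction
```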

Context Settings
Three settings were considered: translation without context, with monolingual context, and with bilingual context.
Without Context  Each utterance in a scenario was treated as a separate sentence in this setting. Finetuning was performed separately for each translation direction.

Monolingual Context  For each utterance in a scenario, monolingual context with context width c = 5 was composed in the way described in Section 5. The context utterances and the utterance to translate were concatenated with the end-of-sentence token </s>. Finetuning was performed separately for each translation direction.
Bilingual Context  For each utterance in a scenario, bilingual context with context width c = 5 was composed in the way described in Section 5. The context utterances and the utterance to translate were concatenated with the end-of-sentence token </s>. As there is no concept of source or target language in this setting, a single model was finetuned.

Results
Table 1 (upper part) shows the results of the MT experiments. Comparing "Without" with "Monolingual," an improvement of more than 0.9 BLEU points was observed using monolingual context. Comparing "Monolingual" with "Bilingual," the latter performed better, especially in Ja-En.

Manual Evaluation
To verify how context can help improve translations, we conducted a manual evaluation focusing on a grammatical phenomenon called zero anaphora, as discussed in Rikters et al. (2019). Similarly to Rikters et al. (2019), we counted the number of sentences with the pronouns I, you, he, she, it, and they in English and observed that 63% of the test sentences included them. We sampled 50 of those sentences from the test set. First, we checked whether the subjects of the Japanese sentences were zero pronouns by comparing the Japanese and English gold references. Then we checked whether the zero pronouns were translated into English correctly in the predictions of each Ja-En system. Out of the 50 sentences, 29 had zero pronoun subjects. The number of sentences in which the missing pronoun was translated correctly was 19, 20, and 24 for the without context, monolingual context, and bilingual context settings, respectively. This shows that context can help disambiguate zero pronouns, and that bilingual context in particular can help generate correct pronouns. Examples of the sentences are shown in Appendix E.
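The sentence-selection step can be sketched with a simple regular expression over the six pronouns. Whether the original count matched case-insensitively is our assumption:

```python
import re

# The six English pronouns checked in the manual evaluation.
PRONOUN_RE = re.compile(r"\b(I|you|he|she|it|they)\b", re.IGNORECASE)

def has_target_pronoun(sentence):
    """True if the English sentence contains any of the six pronouns."""
    return PRONOUN_RE.search(sentence) is not None

def pronoun_sentence_ratio(sentences):
    """Fraction of sentences containing at least one target pronoun."""
    return sum(map(has_target_pronoun, sentences)) / len(sentences)

assert has_target_pronoun("What do you think about it?")
assert not has_target_pronoun("Sounds like a good plan.")
```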

Cascade Speech Translation
Cascade ST experiments were performed by feeding the Whisper recognition results to the MT models described in Section 6.2.
Table 1 (lower part) shows the results. Similarly to MT, the BLEU score improved by more than 0.7 points when using monolingual context. Further improvements of more than 0.5 points were observed using bilingual context.
We also performed a manual evaluation as in Section 6.2.3. The number of sentences in which the missing pronoun was translated correctly was 16, 18, and 22 for the without context, monolingual context, and bilingual context settings, respectively. This shows a similar trend to the results of Section 6.2.3, with lower translation accuracy. Examples of the sentences are shown in Appendix E.

Conclusion
We presented a new task, SDT, aiming to mediate speakers of different languages. We constructed the SpeechBSD dataset via crowdsourcing. We performed MT experiments utilizing context and showed its effectiveness. In the future, we plan to perform experiments in end-to-end ST settings and on SDT utilizing speaker attributes.

Limitations
The experiments were performed only on Japanese-English bilingual dialogue collected from a limited number of native speakers. Although the methods proposed in this work can be applied to any language pair, drawing conclusions for other language pairs should be avoided. The experiments were performed using the existing pretrained models Whisper and mBART, and the methods used to pretrain those models would have affected the translation performance reported in this work. The dialogues in the SpeechBSD dataset are read speech of pre-composed text dialogues, and further research is required for more realistic settings such as spontaneous dialogues.

Ethics Statement
Consent was obtained from the crowdsourcing workers when collecting audio, gender, and homeplace. The SpeechBSD dataset is made public under the Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) 4.0 license, which is the same as the license of the BSD corpus, and shall be used only for research purposes. Caution should be exercised when using the gender or homeplace information included in the dataset so that the identities of the speakers are not revealed.

A.1 Crowdsourcing Instructions Given to the Workers
Figure 2 shows the instructions given to the crowdsourcing workers and the interface used to record audio. We asked the workers to speak clearly and formally and to check that the audio was properly recorded. With the interface, we made sure that the workers agreed that their voices would be released and that the utterances were properly recorded.

A.2 Crowdsourcing Payment
The crowdsourcing tasks were divided according to the number of utterances to record. The authors performed preliminary crowdsourcing tasks and estimated how long each task would take. We paid the workers according to the estimated time and a predefined hourly wage determined for each country.

B Statistics of the SpeechBSD Dataset
Table 2 shows the statistics of the SpeechBSD dataset. Figure 3 shows the homeplace distribution of the speakers of the SpeechBSD dataset. The Japanese distribution (3(b)) roughly reflects Japan's demographics (concentrated around Tokyo, Osaka, and Nagoya), whereas the English distribution (3(a)) is more biased (concentrated heavily on California and Virginia). We believe these biases are caused by the differences in the crowdsourcing platforms used.

C Formal Definition of Context
Here, we formally formulate the monolingual, bilingual, and constrained contexts introduced in Section 5.
For simplicity, we consider the case where M = 2 and m = n (i.e., speaker S^i speaks in language L^i (i = 1, 2)). In addition, we suppose the speakers speak alternately and that speaker S^1 starts the conversation; in other words, we define a map L(t) from the utterance index t to the index of the spoken language, with L(t) = 1 when t is odd and L(t) = 2 when t is even. (In the experiments, consecutive utterances by the same speaker are treated as separate utterances. If there are three or more speakers, we number speakers in the order of appearance and regard speakers with the same parity as speaking the same language.)

The monolingual context is composed of previous utterances in a single language. In other words, the monolingual context of utterance U_t in language L^i is

Y_{<t}^i = (Y_1^i, ..., Y_{t-1}^i).

For example, in Figure 1, when translating the third utterance U_3 from Japanese to English, the monolingual context of the source side is "彼は良い考えだと言ってました。あなたはどう思いますか?", and that of the target side is "He said it's a good idea. What do you think?" Using this formulation, we can formally define the training objective of Equation 1. During inference, ASR transcripts are used for the source language of the current utterance, and translations of ASR transcripts are used for the target language of the current utterance. During training, the corresponding gold text is used.

The bilingual context is composed of transcripts in the two original languages; ASR transcripts are used during inference, and gold transcripts during training. The bilingual context of utterance U_t is

Ȳ_{<t} = (Ȳ_1, ..., Ȳ_{t-1}), where Ȳ_τ = Y_τ^{L(τ)}.

For example, in Figure 1, when translating the third utterance U_3 from Japanese to English, the bilingual context of the source side is "彼は良い考えだと言ってました。What do you think about it?", and that of the target side is "He said it's a good idea. あなたはどう思いますか?" For bilingual context experiments, the MT system has to be able to handle both translation directions. Let the translation of Ȳ_{<t} be Ỹ_{<t} = Ỹ_{<t}^1 ∪ Ỹ_{<t}^2, where Ỹ_τ = Y_τ^{3−L(τ)}. By setting Ȳ_{<t} as the source-side context and Ỹ_{<t} as the target-side context, we can formally define the training objective of Equation 2.
In practice, we constrain the context U_{<t} = {U_τ | τ < t} with a context width c because the maximum input length the MT models can handle is limited. The constrained context of utterance U_t with context width c is

{U_τ | t − c ≤ τ < t}.
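The constrained context amounts to keeping only the last c utterances, which can be sketched as a one-line slice (the function name is illustrative):

```python
def constrained_context(context, c):
    """Constrained context: at most the last c utterances of U_{<t}."""
    # context[-0:] would return the whole list, so guard c == 0 explicitly.
    return context[-c:] if c > 0 else []

ctx = ["u1", "u2", "u3", "u4", "u5", "u6"]
assert constrained_context(ctx, 5) == ["u2", "u3", "u4", "u5", "u6"]
assert constrained_context(ctx, 0) == []
```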

D Experimental Settings D.1 ASR
Whisper is a Transformer-based model that takes as input 80-channel log-Mel spectrograms computed from audio sampled at 16,000 Hz. As it is trained with 680,000 hours of data from various domains, the model is robust enough to work without any finetuning. We used the byte-level BPE vocabulary (size 50,257) of the pretrained model. We assumed the language of each utterance was given beforehand and fed the language tag to the model as a prefix token. We evaluated the development set of the SpeechBSD dataset using the base, small, medium, and large models with either greedy decoding or beam search decoding with beam size 5. We observed that the medium model with greedy decoding performed best for both English and Japanese, and these settings were used for further experiments.

D.2 MT
We used mBART trained on 25 languages for the experiments. A BPE vocabulary of size 25,001 was used. As a preprocessing step, BPE was applied to all utterances with the sentencepiece (Kudo and Richardson, 2018) toolkit. Fairseq (Ott et al., 2019) was used for training and inference. The same hyperparameters as in Liu et al. (2020) were used, except that the number of training epochs was determined by early stopping with patience 10 on the validation loss. We did not use different random seeds for the experiments because Liu et al. (2020) reported that the finetuning process was stable across seeds. When evaluating the model, the averaged weights of the last 10 checkpoints were used. The SacreBLEU signatures were nrefs:1|case:mixed|eff:no|tok:ja-mecab-0.996-IPA|smooth:exp|version:2.0.0 for En-Ja and nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0 for Ja-En. We conducted significance tests with paired approximate randomization (Riezler and Maxwell, 2005) with 10,000 trials and a p-value threshold of 0.05 to compare the BLEU scores of "without context" with the others, and "monolingual context" with "bilingual context."

For the bilingual context MT experiments, to match the finetuning style of mBART, language tags such as ja_XX or en_XX have to be appended at the end of each translation unit. However, in the bilingual context setting, both the source and the target side contain both languages, which does not comply with the finetuning style described in the original mBART paper (Liu et al., 2020). We conducted two kinds of experiments: appending ja_XX to the input and en_XX to the output, and the other way around. A statistical significance test showed that they were not significantly different. We report the results of the systems where the language pair of the utterance to be translated matches the language pair specified by the appended language tags.
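The significance test can be sketched as below. Note that a faithful randomization test for BLEU shuffles per-sentence sufficient statistics (n-gram counts and lengths) rather than additive scores, so this additive version is a simplification for illustration:

```python
import random

def paired_approximate_randomization(scores_a, scores_b, trials=10_000, seed=0):
    """Two-sided paired approximate randomization test.

    scores_a/scores_b: per-sentence score contributions of systems A and B.
    Randomly swaps the paired contributions and counts how often the
    absolute difference of totals is at least the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    at_least = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap this sentence's contributions
            total_a += a
            total_b += b
        if abs(total_a - total_b) >= observed:
            at_least += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least + 1) / (trials + 1)

# Identical systems: the observed difference is 0, so p is exactly 1.
assert paired_approximate_randomization([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], trials=100) == 1.0
```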
As for the context size c, we varied it from 1 to 8 and evaluated the models with the BLEU score on the validation set. The results are shown in Figure 4. In the bilingual context setting, c = 5 was the best for both En-Ja and Ja-En. In the monolingual context setting, 5 and 6 were the best for En-Ja, and 3 for Ja-En. The difference between context widths 3 and 5 was not statistically significant in BLEU for Ja-En. Therefore, for a consistent comparison, we report the results on the test set with c = 5 in Table 1.
We used 4 Tesla V100 or Titan RTX GPUs for the experiments. The total computation time, including hyperparameter search, was 278 hours.

E Example Sentences from Manual Evaluation
Table 3 shows examples from the manual evaluation described in Section 6.2.3. In the first example, the zero pronoun (She) is predicted correctly when monolingual or bilingual context is used, in both the MT and cascade ST experiments. In the second example, the zero pronoun (They) could not be predicted correctly by any system.

Figure 1: The importance of considering context in SDT. "甘い" can be translated into either "sweet" or "naive," which can be disambiguated with the context. We consider two types of context for translation, monolingual context and bilingual context.

Figure 2: Crowdsourcing interface used to record audio. The upper part shows the instructions given to the workers.

Figure 3: Homeplace distribution of the speakers of the SpeechBSD dataset by the number of utterances.

Figure 4: BLEU score on the development set when changing the context size c.

Table 2: Statistics of the SpeechBSD dataset. The number of sentences is the same as the number of utterances in this dataset, as in the original BSD corpus.