Terminology-Aware Translation with Constrained Decoding and Large Language Model Prompting

Terminology correctness is important in downstream applications of machine translation, and a prevalent way to ensure it is to inject terminology constraints into a translation system. In our submission to the WMT 2023 terminology translation task, we adopt a translate-then-refine approach which is domain-independent and requires minimal manual effort. We annotate random source words with pseudo-terminology translations obtained from word alignment to first train a terminology-aware model. Further, we explore two post-processing methods. First, we use an alignment process to discover whether a terminology constraint has been violated, and if so, we re-decode with the violating word negatively constrained. Alternatively, we leverage a large language model to refine a hypothesis by providing it with terminology constraints. Results show that our terminology-aware model learns to incorporate terminology constraints effectively, and the large language model refinement process can further improve terminology recall.


Introduction
One of the major obstacles encountered by neural machine translation (NMT) systems pertains to the utilization of suitable domain-related words when translating specialized content not present in the training data. An illustrative instance of this challenge arises when translating "transformer" from English into another language, where the accurate translation depends on the context or the preference of the audience (Figure 1). A straightforward literal translation approach often leads to suboptimal outcomes, prompting human translators unfamiliar with domain-specific knowledge to resort to reference materials for terminology precision. This issue is prevalent in the translation industry, with many commercial translation service providers offering paid solutions to address it. Furthermore, it is a popular area in machine translation research, indicated by efforts such as WMT shared task organization and participation focusing on terminology and domain-specific translations (Alam et al., 2021; Bawden et al., 2019, 2020, inter alia).

* Equal contribution.

Figure 1: Terminology hints can help disambiguate polysemantic words when translating with limited context. (Example: "transformer" translated into Chinese could be 变压器, an electric transformer; 变形金刚, the Transformers character; or 变换器, something that changes.)

This year's WMT terminology translation task features three language directions: German-to-English, Chinese-to-English, and English-to-Czech. In addition to reading in a source sentence, participating systems need to employ a provided dictionary, which contains source-target terminology word mappings, to incorporate into the target translation. For each source sentence in the test set, there are three modes of applying terminology constraints:

1. Terminology constraint: Dictionaries of real terminology words are provided, to be incorporated in the translations.
2. Random constraint: Random (but presumably correct) word mappings are obtained using a word alignment tool and provided as a pseudo-terminology dictionary.

3. No constraint: Source sentences can be freely translated without external information.
We interpret the no-constraint setting as allowing us to measure the competing systems' quality and to understand to what degree the systems effectively utilize the provided random and terminology dictionaries. Our baseline approach is to train a terminology-aware translation (TAT) system inspired by Dinu et al. (2019), where, in the training data, source words are tagged inline with desired translations on the source side. We then propose two separate refinement strategies on top of it to aggressively encourage the appearance of terminologies:

1. We use a neural word aligner to identify terminology constraints missed by the baseline system, and use the same system to re-decode the source by negatively constraining (disallowing) previously incorrectly translated tokens.
2. We also investigate the capability of a large language model to simultaneously paraphrase an existing translation to include the desired terminology constraints via curated prompts.
Our proposed techniques can incorporate target terminology words with around 80% recall, using automatic and soft constraints in a two-step refinement process. We observe that for German-English, our terminology-aware training and negatively constrained decoding perform better, whereas for Chinese-English and English-Czech, LLM-based refinement achieves higher scores. In terms of overall translation accuracy, we find that negatively constrained decoding can lead to a small drop, while LLMs are able to maintain or improve quality according to a reference-free neural metric.

Related Work
Previous research on terminology translation can be divided into two categories, soft constraint and hard constraint, depending on whether the resulting translation system enforces the appearance of desired target translations. In the soft constraint setting, the convention is to train a model that is able to ingest the target terminology words inline, placing them directly after the corresponding source words in the source input (Dinu et al., 2019). Many later implementations stem from this and include new elements such as additional lemmatization (Bergmanis and Pinnis, 2021) or grammatical error correction (Pham et al., 2021) as a post-processing step in order to achieve more fluent output. Instead of placing the target constraint words inline, some other works train a system that takes the terminology constraint as either a prefix or a suffix (Jon et al., 2021; Turcan et al., 2022).
Most hard constraint work involves post-processing a translation with desired terminologies. Post et al. (2019) inserted untranslatable tokens (also known as placeholders) into the source, which remain unchanged through the translation process; the placeholders are then replaced with terminology words in the target language, entirely as a post-processing step. Such terminology replacement can also be done by keeping and replacing the source word at inference time, and it is likewise feasible to run target word replacement as post-processing (Molchanov et al., 2021). A hard constraint method guarantees that the chosen terminology token will appear, but often results in less fluent output, especially for morphologically rich languages, because the context is not taken into consideration during replacement. It also mandates more complicated post-processing than the soft constraint approaches.
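The placeholder scheme described above can be sketched in a few lines. This is a minimal illustration of the general idea, not Post et al.'s implementation; the `<TERMk>` placeholder format and function names are hypothetical.

```python
def inject_placeholders(src, term_dict):
    """Replace each source term with an opaque placeholder that the MT
    system is expected to copy through unchanged."""
    mapping = {}
    for k, (s, t) in enumerate(term_dict.items()):
        ph = f"<TERM{k}>"
        src = src.replace(s, ph)
        mapping[ph] = t  # remember which target term restores this slot
    return src, mapping

def restore_terms(hyp, mapping):
    """Post-processing step: swap placeholders for target terminology."""
    for ph, t in mapping.items():
        hyp = hyp.replace(ph, t)
    return hyp
```

For example, `inject_placeholders("the transformer hums", {"transformer": "变压器"})` yields `"the <TERM0> hums"`, and the placeholder surviving in the hypothesis is later replaced by 变压器. The lack of target-side context during the final substitution is exactly why such output can be disfluent for morphologically rich languages.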
Our first post-processing proposal relies on constrained decoding, which refers to either allowing certain tokens or blocking specific tokens at inference time (Hokamp and Liu, 2017). It has been applied to terminology injection, paraphrasing, parallel sentence mining, etc. (Hasler et al., 2018; Kajiwara, 2019; Chen et al., 2020). We opt for negatively constraining the tokens that violate the given terminology alignments by preventing them from entering the hypothesis beam in the refinement stage. These alignments are computed using word alignment tools (Dyer et al., 2013; Dou and Neubig, 2021).
Another post-processing method in our study prompts an LLM to refine a translation and incorporate terminology terms simultaneously. Whilst previous studies have explored the translation capability of LLMs (Vilar et al., 2023; Zhang et al., 2023), the works most closely related to ours are from Moslem et al. (2023) and Ghazvininejad et al. (2023). We adopt the paradigm from the latter, which re-words a constraint dictionary as natural text and affixes it to a translation prompt. While they focused on rare words without directly benchmarking on terminology translation, our post-processing step can be seen as an extension of word-level controlled prompting to terminology translation with large language models. Both of our post-processing methods should be categorized as soft constraint approaches, since there is no guarantee that negatively constrained decoding or an LLM will necessarily incorporate the constraints in a re-generation.

Terminology-Aware Training
The goal of our system implementation is to create a general-purpose terminology-aware translation system that is unsupervised and domain-agnostic, and requires minimal pre- and post-processing effort.

Terminology creation
Inspired by Dinu et al. (2019), we applied terminology constraints during training, but a key difference is that, unlike their approach, we assume no access to downstream domain or terminology constraints during training, in order to build a general-purpose domain-agnostic system. Consequently, we have no curated terminology data to use. We therefore generate (pseudo-)terminology information using word alignments. Our workflow is as follows:

1. We compute word alignment information for the entire training set using fast_align (Dyer et al., 2013).
2. For each sentence, we select all bijective source-target mappings as our terminology candidates. We also filter out trivial mappings where the source and target tokens are identical (e.g. numbers, names), because those mappings are simple and hence likely to be translated correctly by a translation system even without any terminology awareness.
3. In the training data, we replace srcword_i in the source sentence with:

   srcword_i __target__ trgword_j __done__

   where srcword_i is the i-th source word in the sentence, and trgword_j is the word in the target sentence corresponding to srcword_i according to the word alignment information. This replacement occurs with around 10% probability for each candidate source-target pair. For a sentence that has no associated terminology constraint, the data is the same as in normal NMT.
4. At inference time, we process the test data in the same way, except that the source-target word mapping comes from the supplied terminology dictionary.
In practice, our translation system is trained with a mix of normal translation data and terminology-injected data. The advantage of this strategy is that the trained models are general-purpose, so they can translate normal texts without terminology injection. Further, they have been exposed to a wide variety of constraints during training, making them robust to potentially unseen domain constraints.
Overall, our method is very similar to that of Bergmanis and Pinnis (2021), except that we use whole words rather than lemmas to ease pre-processing. We presume that the language model will be able to adjust the terminologies accordingly, especially for morphologically rich target languages. This makes our method trivially transferable across languages.
Finally, our systems could easily be turned into hard-constrained ones by replacing the source word with the desired target terminology word. This is feasible because our terminology-aware training instils the copying behaviour in the neural translation model, although in this mode the model would produce markedly less fluent output.

Model architecture
We trained Transformer-style machine translation models (Vaswani et al., 2017) using the Marian NMT toolkit (Junczys-Dowmunt et al., 2018). We used the Transformer-Big preset, a 6-encoder, 6-decoder architecture with hidden size 1024 and feed-forward size 4096.

Data
The terminology task uses the same data as the constrained condition in the WMT23 general translation task. We carefully cleaned, filtered, and deduplicated the available WMT training sets provided by the organisers, as well as the available back-translation data. After preprocessing we were left with the following:

• German-to-English (de-en): 199M lines of parallel data and 29.5M lines of back-translated data.
• Chinese-to-English (zh-en): 21.8M lines of parallel data and 15.6M lines of back-translated data.
• Czech-to-English (cs-en): 61.8M lines of parallel data and 57M lines of back-translated data.

Query        | Prompt template
Translation  | Source: ${source} Please give me a translation in ${lang} without any explanation.
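Filling the translation template above is straightforward string substitution; a sketch using Python's standard `string.Template`, where the template text is taken verbatim from the table and the function name is ours:

```python
from string import Template

# Translation prompt template, as listed in Table 1.
TRANSLATE = Template(
    "Source: ${source} Please give me a translation in ${lang} "
    "without any explanation."
)

def build_translation_query(source, lang):
    """Substitute the placeholder variables with string values."""
    return TRANSLATE.substitute(source=source, lang=lang)
```

For example, `build_translation_query("Hallo Welt", "English")` returns the prompt with both placeholders filled, ready to be sent as a single query.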

General quality
The quality of our models without terminology injection is shown in Table 2, where we report BLEU (Papineni et al., 2002) and COMET (wmt22-comet-da, a reference-based metric which requires the source input, hypothesis, and reference) scores.

Post-Translation Terminology Injection
Despite training our model with terminology awareness, there is no mechanism to ensure that the desired terminology constraint will appear on the target side. The neural network decoding behaviour is not entirely predictable, especially given the assumption of no additional domain adaptation. Below, we present two distinct strategies that try harder to promote the terminology constraints: automatic post-editing through constrained beam search, and large language models.

Negatively constrained decoding
While it is easy enough to notice when a target terminology term is not generated as per a given constraint, it is not trivial to determine which word has been produced in place of the desired term. To do this, we make use of awesome-align, a neural multilingual word aligner (Dou and Neubig, 2021), with the following procedure:

1. For each source-translation pair, we check whether all required terminology terms appear on the target side. If they do, we stop processing the pair.
2. Then, we use awesome-align to compute word alignments and detect the word(s) that have been generated in place of the desired terms according to the provided terminology constraints.
3. We decode the source sentence again, penalising the words that violated the terminology constraints by forbidding the decoder from generating them at each generation step, unless they carry more than 95% of the probability mass at that step.
In principle, this procedure could be repeated until all terminology constraints are fulfilled, but we limit it to a single iteration to keep the computational budget realistic for a production scenario.
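The per-step blocking rule in step 3 can be sketched as a small function operating on one decoding step's logits. This is an illustrative stand-in, not Marian's internals; the function name and list-based logits representation are assumptions for the sketch.

```python
import math

def apply_negative_constraint(logits, banned_ids, threshold=0.95):
    """Mask banned token ids at one decoding step, unless a banned token
    already carries more than `threshold` of the probability mass
    (the escape hatch described in step 3)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # stable softmax numerators
    z = sum(exps)
    out = list(logits)
    for tid in banned_ids:
        if exps[tid] / z <= threshold:
            out[tid] = float("-inf")  # token can never enter the beam
    return out
```

The threshold prevents pathological cases where the only plausible continuation is a banned token: if the decoder is more than 95% certain, the ban is lifted for that step rather than forcing a degenerate hypothesis.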

Large language models
Recent years have seen the rise of large language models (LLMs), which show strong capability across various NLP tasks. In this paper, we investigate the effectiveness of using a large language model to generate terminology terms during translation by adding constraints to the prompt. The translation query requests a translation in the target language without any other information. The refinement query, on the other hand, feeds back an unconstrained translation together with terminology constraints to request a new translation. This essentially forms an LLM version of the constrained beam search discussed in Section 4.1. The constraints are enforced through natural language instructions in the prompts, given that the softmax distribution of an LLM is not accessible to users.
The LLM we use is OpenAI's GPT-3.5 (gpt-3.5-turbo-0613, a snapshot of the GPT-3.5 model from 13 June 2023). It is a closed-source commercial system, where the model weights and inference states are not available to users. The model has a context window of 4096 tokens, which is sufficient to cover an instruction, a source sentence, several terminology constraints, and the target translation. It is available to all users at a relatively low cost. In our setting, each translation is carried out in a new query session.
In Table 1 we outline the two prompt templates we used. During querying, the placeholder variables are substituted with the corresponding string values. For the refinement query, when a terminology dictionary is supplied, the source and target words are fed to the LLM via the prompt (Ghazvininejad et al., 2023); if there is no terminology dictionary, the query simply asks for a refined translation. The two-step experiment with LLMs can be summarized as follows:

1. We obtain an initial unconstrained translation, which may or may not fulfil all the terminology constraints. It can come from either the LLM itself or the terminology-aware translation model built in Section 3.1.
2. We query the LLM with the constrained translation prompt to obtain a refined translation, with the terminology incorporated via the prompt.
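The refinement query in step 2 can be sketched as a prompt builder. Note this is an illustrative reconstruction in the spirit of the approach: the exact wording of the refinement template in Table 1 may differ, and the function name is ours.

```python
def build_refinement_query(source, draft, term_dict, lang="English"):
    """Build a refinement prompt: feed back a draft translation together
    with terminology constraints verbalized as natural language
    (following the paradigm of Ghazvininejad et al., 2023)."""
    lines = [f"Source: {source}", f"Draft translation: {draft}"]
    if term_dict:
        hints = ", ".join(f'"{s}" should be translated as "{t}"'
                          for s, t in term_dict.items())
        lines.append(f"Terminology: {hints}.")
    # Without a dictionary, the query simply asks for a refined translation.
    lines.append(f"Please refine the translation into {lang} "
                 "without any explanation.")
    return "\n".join(lines)
```

Each such prompt is sent in a fresh query session, so the constraints are carried entirely by the prompt text rather than by conversational state.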

Results and Discussions
We present our blind test results in Table 3, which include both terminology recall and COMET-QE scores computed by us. We used COMET-QE in particular because it does not require references, which are not accessible to us. We assess the effectiveness of our methods by comparing the terminology recall of our systems with and without applying terminology constraints, in both random and real terminology scenarios.
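A simple way to compute terminology recall is to count, per hypothesis, how many of its required target terms appear in the output. The sketch below is our simplified stand-in for illustration (substring matching, no lemmatization), not necessarily the shared task's official scorer.

```python
def terminology_recall(hypotheses, constraints):
    """Fraction of required target terms found (as substrings) in the
    corresponding hypotheses.

    `constraints[i]` is the list of target terms required for hypothesis i.
    """
    hit = total = 0
    for hyp, terms in zip(hypotheses, constraints):
        for term in terms:
            total += 1
            hit += term in hyp  # exact substring match
        # untranslated or paraphrased terms simply count as misses
    return hit / total if total else 0.0
```

Exact matching understates recall for morphologically rich targets like Czech, where an inflected form of the term should arguably count as a hit; a lemmatized comparison would be needed for that.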

Translation quality
In terms of translation quality as reflected in COMET-QE, we observe that the LLM rows attain superior results, which is not surprising considering that we use an unconstrained commercial model, GPT-3.5. By comparing TAT with TAT+NCD, or LLM with LLM+LLM under a constrained scenario, we conclude that applying terminology constraints usually leads to a sacrifice in translation quality regardless of the language direction or the systems involved. Nonetheless, as a contrasting experiment with no constraint, LLM+LLM achieves a slightly better COMET-QE score than using an LLM to translate without refinement.
Our model performed poorly on the zh-en task in terms of COMET-QE scores. We suspect that this is because of the domain mismatch between the general-domain translation data and the Chinese terminology test set. Upon manual inspection, we found that the latter includes web novels and literary writing, which are likely under-represented in the generic training data.

Terminology recall
Focusing on terminology generation, compared with TAT or LLM in unconstrained settings, TAT achieves 30-40 points higher recall of terminology terms in the constrained terminology and random settings. This indicates that our terminology-aware training is effective in teaching translation models to follow customized source-target word alignments.
Next, as a post-processing step, negatively constrained decoding proves disappointing in practice. TAT+NCD often produces worse results than TAT alone in terms of both quality and terminology recall, except for zh-en with random constraints. We hypothesize two causes: (1) word alignment errors can propagate into this process, and (2) by applying NCD, we might capture a missed terminology term at the cost of mis-translating other words. Our constraining procedure might be improved by shortlisting, namely positively constrained decoding, as opposed to negatively limiting the beam search in an iterative approach.
We find the results promising when using LLMs for terminology injection. Comparing LLM+LLM with LLM alone under various constrained conditions, terminology recall improves significantly with very little drop in overall quality. By comparing TAT+LLM with TAT alone, we observe that TAT and LLMs each have their own merits depending on the language direction. In terms of recall, TAT wins in de-en, TAT+LLM wins in zh-en, and they are close in en-cs; measured by COMET-QE, however, TAT+LLM is well ahead. We must note, though, that an LLM costs significantly more resources than a dedicated translation model at both training and inference time.

Conclusion and Future Work
We participated in all tracks of the WMT 2023 terminology shared task with a terminology-aware translation baseline and two distinct refinement procedures, using negatively constrained beam search and large language models separately. The results gave us insights into the pros and cons of our systems. In future work, we could explicitly enforce the generation of the terminology token by identifying the appropriate time step and manipulating the probability distribution after the softmax computation, which is possible even in an open-source large language model. This is not entirely trivial due to the presence of subwords, but should be achievable.

Table 3 :
Terminology recall and translation quality, measured by COMET-QE, of our systems on the blind test set.