Twist Decoding: Diverse Generators Guide Each Other

Many language generation models are now available for a wide range of generation tasks, including machine translation and summarization. Combining such diverse models may lead to further progress, but ensembling generation models is challenging during inference: conventional ensembling methods (e.g., shallow fusion) require that the models share vocabulary/tokenization schemes. We introduce Twist decoding, a simple and general text generation algorithm that benefits from diverse models at inference time. Our method does not assume the vocabulary, tokenization or even generation order is shared. Our extensive evaluations on machine translation and scientific paper summarization demonstrate that Twist decoding substantially outperforms each model decoded in isolation over various scenarios, including cases where domain-specific and general-purpose models are both available. Twist decoding also consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another. We hope that our work will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with complementary strengths to the currently available models.


Introduction
Natural language generation is an important building block for many applications, such as machine translation, summarization, and question answering (Ng et al., 2019; Lewis et al., 2020; Raffel et al., 2020; Brown et al., 2020; Asai et al., 2021, inter alia). Researchers have recently explored and advanced models for generation in various aspects, including model architecture (Bahdanau et al., 2015; Vaswani et al., 2017), domain adaptation (Chu and Wang, 2018; Bapna and Firat, 2019), prompting (Brown et al., 2020), and even generation order (Gu et al., 2018). The resulting generation models are diverse, trained on different data, with different assumptions, at different times. We hypothesize that diverse generation models may achieve better results through ensembling, if the various approaches have complementary strengths. Given the high cost of unifying approaches during training time (Strubell et al., 2019; Schwartz et al., 2019), inference-time combination of existing models is an attractive alternative.

Figure 1: TWIST decoding of two generation models, f and g, that does not assume a shared vocabulary, tokenization, or generation order. Beam search is first applied to f to generate $y^{(0)}$, followed by output mapping to $\tilde{y}^{(0)}$ (e.g., f's detokenization and g's tokenization or sequence reversal). g is then decoded with beam search augmented with distances from the set of previously generated outputs (here only one sequence y is shown): $g(z_{\le n}) - \lambda_f\, d(z_{\le n}, \tilde{y}^{(0)}_{\le n})$. Subsequently, f is similarly decoded with g's guidance. Here we show one iteration that already achieves substantial improvements (§4). @ indicates the BPE separator.
One well-established ensembling technique is "shallow fusion" (Sutskever et al., 2014; Gulcehre et al., 2015; Firat et al., 2016, inter alia), which aggregates models' scores during beam search. This approach requires, however, that the models use the same vocabulary/tokenization scheme and organize the search in the same way (e.g., autoregressive, left-to-right factorization).
We introduce a new inference algorithm, TWIST decoding (Fig. 1), that enables more diverse generators to guide each other. TWIST decoding can combine generators with different vocabularies, (de)tokenization, and even generation order without any additional training or finetuning. Our method decodes a model by standard beam search, but the scores at every step incorporate a simple function that measures the distance from outputs of the other model. We run this procedure on each generation model in turn, so that both can benefit from each other.
We present extensive experiments on machine translation and scientific paper summarization and show that TWIST decoding can improve performance over each model decoded in isolation across several scenarios: combining 1) generic and domain-specific models, 2) left-to-right and right-to-left generation models, and 3) models that generate using different conditioning inputs. Our results show consistent performance gains from combining generic and domain-specific translation models over a wide range of domains, including medical and legal translation. Applications in these domains require particularly high accuracy, and TWIST decoding is a desirable alternative to standard beam search on a single model. Interestingly, we find that TWIST decoding between generic and domain models is effective even when parallel data from the domain are scarce and the domain model yields poor performance by itself, suggesting complementary strengths of diverse generators (§3.4).
TWIST decoding can be seen as a generalization of reranking heuristics that have proven effective in syntactic parsing (Shen and Joshi, 2003; Charniak and Johnson, 2005; Collins and Koo, 2005), speech recognition (Collins et al., 2005), and machine translation (Shen et al., 2004; Och et al., 2004): one model generates candidate sequences, followed by rescoring from another model. We present extensive comparisons with reranking baselines and demonstrate that TWIST decoding outperforms reranking consistently. We also observe that since the encoder computations on two models can be parallelized, the inference time required for TWIST decoding is much shorter than the sum of the two models, resulting only in a 50% increase, relative to decoding of a single model in isolation (§4). TWIST decoding is therefore a viable alternative to standard beam search on a single model and the widespread reranking heuristic.

TWIST Decoding: g generates z with guidance from f at iteration t
k: beam size. M: maximum length. $V_g$: vocabulary of g. g(·): scoring function. $Y^{(t-1)}$: set of output sequences from f. $Z^{(t)}$: new outputs. $B_n$: beam of continuing sequences. H: expanded hypotheses before beam selection. d(·, ·): distance between partial sequences. $\lambda_f$: scalar coefficient for the distance.
...
8: H.add(⟨s, z ∘ z⟩)
9: $B_n$ ← topk(H), $Z^{(t)}$.add(finished(H))
10: return $Z^{(t)}$

Figure 2: TWIST decoding when g is guided by f. Swap f and g and y and z to obtain $Y^{(t)}$. map_output converts outputs from f to g; e.g., f's detokenization followed by g's tokenization. It would also include sequence reversal if f or g is a right-to-left model. The highlighted line is the only modification that TWIST decoding introduces to standard beam search. The input sequence to g is omitted. See also Kasai et al. (2022b) for the stopping criterion and implementation details (the first come, first served heuristic).

The algorithm is a simple modification of standard beam search (highlighted in Fig. 2): we incorporate into the scoring function the distance from outputs that are previously generated by another model.

Initial Decoding
Let us assume that we have two generation models: f and g. Both f and g assign scores to output sequences. For example, f can be a domain-specific translation model and g a generic one. f and g perform their own pre/postprocessing (e.g., (de)tokenization) and factorization (e.g., left-to-right or right-to-left factorization). Here we suppress for brevity the conditional dependence on the input (e.g., machine translation input from the source language). Standard beam search with a beam size k is first applied to f to produce a set of k output sequences: $Y^{(0)}$. This approximately solves $\operatorname{topk}_{y} f(y)$ by pruned breadth-first search, and often returns higher-quality outputs than the exact search counterpart (Stahlberg and Byrne, 2019; Meister et al., 2020a).
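To make the initial decoding step concrete, the following is a minimal sketch of standard beam search over a toy scoring function. The names `beam_search`, `next_log_probs`, and `toy_model`, the EOS string, and the simple stopping rule (stop once k hypotheses have finished) are illustrative assumptions; they are not the fairseq implementation or the first come, first served heuristic used in the experiments.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # assumed end-of-sequence symbol
Hyp = Tuple[float, Tuple[str, ...]]  # (cumulative log-probability, tokens)


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    k: int,
    max_len: int,
) -> List[Hyp]:
    """Return up to k finished hypotheses, best first."""
    beam: List[Hyp] = [(0.0, ())]  # continuing hypotheses
    finished: List[Hyp] = []
    for _ in range(max_len):
        # Expand every continuing hypothesis by one token.
        expanded: List[Hyp] = []
        for score, prefix in beam:
            for tok, lp in next_log_probs(prefix).items():
                expanded.append((score + lp, prefix + (tok,)))
        # Prune to the k best expansions (pruned breadth-first search).
        expanded.sort(key=lambda h: h[0], reverse=True)
        beam = []
        for hyp in expanded[:k]:
            (finished if hyp[1][-1] == EOS else beam).append(hyp)
        # Simplified stopping rule (an assumption of this sketch).
        if not beam or len(finished) >= k:
            break
    finished.sort(key=lambda h: h[0], reverse=True)
    return finished[:k]


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token log-probabilities for a two-word vocabulary."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {EOS: math.log(0.5), "b": math.log(0.5)},
        ("b",): {EOS: math.log(0.9), "a": math.log(0.1)},
    }
    return table.get(prefix, {EOS: 0.0})


candidates = beam_search(toy_model, k=2, max_len=4)
```

In this toy setting the locally best first token is "a", yet the best complete sequence starts with "b"; keeping k hypotheses alive is what lets the search recover it.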
Output Sequence Mapping. The commonly used technique of ensembling (Sutskever et al., 2014) or shallow fusion (Gulcehre et al., 2015; Stahlberg et al., 2018) adds scores from f and g at every step and executes the same search algorithm to approximately solve $\operatorname{topk}_{y} f(y) + g(y)$. This method thus necessitates a shared vocabulary, tokenization, and generation order (Imamura and Sumita, 2017). We relax this assumption and first map the candidates in $Y^{(t-1)}$ to output sequences for g: $\tilde{Y}^{(t-1)}$ (Line 1 in Fig. 2). This mapping (map_output) typically involves deterministic operations of f's detokenization followed by g's tokenization. Sequence reversal is also performed if f and g generate in the opposite order. For example, if g uses byte-pair encoding (Sennrich et al., 2016b), but f does not, we might have y = John does n't like Mary mapped to $\tilde{y}$ = Jo@ hn doesn't like Mar@ y, where @ denotes subword separation.
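The mapping step can be sketched as below. Only the name map_output comes from the text; the signature, the toy detokenizer `toy_detok_f`, and the toy subword table `TOY_BPE` are hypothetical stand-ins for the real (de)tokenizers and exist only to reproduce the John/Mary example.

```python
from typing import Callable, Dict, List


def map_output(
    tokens_f: List[str],
    detokenize_f: Callable[[List[str]], str],
    tokenize_g: Callable[[str], List[str]],
    f_right_to_left: bool = False,
    g_right_to_left: bool = False,
) -> List[str]:
    """Convert one output of f into the token sequence that g would produce."""
    if f_right_to_left:
        tokens_f = tokens_f[::-1]  # restore natural order before detokenizing
    text = detokenize_f(tokens_f)
    tokens_g = tokenize_g(text)
    if g_right_to_left:
        tokens_g = tokens_g[::-1]  # reverse after tokenization for an R2L g
    return tokens_g


def toy_detok_f(tokens: List[str]) -> str:
    """Toy detokenizer: reattach the clitic "n't" to the preceding word."""
    return " ".join(tokens).replace(" n't", "n't")


# Toy subword table standing in for a learned BPE model (assumption).
TOY_BPE: Dict[str, List[str]] = {"John": ["Jo@", "hn"], "Mary": ["Mar@", "y"]}


def toy_tok_g(text: str) -> List[str]:
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(TOY_BPE.get(word, [word]))
    return pieces


mapped = map_output("John does n't like Mary".split(), toy_detok_f, toy_tok_g)
```

Because the mapping is deterministic and works on surface text, neither model needs to know anything about the other's vocabulary.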
Decoding with Distance Terms. We then decode g with guidance from $\tilde{Y}^{(t-1)}$. Specifically, we perform beam search with a simple modification in scoring (Line 7). In this work, we use a simple distance measure that adds binary distances at all positions (i.e., the Hamming distance):

$$d(z_{\le n}, \tilde{y}_{\le n}) = \sum_{i \le n} \mathbb{1}\{z_i \ne \tilde{y}_i\}$$

We also explored using the distance between (sub)word embeddings from the model: $\sum_{i \le n} \lVert e(z_i) - e(\tilde{y}_i) \rVert_2$, but this did not bring improvements (§4). Note also that when i exceeds the length of $\tilde{y}$, we assume $\tilde{y}_i = \mathrm{EOS}$. The overall distance term is then

$$d(z_{\le n}, \tilde{Y}^{(t-1)}) = \min_{\tilde{y} \in \tilde{Y}^{(t-1)}} d(z_{\le n}, \tilde{y}_{\le n})$$

Here we minimize over the output sequences to compute the distance to the closest candidate. These candidates from $\tilde{Y}^{(t-1)}$ can be equally good outputs but differ only by one word; in such cases, this minimization operation avoids overestimation of the distances. The new score at step n in beam search is now computed by:

$$g(z_{\le n}) - \lambda_f\, d(z_{\le n}, \tilde{Y}^{(t-1)})$$

where $\lambda_f$ is a scalar coefficient for the distance term that controls the importance of f relative to g. We tune $\lambda_f \in \{0.1, 0.3, 1.0, 3.0\}$ during development. After this beam search, we obtain a new candidate set, $Z^{(t)}$. We then run the same beam search (Fig. 2) with the roles of f, Y and g, Z swapped. Namely, we decode f with distance terms from $Z^{(t)}$ at each step of beam search:

$$f(y_{\le n}) - \lambda_g\, d(y_{\le n}, \tilde{Z}^{(t)})$$

Finally, the highest-scoring sequence from $Y^{(t)}$ is output. This process of mutually-guided decoding can be repeated multiple times. We observe, however, that one iteration (t = 1) suffices to bring performance gains (§4). We also present detailed sensitivity analysis over varying $\lambda_f$ and $\lambda_g$ and find that TWIST decoding is particularly effective when $\lambda_g > \lambda_f$ (i.e., initial exploration by g is encouraged with relatively little guidance from f's original outputs; see §4).
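As a rough illustration of the scoring change, the sketch below computes the prefix Hamming distance, its minimum over a candidate set, and the resulting guided score. The function names and the EOS string are assumptions of this sketch, and `g_score` stands for whatever score g assigns to the prefix.

```python
from typing import Sequence

EOS = "</s>"  # assumed end-of-sequence symbol


def hamming_prefix_distance(z: Sequence[str], y: Sequence[str]) -> int:
    """Count mismatches over the positions of prefix z; pad y with EOS."""
    return sum(
        1
        for i, tok in enumerate(z)
        if tok != (y[i] if i < len(y) else EOS)
    )


def distance_to_set(z: Sequence[str], candidates: Sequence[Sequence[str]]) -> int:
    """Distance to the closest mapped candidate from the other model."""
    return min(hamming_prefix_distance(z, y) for y in candidates)


def twist_score(
    g_score: float,
    z: Sequence[str],
    candidates: Sequence[Sequence[str]],
    lam_f: float,
) -> float:
    """Guided score for prefix z: g's score minus the weighted distance."""
    return g_score - lam_f * distance_to_set(z, candidates)


candidates = [["the", "cat", "sat", EOS], ["a", "cat", "sat", EOS]]
score = twist_score(-1.0, ["the", "dog"], candidates, lam_f=0.3)
```

Taking the minimum means a prefix is penalized only for how far it is from its nearest candidate, so near-duplicate candidates do not inflate the penalty.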
Reranking Heuristic as a Special Case. Notice that as $\lambda_f \to \infty$, g's generation falls back to a reranking heuristic: top k sequences from the initial f decoding are reranked according to g. This reranking heuristic has proven successful in a wide range of sequence generation tasks, including machine translation (Shen et al., 2004), syntactic parsing (Collins and Koo, 2005), and speech recognition (Collins et al., 2005). Reranking is performed in many strong machine translation systems to use a right-to-left model to improve a left-to-right model; e.g., top-performing systems in recent WMT competitions (Ng et al., 2019; Kiyono et al., 2020; Wang et al., 2021; Akhbardeh et al., 2021).
In our experiments, we extensively compare performances of TWIST decoding and reranking and demonstrate that the former consistently outperforms the latter.
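The limiting behavior can be checked on a toy example. The sketch below ranks a small pool of complete hypotheses by a distance-penalized score and compares the result with plain reranking; the pool, the scores in `G_SCORES`, and all function names are made up for illustration and do not come from the experiments.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Seq = Tuple[str, ...]
EOS = "</s>"  # assumed end-of-sequence symbol


def hamming(z: Seq, y: Seq) -> int:
    """Mismatches over the positions of z, padding y with EOS."""
    return sum(1 for i, tok in enumerate(z) if tok != (y[i] if i < len(y) else EOS))


def rerank(candidates: Sequence[Seq], g: Callable[[Seq], float]) -> List[Seq]:
    """Reranking baseline: order f's candidates by g's score."""
    return sorted(candidates, key=g, reverse=True)


def guided_rank(
    pool: Sequence[Seq],
    candidates: Sequence[Seq],
    g: Callable[[Seq], float],
    lam: float,
) -> List[Seq]:
    """Order a hypothesis pool by g's score minus lam times the distance."""
    def score(z: Seq) -> float:
        return g(z) - lam * min(hamming(z, y) for y in candidates)
    return sorted(pool, key=score, reverse=True)


# Hypothetical candidates from f, a pool g could reach, and g's scores.
F_CANDIDATES: List[Seq] = [("a", "b", EOS), ("a", "c", EOS)]
POOL: List[Seq] = F_CANDIDATES + [("d", "c", EOS)]
G_SCORES: Dict[Seq, float] = {
    ("a", "b", EOS): -3.0,
    ("a", "c", EOS): -2.0,
    ("d", "c", EOS): -1.0,
}

small = guided_rank(POOL, F_CANDIDATES, G_SCORES.__getitem__, lam=0.1)
large = guided_rank(POOL, F_CANDIDATES, G_SCORES.__getitem__, lam=100.0)
```

With a small coefficient, g is free to prefer a hypothesis outside f's candidate list; with a very large one, every such hypothesis is pushed below the candidates, which are then ordered by g exactly as in reranking.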

Experiments
We present experiments across three scenarios: combining domain and generic models for machine translation (§3.1), left-to-right and right-to-left machine translation models (§3.2), and scientific paper summarization models that take as input different parts of the paper (§3.3). We empirically compare TWIST decoding with decoding in isolation and the widely-adopted reranking baselines, illustrating that TWIST decoding offers performance improvements in various situations without any change to the trained models.

Domain and Generic Models
Machine translation has now been used for many domains, ranging from everyday conversations to medical documents. Machine translation models are often trained on large amounts of parallel data, such as the Europarl corpus (Koehn, 2005) and the OPUS data (Tiedemann, 2012). Applying these models to out-of-domain data remains a challenge (Koehn and Knowles, 2017; Chu and Wang, 2018), and users for some of these domains require high accuracy in translation (e.g., medical and legal documents). We will demonstrate that TWIST decoding between general-purpose and domain-specific models is a viable approach to tackle this problem.
Setups. We use machine translation datasets over diverse domains from prior work (Koehn and Knowles, 2017; Hu et al., 2019): German→English over medical (1.1M training sentence pairs), legal (720K pairs), Koran (religious text, 480K pairs), and subtitles (14M pairs) domains. For the domain-specific models, we train a base-sized transformer model (Vaswani et al., 2017) with a 6-layer encoder and a 6-layer decoder on the training data of each domain. The top-performing German→English system from WMT19 (Barrault et al., 2019; Ng et al., 2019) is used as the generic model. This generic model is a large-sized transformer trained on a concatenation of publicly available parallel data, including the Europarl (Koehn, 2005) and UN (Ziemski et al., 2016) corpora with the backtranslation technique (Sennrich et al., 2016a). We follow (de)tokenization (Koehn et al., 2007) and byte-pair encoding (Sennrich et al., 2016b) of previous work (Koehn and Knowles, 2017; Hu et al., 2019). For every domain, we evaluate a total of six configurations: decoding of the generic and domain models each in isolation; the reranking baseline and TWIST decoding with f being the generic model and g being the domain model, as well as the versions where f and g are swapped to see the effect of the two roles. In all cases, we use beam size 5 (Freitag and Al-Onaizan, 2017) and length penalty 1 (Wu et al., 2016) and conduct all experiments using the fairseq library (Ott et al., 2019). All performance is measured with the COMET score (Rei et al., 2020a,b) and the SACREBLEU implementation (Post, 2018) of the BLEU score (Papineni et al., 2002). Note that COMET is based on crosslingual contextual representations (Conneau et al., 2020), and recent work showed that it achieves significantly higher correlation with expert human judgment than BLEU and other n-gram-based metrics (Kasai et al., 2022a,c). More experimental details are described in Appendix §A.1.

Results. Seen in Table 1 are the results from our experiments over various domains. Firstly, given two translation models f and g, TWIST decoding outperforms the reranking baseline in all configurations (indicated in blue) with only one exception (a small drop in BLEU in the subtitles domain). Particularly noteworthy are the gains in the medical domain: TWIST decoding outperforms the reranking heuristic by 5.8 COMET and 1.4 BLEU points when f is the domain model and g is the generic model. TWIST decoding is thus an effective generalization over the reranking heuristic commonly used in the literature across domains.

Comparing the performance of decoding in isolation and TWIST decoding, we observe that the best score from TWIST decoding substantially outperforms each individual model over all domains: e.g., 81.6 vs. 80.7 (domain model) and 81.6 vs. 44.5 (generic model) COMET points in the medical domain. In both medical and legal domains, the generic model underperforms the domain model by a large margin. Nonetheless, TWIST decoding between the two improves over the domain model, suggesting that TWIST decoding makes use of their complementary strengths. Finally, we see a consistent pattern regarding f and g: both TWIST decoding and the reranking baseline perform better when the higher-performing model is chosen as f (e.g., the domain model performs better in medicine and law, and vice versa in subtitles). This is expected because f is used both for initial decoding and final decoding with g's guidance (Fig. 1).

Left-to-Right and Right-to-Left Models
Language generation models usually factorize sequences autoregressively in a left-to-right order, but previous work showed that left-to-right (L2R) models can be improved by reranking their outputs with a separate right-to-left (R2L) model (Imamura and Sumita, 2017; Ng et al., 2019; Kiyono et al., 2020, inter alia). TWIST decoding can be readily applied to such scenarios since it does not assume shared generation order between models.
Setups. We experiment with two language pairs from the WMT 2020 news translation task (Barrault et al., 2020): Chinese→English (WMT20 ZH-EN, 48M training sentence pairs) and English→German (WMT20 EN-DE, 48M pairs). Submissions for these language pairs to the shared task have human evaluations from professional translators (Freitag et al., 2021), and the correlation between automatic metrics and the human ratings is studied in subsequent work (Kasai et al., 2022a); COMET (Rei et al., 2020b,a) achieves the highest correlation out of the 15+ metrics. Similar to the previous experiments, we measure all performance using COMET and BLEU scores. Note that we use two reference translations per instance for WMT20 ZH-EN and three for WMT20 EN-DE, following Kasai et al. (2022a). They both have reference translations from two different services, and WMT20 EN-DE has an additional translation created by linguists who are asked to paraphrase the two translations as much as possible. These paraphrased translations are shown to increase correlation with human judgments by mitigating the translationese effect (Graham et al., 2020) and diversifying the reference (Freitag et al., 2020). On each dataset, we follow the preprocessing and tokenization (Koehn et al., 2007; Sennrich et al., 2016b) from Kasai et al. (2022a) and train a large-sized transformer model for left-to-right and right-to-left translation, in which the output English/German sequences are reversed after tokenization. We implement all models and decoding with fairseq and apply beam search with beam size 5 and length penalty 1. We again consider a total of six settings: reranking and TWIST decoding with L2R as f and R2L as g or the reverse, as well as the individual models. Further details can be found in Appendix §A.2.
Results. Table 2 shows the results from L2R and R2L translation models. TWIST decoding again outperforms the reranking counterpart by a considerable margin in COMET and BLEU on both language pairs; e.g., 43.1 vs. 41.2 COMET points on WMT20 ZH-EN when f is R2L and g is L2R.
The best performance is achieved by TWIST decoding on both datasets and improves over the individual models by more than 1 BLEU point. The reranking baseline, on the other hand, does not outperform L2R in BLEU when f is R2L: 35.4 vs. 35.5 (ZH-EN) and 45.2 vs. 45.5 (EN-DE). This result illustrates that TWIST decoding is a more effective approach to combine models with different generation order than the popular reranking.

Summarization with Different Input
We also experiment with strong models on a highly abstractive scientific paper summarization task: SciTLDR (Cachola et al., 2020). Specifically, we use two BART-based models from prior work (Cachola et al., 2020) that differ in input type: one that only takes as input the paper abstract (Abst.) and the other a concatenation of the abstract, introduction, and conclusion (AIC).
Setups. We use the train/dev

Low-Resource Scenarios
In our experiments over four diverse domains (§3.1), we assumed that plenty of parallel data is available in every domain, and the domain model generally outperformed the generic model. Concretely, we used 1.1M and 720K training sentence pairs for the medical and legal domains, based on the data splits from previous work (Koehn and Knowles, 2017; Hu et al., 2019). In real-world applications, however, these domain-specific translation data are often scarce since they need to be annotated by bilingual speakers with expertise in those domains. The question arises: can a domain model trained on small parallel data still help the generic model by its complementary strengths? To simulate such low-resource scenarios, we randomly sample {10k, 20k, 40k, 80k} sentence pairs and conduct the same evaluations with the generic and domain models as f and g, respectively. Fig. 3 plots COMET scores of various decoding methods on the medical and legal domains. The score from the generic model is constant because we only change the domain training data. There is a striking trend: even though the domain model performs poorly by itself, it improves the generic model through TWIST decoding over varying sizes. Reranking also helps the generic model as the data size increases, but the improvement is less pronounced than that of TWIST decoding.

Analysis
Iterations. So far, we have only applied one iteration of TWIST decoding, but Fig. 4 plots performance over multiple iterations. Iteration 0 signifies f's initial decoding ($y^{(0)}$ in Fig. 1), and every iteration involves g's decoding with f's guidance ($z^{(t)}$) and its reverse ($y^{(t)}$). We observe that the first iteration brings most of the performance gains. This makes TWIST decoding practically appealing, as it improves performance without much increase in the computation or inference time (see below).
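The iteration scheme can be summarized by the small driver below. It is a sketch under stated assumptions: the decoder callables (`decode_f_initial`, `decode_f_guided`, `decode_g_guided`) and mapping callables are hypothetical placeholders for beam search on each model, and each decoder is assumed to return its candidates sorted best first.

```python
from typing import Callable, List

Seq = List[str]
Guided = Callable[[List[Seq], float], List[Seq]]  # (guides, lambda) -> candidates


def twist_decode(
    decode_f_initial: Callable[[], List[Seq]],
    decode_f_guided: Guided,
    decode_g_guided: Guided,
    map_f_to_g: Callable[[Seq], Seq],
    map_g_to_f: Callable[[Seq], Seq],
    lam_f: float,
    lam_g: float,
    iterations: int = 1,
) -> Seq:
    """Alternate guided decoding of g and f; return f's best final output."""
    y = decode_f_initial()  # iteration 0: plain beam search on f
    for _ in range(iterations):
        # g is decoded with distances to f's mapped outputs, weighted by lam_f.
        z = decode_g_guided([map_f_to_g(s) for s in y], lam_f)
        # f is decoded with distances to g's mapped outputs, weighted by lam_g.
        y = decode_f_guided([map_g_to_f(s) for s in z], lam_g)
    return y[0]  # assumes candidates are sorted best first
```

Each extra iteration costs one more guided decoding pass per model, which is why a single iteration capturing most of the gain matters in practice.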
Inference Time. Table 4 reports the runtime of each decoding method, relative to f's decoding in isolation. We use batch size 1 on the same single A100-SXM GPU and measure the wall-clock time from when all models are loaded until all outputs are obtained. As expected, TWIST decoding results in a slowdown compared to decoding in isolation, but the increase in time is only 50%. The inference time for TWIST decoding is much shorter than the sum of f and g in isolation (1.4× vs. 2.1× on medical translation) because 1) the encoder computation for f and g can be parallelized and 2) the encoder computation for f is done only once while we need two runs of f's decoder. We leave it to future work to further speed up TWIST decoding; since the slowdown of TWIST decoding primarily comes from the decoder, it can be sped up by best-first beam search (Meister et al., 2020b), a deep-encoder, shallow-decoder strategy (Kasai et al., 2021a), or a fast, linear-complexity variant of the transformer decoder (Peng et al., 2021; Kasai et al., 2021b) that is shown to retain the performance of the standard encoder-decoder transformer. Another approach could be sequence-level knowledge distillation (Kim and Rush, 2016), which has proven successful in speeding up an ensemble translation model (Freitag et al., 2017).

Sensitivity Analysis on Distance Coefficients. As discussed in §2.2, $\lambda_f$ and $\lambda_g$ weight the distance terms from f and g respectively. We tuned $\lambda_f$ and $\lambda_g$ on the dev. set from the range of {0.1, 0.3, 1.0, 3.0}. Fig. 5 visualizes how they affect the overall performance on the dev. sets. $\lambda_g > \lambda_f$ generally yields good performance, suggesting the effectiveness of the initial exploration by g with relatively weaker guidance from f.
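The tuning procedure amounts to a 4×4 grid search. A minimal sketch follows, assuming a hypothetical callable `dev_metric(lam_f, lam_g)` that runs TWIST decoding on the dev. set and returns a score where higher is better; the function name `tune_lambdas` is likewise an assumption of this sketch.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

GRID: Tuple[float, ...] = (0.1, 0.3, 1.0, 3.0)  # values stated in the text


def tune_lambdas(
    dev_metric: Callable[[float, float], float],
    grid: Sequence[float] = GRID,
) -> Tuple[float, float]:
    """Return the (lam_f, lam_g) pair with the highest dev. score."""
    return max(product(grid, grid), key=lambda pair: dev_metric(*pair))
```

The search evaluates all 16 pairs, so each call to `dev_metric` should reuse f's initial beam search output where possible, since that output does not depend on either coefficient.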
Variants of Distance Functions. We experiment with two variants of distance terms (Table 5): 1) one candidate, which measures the distance from the 1-best candidate from the other model (vs. minimization over multiple candidates; §2.2) and 2) embed. distance, which calculates the distance based on the Euclidean distance between the embeddings. Here the embeddings are taken from the output layer of the decoder. Overall, both variants yield similar performance to the original distance function, but the one candidate method has a substantial performance drop on WMT20 ZH-EN. Note also that the embed. distance method necessitates additional distance computations between the token embeddings. This result illustrates that our original distance function is a simple yet effective design choice.
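For comparison with the Hamming distance, the embedding-based variant can be sketched as follows. The function name `embed_distance`, the embedding table `emb`, and its toy values are hypothetical; in the experiments the vectors come from the decoder's output layer.

```python
import math
from typing import Dict, Sequence, Tuple

EOS = "</s>"  # assumed end-of-sequence symbol


def embed_distance(
    z: Sequence[str],
    y: Sequence[str],
    emb: Dict[str, Tuple[float, ...]],
) -> float:
    """Sum of Euclidean distances between aligned token embeddings.

    Positions beyond the end of y are compared against the EOS embedding.
    """
    total = 0.0
    for i, tok in enumerate(z):
        ref = y[i] if i < len(y) else EOS
        total += math.dist(emb[tok], emb[ref])
    return total


# Toy two-dimensional embeddings (made-up values).
emb = {"a": (0.0, 0.0), "b": (3.0, 4.0), EOS: (0.0, 0.0)}
d = embed_distance(["a", "b"], ["a", "a"], emb)
```

Unlike the binary mismatch count, this variant needs one vector distance per position and candidate, which is the extra computation noted above.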
Examples. Example outputs are given in Table 6.

[...] et al., 2001, 2002; Matusov et al., 2006; Rosti et al., 2007; Sim et al., 2007; Hildebrand and Vogel, 2008). Most of these methods limit their search space to n-best candidates from individual translation models (Li et al., 2009), contrasting with our TWIST decoding where one model can update its translation outputs under the guidance of another model. Collaborative decoding (Li et al., 2009) trains a separate feature-based scorer that measures the consensus between phrase-based Chinese-to-English translation models. Several recent works proposed inference algorithms for decoding from multiple generators for specific tasks, such as detoxification and abductive reasoning (West et al., 2021; Liu et al., 2021).
Alternatives to Left-to-Right Decoding. We showed that TWIST decoding can be used to benefit from models with diverging generation order. Several prior works proposed approaches for generating text in a different fashion than the standard left-to-right order. For example, much recent work explored non-autoregressive generation (Gu et al., 2018; Lee et al., 2018; Mansimov et al., 2019; Ghazvininejad et al., 2019; Kasai et al., 2020, inter alia) primarily to parallelize and speed up inference. More specifically, several works introduced training and/or inference algorithms that combine left-to-right and right-to-left models for machine translation (Zhou et al., 2019) and commonsense inference (Zaidi et al., 2020). Qin et al. (2020) incorporated right (future) context into a left-to-right language model by iterative gradient-based updates on the output representations. Those algorithms are designed specifically for the combination of left-to-right and right-to-left generation and cannot be easily extended to more general situations, such as diverging tokenization and vocabularies where TWIST decoding has been shown effective.

Table 6: Example outputs from machine translation on the medical domain. For TWIST decoding, f is the domain model, and g is the generic model. In the left section, the generic model fails to capture technical terminology (late dyskinesia vs. tardive dyskinesia for the German term, Spätdyskinesie), and TWIST decoding chooses the correct term of tardive dyskinesia from the domain model. In the right example, the domain model has a problem in coordination (12.1% and 3.2% with aripiprazole vs. 12.1% with aripiprazole and 3.2% with placebo), and TWIST decoding successfully benefits from the accurate translation of the generic model.

Reference
Left: If signs and symptoms of tardive dyskinesia appear in a patient on ABILIFY, dose reduction or discontinuation should be considered.
Right: In placebo-controlled trials, the incidence of akathisia in bipolar patients was 12.1% with aripiprazole and 3.2% with placebo.

Domain
Left: If signs and symptoms of tardive dyskinesia appear in one patient on ABILIFY, a dose reduction or discontinuation should be considered.
Right: In placebo-controlled trials, the incidence of akathisia in bipolar disorder was 12.1% and 3.2% with aripiprazole.

Generic
Left: If a patient treated with ABILIFY shows signs and symptoms of late dyskinesia, it should be considered to reduce the dose or stop treatment.
Right: In placebo-controlled studies, the incidence of akathisia in bipolar patients was 12.1% with aripiprazole and 3.2% with placebo.

TWIST (f: Domain, g: Generic)
Left: If signs and symptoms of tardive dyskinesia appear in one patient on ABILIFY, a dose reduction or discontinuation should be considered.
Right: In placebo-controlled trials, the incidence of akathisia in bipolar patients was 12.1% with aripiprazole and 3.2% with placebo.

Conclusion
We presented TWIST decoding, a general inference algorithm that generates text from diverse models without the assumption of a shared vocabulary, tokenization, or generation order. Our method enables diverse models to guide each other, thereby outperforming individual models over various scenarios, even when one of the models is much weaker because of limited data. We also demonstrated that TWIST decoding can be viewed as a generalization and improvement of the commonly-adopted reranking heuristic. As it only requires a small change in code, we hope that researchers and practitioners will explore complementary strengths of diverse generation models through TWIST decoding.

Limitations
We evaluated our decoding method that combines generation models both on machine translation and scientific paper summarization over several scenarios: combining 1) generic and domain-specific models, 2) left-to-right and right-to-left generation models, and 3) models that generate using different conditioning inputs. Our machine translation experiments span diverse domains, including medical and legal text. We also presented results from recent English-to-German and Chinese-to-English WMT data. Nonetheless, our domain translation experiments are limited to German-to-English, and we only dealt with scientific papers written in English, mainly due to availability of data. There are also many other language generation tasks for which our method can be useful. Since we open-source our codebase built on top of a popular library, we hope that practitioners will use it for applications of their interest and further assess our decoding algorithm in many application scenarios.
Evaluating language generation remains a challenging research problem. We carefully set up our experiments to mitigate potential evaluation issues. The WMT 2020 test data consist only of news text written in the original language, in contrast to the test data from WMT 2018 (Bojar et al., 2018) or earlier. The WMT 2020 EN→DE and DE→EN test data that we used thus come from completely different documents. This avoids the translationese effect that would overestimate the translation performance due to the simplicity of translated text (Graham et al., 2020). Moreover, the WMT 2020 test data for English-to-German and Chinese-to-English translation have multiple reference translations per instance, which increases the correlation of reference-based, automatic evaluations with human judgment (Kasai et al., 2022a). We presented results using automatic metrics from recent work (Rei et al., 2020b) as well as conventional, n-gram overlap metrics (Papineni et al., 2002; Lin, 2004). Recent automatic metrics have been shown to have higher correlation with human judgments, but human judgments are sometimes inconsistent, especially when crowdsourced (Clark et al., 2021; Kasai et al., 2022c). Since our decoding method is a simple modification of the widely-used beam search algorithm, we hope that it will be tested and used in real-world systems of language generation.

[...]ing Hassan et al. (2018). Separately from English, BPE with 32K operations is then applied to Chinese. The decoder input and output embeddings are tied.
WMT20 EN-DE. The same hyperparameters are chosen as in WMT20 ZH-EN (Table 8). We again follow Kasai et al. (2022a)

A.3 SciTLDR
We use two BART-based pretrained models from Cachola et al. (2020): the abstract-only version of BART and the AIC version of $\text{CATTS}_{\text{XSUM}}$. These two models are both BART-based models; $\text{CATTS}_{\text{XSUM}}$ is obtained by finetuning BART on the XSUM dataset (Narayan et al., 2018) with multitask scaffolding (Cachola et al., 2020).

A.4 λ Tuning
We tune $\lambda_f$ and $\lambda_g$ from {0.1, 0.3, 1.0, 3.0}, based on the dev. BLEU/ROUGE-L score on machine translation and paper summarization, respectively. Table 9 reports the selected λ values in all scenarios.

Figure 3: Results when parallel data are scarce in the target domain. Both TWIST decoding and reranking use the generic model as f and the domain model as g. COMET (Rei et al., 2020a) is a regression-based metric that can take negative values. λs are tuned in each case, and we found that as the domain model (g) gets stronger, $\lambda_g$ increases, relative to $\lambda_f$. This observation is aligned with the intuition that $\lambda_g$ indicates the relative importance of g's guidance.

Figure 4: Effects of iterations on dev. performance. Iteration 0 refers to the initial decoding from f. Every iteration consists of g's decoding with f's guidance followed by f's decoding with g's guidance. The values of λs are kept the same over all iterations for simplicity. Initially, we explored gradually increasing the λs as f and g's outputs become closer, but we found no substantial performance gain.

Figure 5: Dev. set performance measured in the COMET score (Rei et al., 2020a,b) with varying $\lambda_f$ and $\lambda_g$. See Appendix §B for other configurations.

Table 1: Results from generic and domain-specific translation models. The generic model is the top-performing translation model in WMT19 (Ng et al., 2019) that is trained on a collection of parallel corpora, such as the Europarl and the UN corpora. Two settings are considered for the reranking baseline and TWIST decoding: f is the generic model and g is the domain model or the reverse. The best scores are in bold. COMET (Rei et al., 2020a,b) uses crosslingual contextual representations (Conneau et al., 2020) and achieves significantly higher correlation with expert human judgment than BLEU (Papineni et al., 2002) and other alternative metrics (Kasai et al., 2022a,c).

Table 4: Inference time relative to a single model decoded in isolation. It is measured on the same single Nvidia A100-SXM GPU with batch size 1. We measure the wall-clock time from when the models are loaded until the last sentence is translated on the test data.

Table 5: Variants of the distance function in TWIST decoding. f is the domain model and g is the generic model for medical translation (German→English). f is R2L and g is L2R for WMT20 Chinese→English.
Seen in Table 6 are example German→English translations from the medical domain. The left section presents a case where the domain model translates the technical term, Spätdyskinesie, into the corresponding English term: tardive dyskinesia. The generic model, on the other hand, generates a literal translation: late dyskinesia. In the right section, the domain model fails to handle the coordinate structure: 12.1% and 3.2% with aripiprazole vs. 12.1% with aripiprazole and 3.2% with placebo. Further, the final output has wording closer to the reference translation: trials vs. studies and bipolar patients vs. bipolar disorder. These examples illustrate that TWIST decoding benefits from the complementary strengths of the domain and generic models.

[...] English and German text by the Moses tokenizer and joint BPE with 32K operations. All embeddings are shared.

Table 8: L2R and R2L translation fairseq hyperparameters and setting. We generally follow the large-sized configuration from Vaswani et al. (2017).