GWLAN: General Word-Level AutocompletioN for Computer-Aided Translation

Computer-aided translation (CAT), the use of software to assist a human translator in the translation process, has been proven to be useful in enhancing the productivity of human translators. Autocompletion, which suggests translation results according to the text pieces provided by human translators, is a core function of CAT. There are two limitations in previous research in this line. First, most research works on this topic focus on sentence-level autocompletion (i.e., generating the whole translation as a sentence based on human input), but word-level autocompletion is under-explored so far. Second, almost no public benchmarks are available for the autocompletion task of CAT. This might be among the reasons why research progress in CAT is much slower compared to automatic MT. In this paper, we propose the task of general word-level autocompletion (GWLAN) from a real-world CAT scenario, and construct the first public benchmark to facilitate research in this topic. In addition, we propose an effective method for GWLAN and compare it with several strong baselines. Experiments demonstrate that our proposed method can give significantly more accurate predictions than the baseline methods on our benchmark datasets.


Wir haben die Meinung von zwei Fachärzten eingeholt.
We asked two experts for their opinion.
We sp their opinion. 2009). In spite of this, MT systems cannot replace human translators, especially in the scenarios with rigorous translation quality requirements (e.g., translating product manuals, patent documents, government policies, and other official documents). Therefore, how to leverage the pros of MT systems to help human translators, namely, Computer-aided translation (CAT), attracts the attention of researchers (Barrachina et al., 2009;Green et al., 2014;Knowles and Koehn, 2016;Santy et al., 2019). Among all CAT technologies (such as translation memory, terminology management, sample sentence search, etc.), autocompletion plays an important role in a CAT system in enhancing translation efficiency. Autocompletion suggests translation results according to the text pieces provided by human translators.
We note two limitations in previous research on the topic of autocompletion for CAT. First, most of previous studies aim to save human efforts by sentence-level autocompletion (Figure 1 a). Nevertheless, word-level autocompletion (Figure 1 b and c) has not been systematically studied. Second, almost no public benchmarks are available for the autocompletion task of CAT. Although some achievements have been made, research progress in CAT is more sluggish than that in automatic MT. The lack of benchmarks has hindered researchers from making continuous progress in this area.
In this work, we propose a General Word-Level AutocompletioN (GWLAN) task, and construct a benchmark with automatic evaluation to facilitate further research progress in CAT. Specifically, the GWLAN task aims to complete the target word for human translators based on a source sentence, translation context as well as human typed characters. Compared with previous work, GWLAN considers four most general types of translation context: prefix, suffix, zero context, and bidirectional context. Besides, as in most real world scenarios, we only know the relative position between input words and the spans of translation context in the GWLAN task. We construct a benchmark for the task, with the goal of supporting automatic evaluation and ensuring a convenient and fair comparison among different methods. The benchmark is built by extracting triples of source sentences, translation contexts, and human typed characters from standard parallel datasets. Accuracy is adopted as the evaluation metric in the benchmark.
To address the variety of context types and weak position information issue, we propose a neural model to complete a word in different types of context as well as a joint training strategy to optimize its parameters. Our model can learn the representation of potential target words in translation and then choose the most possible word based on the human input.
Our contributions are two-fold: • We propose the task of general word-level autocompletion for CAT, and construct the first public benchmark to facilitate research in this topic.
• We propose a joint training strategy to optimize the model parameters on different types of contexts together. 2 As the the MT systems advanced and improved, various efficient interaction ways of CAT have emerged (Vasconcellos and León, 1985;Green et al., 2014;Hokamp and Liu, 2017;Weng et al., 2019;Wang et al., 2020). Among those different methods, the autocompletion is the most related to our work. Therefore, we will first describe previous works in both sentence-level and word-level autocompletion, then show the relation to other tasks and scenarios.
Sentence-level Autocompletion Most of previous work in autocompletion for CAT focus on sentence-level completion. A common use case in this line is interactive machine translation (IMT) (Green et al., 2014;Cheng et al., 2016;Peris et al., 2017;Knowles and Koehn, 2016;Santy et al., 2019). IMT systems utilize MT systems to complete the rest of a translation after human translators editing a prefix translation (Alabau et al., 2014;Zhao et al., 2020). For most IMT systems, the core to achieve this completion is prefix-constrained decoding (Wuebker et al., 2016). Another sentence-level autocompletion method, lexically constrained decoding (LCD) (Hokamp and Liu, 2017;Post and Vilar, 2018), recently attracts lots of attention (Hasler et al., 2018;Susanto et al., 2020;Kajiwara, 2019). Compared with IMT, LCD relaxes the constraints provided by human translators from prefixes to general forms: LCD completes a translation based on some unordered words (i.e., lexical constraints), which are not necessary to be continuous (Hokamp and Liu, 2017;Hu et al., 2019;Dinu et al., 2019;Song et al., 2019). Although it does not need additional training, its inference is typically less efficient compared with the standard NMT. Therefore, other works propose efficient methods Song et al., 2019) by using lexical constraints in a soft manner rather than a hard manner as in LCD.
Word-level Autocompletion Word-level autocompletion for CAT is less studied than sentencelevel autocompletion. Langlais et al. (2000); Santy et al. (2019) consider to complete a target word based on human typed characters and a translation prefix. But they require the target word to be the next word of the translation prefix, which limits its application. In contrast, in our work the proposed word-level autocompletion is more general and can be applied to real-world scenarios such as post-editing (Vasconcellos and León, 1985;Green et al., 2013) and LCD, where human translators need to input some words (corrections or constraints). Huang et al. (2015) propose a method to predict a target word based on human typed characters, however, this method only uses the source side information and does not consider translation context, leading to limited performance compared with our work.
Others Our work may also be related to previous works in input method editors (IME) (Huang et al., 2018;Lee et al., 2007). However, they are in the monolingual setting and not capable of using the useful multilingual information.

Task and Benchmark
In this section, we first describe why we need wordlevel autocompletion in real-world CAT scenarios. We then present the details of the GWLAN task and the construction of benchmark.
Why GWLAN? Word level autocompletion is beneficial for improving input efficiency (Langlais et al., 2000). Previous works assume that the translation context should be a prefix and the desired word is next to the prefix as shown in Figure 1 (b), where the context is "We asked two" and the desired word is "specialists". However, in some real-world CAT scenarios such as post-editing and lexically constrained decoding, translation context may be discontinuous and the input words (corrections or lexical constraints) are not necessarily conjunct to the translation context. As shown in Figure 1 (c), the context is "We · · · their opinion" and the human typed characters "sp" is conjunct to neither "We" nor "their" in the context. Therefore, existing methods can not perform well on such a general scenario. This motivates us to propose a general word-level autocompletion task for CAT.

Task Definition
Suppose x = (x 1 , x 2 , . . . , x m ) is a source sequence, s = (s 1 , s 2 , . . . , s k ) is a sequence of human typed characters, and a translation context is denoted by c = (c l , c r ), where c l = (c l,1 , c l,2 , . . . , c l,i ), c r = (c r,1 , c r,2 , . . . , c r,j ). The translation pieces c l and c r are on the left and right hand side of s, respectively. Formally, given a source sequence x, typed character sequence s and a context c, the general word-level autocompletion (GWLAN) task aims to predict a target word w which is to be placed in the middle be-tween c l and c r to constitute a partial translation. Note that in the partial translation consisting of c l , w and c r , w is not necessary to be consecutive to c l,i or c r,1 . For example, in Figure 1 (c), c l = ("We", ), c r = ("their", "option", "."), s = ("sp", ), the GWLAN task is expected to predict w = "specialists" to constitute a partial translation "We · · · specialists · · · their opinion.", where "· · · " represents zero, one, or more words (i.e., the two words before and after it are not necessarily consecutive).
To make our task more general in real-world scenarios, we assume that the left context c l and right context c r can be empty, which leads to the following four types of context: • Zero-context: both c l and c r are empty; • Suffix: c l is empty; • Prefix: c r is empty; • Bi-context: neither c l nor c r is empty. With the tuple (x, s, c), the GWLAN task is to predict the human desired word w.
Relation to most similar tasks Some similar techniques have been explored in CAT. Green et al. (2014) and Knowles and Koehn (2016) studied a autocompletion scenario called translation prediction (TP), which provides suggestions of the next word (or phrase) given a prefix. Besides the strict assumption of translation context (i.e., prefix here), compared with GWLAN, another difference is that the information of human typed characters is ignored in their setting. There also exist some works that consider the human typed sequences (Huang et al., 2015;Santy et al., 2019), but they only consider a specific type of translation contexts. Huang et al. (2015) propose to complete a target word based on the zero-context assumption. Despite its flexibility, this method is unable to explore translation contexts to improve the autocompletion performance. The word-level autocompletion methods in Langlais et al. (2000); Santy et al. (2019) have the same assumption as TP, which impedes the use of their methods under the scenarios like post editing and lexically constrained decoding, where human inputs are not necessarily conjunct to the variety of translation contexts.

Benchmark Construction
To set up a benchmark, firstly we should create a large scale dataset including tuples of (x, s, c, w) for training and evaluating GWLAN models. Ideally, we may hire professional translators to man-ually annotate such a dataset, but it is too costly in practice. Therefore, in this work, we propose to automatically construct the dataset from parallel datasets which is originally used in automatic machine translation tasks. The procedure for constructing our data is the same for train, validation, and test sets. And we construct a dataset for each type of translation context.
Assume we are given a parallel dataset {(x i , y i )}, where y i is the reference translation of x i . Then, we can automatically construct the data c i and s i by randomly sampling from y i . We first sample a word w = y i k and then demonstrate how to extract c i for different translation contexts: • Zero-context: both c l and c r are empty; • Suffix: randomly sample a translation piece c r = y p r,1 :p r,2 from y, where k < p r,1 < p r,2 . The c l is empty here; • Prefix: randomly sample a translation piece c l = y p l,1 :p l,2 from y, where p l,1 < p l,2 < k. The c r is empty here; • Bi-context: sample c l as in prefix, and sample c r as in suffix.
Then we have to simulate the human typed characters s based on w. For languages like English and German, we sample a position p from the character sequence and the human input s = w 1:p , where 1 ≤ p < L w . For languages like Chinese, the human input is the phonetic symbols of the word, since the word cannot be directly typed into the computer. Therefore, we have to convert w to phonetic symbols that are characters in alphabet and sample s from phonetic symbols like we did on English.
Evaluation Metric To evaluate the performance of the well-trained models, we choose accuracy as the evaluation metric: where N match is the number of words that are correctly predicted and N all is the number of testing examples.

Proposed Approach
Given a tuple (x, c, s), our approach decomposes the whole word autocompletion process into two parts: model the distribution of the target word w based on the source sequence x and the translation context c, and find the most possible word w based on the distribution and human typed sequence s. Therefore, in the following subsections, we firstly propose a word prediction model (WPM) to define the distribution p(w|x, c) of the target word w ( §4.1). Then we can treat the human input sequence s as soft constraints or hard constraints to complete s and obtain the target word w ( §4.2). Finally, we present two strategies for training and inference ( §4.3).

Word Prediction Model
The purpose of WPM is to model the distribution p(w|x, c). More concretely, we will use a single placeholder [MASK] to represent the unknown target word w, and use the representation of [MASK] learned from WPM to predict it. Formally, given the source sequence x, and the translation context c = (c l , c r ), the possibility of the target word w is: where h is the representation of [MASK], φ is a linear network that projects the hidden representation h to a vector with dimension of target vocabulary size V , and softmax(d)[w] takes the component regarding to w after the softmax operation over a vector d ∈ R V .
Inspired by the attention-based architectures (Vaswani et al., 2017;Devlin et al., 2019) < l a t e x i t s h a 1 _ b a s e 6 4 = " c X N t n B I N X c m w h k w / Y L G 2 V t 3 d 2 9 / Y r B 4 c t E 6 e a 8 S a L Z a w 7 A T V c C s W b K F D y T q I 5 j Q L J 2 8 H 4 N v f b T 1 w b E a s H n C T c j + h Q i V A w i l Z q 9 Q Y x m n K / U n V r 7 g x k m X g F q U K B R r / y Z Y M s j b h C J q k x X c 9 N 0 M + o R s E k n 5 Z 7 q e E J Z W M 6 5 F 1 L F Y 2 4 8 b P Z t l N y a p U B C W N t n 0 I y U 3 8 n M h o Z M 4 k C O x l R H J l F L x f / 8 7 o p h t d + J l S S I l d s / l G Y S o I x y U 8 n A 6 E 5 Q z m x h D I t 7 K 6 E j a i m D G 1 B e Q n e 4 s n L p H V e 8 y 5 r 7 v 1 F t X 5 T 1 F G C Y z i B M / D g C u p w B w 1 o A o N H e I Z X e H N i 5 8 V 5 d z 7 m o y t O k T m C P 3 A + f w A o 8 4 7 a < / l a t e x i t > . . .

aircraft [MASK]
< l a t e x i t s h a 1 _ b a s e 6 4 = " c X N t n B I N X c m w h k w K N 8 r h F M u 8 C H E = " > A A A B 7 X i c b V D L S s N A F L 3 x W e u r 6 t L N Y B F c l U Q U X R b d u K x g H 9 C G M p l O 2 r G T S Z i 5 E U r o P 7 h x o Y h b / 8 e d f + O k z U J b D w w c z r m H u f c E i R Q G X f f b W V l d W 9 / Y L G 2 V t 3 d 2 9 / Y r B 4 c t E 6 e a 8 S a L Z a w 7 A T V c C s W b K F D y T q I 5 j Q L J 2 8 H 4 N v f b T 1 w b E a s H n C T c j + h Q i V A w i l Z q 9 Q Y x m n K / U n V r 7 g x k m X g F q U K B R r / y Z Y M s j b h C J q k x X c 9 N 0 M + o R s E k n 5 Z 7 q e E J Z W M 6 5 F 1 L F Y 2 4 8 b P Z t l N y a p U B C W N t n 0 I y U use a dual-encoder architecture to learn the representation h based on source sequence x and translation context c. Our model has a source encoder and a cross-lingual encoder. The source encoder of WPM is the same as the Transformer encoder, which is used to encode the source sequence x. As shown in Figure 2, the output of source encoder will be passed to the cross-lingual encoder later. The cross-lingual encoder is similar to the Transformer decoder, while the only difference is that we replace the auto-regressive attention (ARA) layer by a bidirectional masked attention (BMA) module, due to that the ARA layer cannot use the leftward information flow (i.e., c r ). Specifically, the BMA module is built by a multiple-layer self attention network. As shown in Figure 3, in each layer of BMA, each token in the attention query can attend to all words in translation context c l and c r . In addition, the input consists of three parts, the [MASK] token, and translation contexts c l and c r , as illustrated in Figure 3. Note that its position embeddings E are only used to represent the relative position relationship between tokens. Taking the sentence in Figure 3 as an example, E 3 does not precisely specify the position of the target word w but roughly indicates that w is on the right-hand-side of c l and on the left-hand-side of c r . Finally, the representation of [MASK] as learnt by BMA will be passed to Add & Norm layer as shown in Figure 2.

Human Input Autocompletion
After learning the representation h of the [MASK] token, there are two ways to use the human input sequence s to determinate the human desired word. Firstly, we can learn the representation of s and use it as a soft constraint while predicting word w. Taking the sentence in Figure 3 as an example, supposing the human typed sequence is s = "des", we can use an RNN network to learn the representation of des and concatenate it with h to predict the word descending. Alternatively, we can use des as a hard constraint: where P (·|·) is the probability distribution defined in Eq.
(2) and Z is the normalization term independent on w. Then we pick w * = arg max w P s [w] as the most possible word. In our preliminary experiments, the performances of these two methods are comparable, and there is no significant gain when we use them together. One main reason is that the model can already learn the starts-with action precisely in the soft constraint method. Therefore, we propose to use the human inputs as hard constraints in our later experiments, because of the method's efficiency and simplicity.

Training and Inference Strategy
Suppose D denotes the training data for GWLAN, i.e., a set of tuples (x, c, s, w). Since there are four different types of context in D as presented in §3, we can split D into four subsets D zero , D prefix , D suffix and D bi . To yield good performances on those four types of translation context, we also propose two training strategies. The inference strategy differs accordingly.
Strategy 1: One Context Type One Model For this strategy, we will train a model for each translation context, respectively. Specifically, for each type of context t ∈ {zero, prefix, suffix, bi}, we independently train one model θ t by minimizing the following loss L(D t , θ): where P (w|x, c; θ) is the WPM model defined in Eq. 2, |D t | is the size of training dataset D t , and t can be any type of translation context. In this way, we actually obtain four models in total after training. In the inference process, for each testing instance (x, c l , c r , s), we decide its context type t in terms of c l and c r and then useθ t to predict the word w.
Strategy 2: Joint Model The separate training strategy is straightforward. However, it may also make the models struck in the local optimal. To address these issues, we also propose a joint training strategy, which has the ability to stretch the model out of the local optimal once the parameters is over-fitting on one particular translation context. Therefore, using the joint training strategy, we train a single model for all types of translation context by minimizing the following objective: where each L(D t ; θ) is as defined in Eq. 3. In this way, we actually obtain a single modelθ after training. In the inference process, for each testing instance (x, c l , c r , s) we always useθ to predict the target word w.

Datasets
We carry out experiments on four GWLAN tasks including bidirectional Chinese-English tasks and German-English tasks. The benchmarks for our experiments are based on the public translation datasets. The training set for two directional Chinese-English tasks consists of 1.25M bilingual sentence pairs from LDC corpora. The toolkit we used to convert Chinese word w to phonetic symbols is pypinyin 4 . As discussed in ( §3.2), the training data for GWLAN is extracted from 1.25M sentence pairs. The validation data for GWLAN is extracted from NIST02 and the test datasets for GWLAN are constructed from NIST05 and NIST06. For two directional German-English tasks, we use the WMT14 dataset preprocessed by Stanford 5 . The validation and test sets for our tasks are based on newstest13 and newstest14 respectively. For each dataset, the models are tuned and selected based on the validation set.
The main strategies we used to prepare our benchmarks are shown in §3.2. However, lots of trivial instances may be included if we directly use the uniform distribution for sampling, e.g., predicting word "the" given "th". Therefore, we apply some intuitive rules to reduce the probability of trivial instances. For example, we assign higher probability for words with more than 4 characters in English and 2 characters in Chinese, and we require that the lengths of input character sequence s and translation contexts c should not be too long.

Systems for Comparison
In the experiments, we evaluate and compare the performance of our methods (WPM-Sep and WPM-Joint) and a few baselines. They are illustrated below, WPM-SEP is our approach with the "one context one model" training and inference strategy in Section §4.3. In other words, we train our model for each translation context separately.
WPM-JOINT is our approach with the "joint model" strategy in Section §4.3.

TRANSTABLE:
We train an alignment model 6 on the training set and build a word-level translation table. While testing, we can find the translations of all source words based on this table, and select out valid translations based on the human input. The word with highest frequency among all candidates is regarded as the prediction. This baseline is inspired by Huang et al. (2015).

TRANS-PE:
We train a vanilla NMT model using the Transformer-base model. During the inference process, we use the context on the left hand side of human input as the model input, and return the most possible words based on the probability of valid words selected out by the human input. This baseline is inspired by Langlais et al. (2000); Santy et al. (2019).

TRANS-NPE:
As another baseline, we also train an NMT model based on Transformer, but without position encoding on the target side. While testing, we use the averaged hidden vectors of all the target words outputted by the last decoder layer to predict the potential candidates. Table 1 shows the main results of our methods and three baselines on the test sets of Chinese-English and German-English datasets. It is clear from the results that our methods WPM-SEP and WPM-JOINT significantly outperform the three baseline methods. Results on Row 4 and Row 5 of Table  1 also show that the WPM-JOINT method, which uses a joint training strategy to optimize a single model, achieves better overall performance than WPM-SEP, which trains four models for different translation contexts respectively. In-depth analysis about the two training strategies is presented in the next section. The method TRANS-PE, which assumes the human input is the next word of the given context, behaves poorly under the more general setting. As the results of TRANS-NPE show, when we use the same model as TRANS-PE and relax the constraint of position by removing the position encoding, the accuracy of the model improves. One interesting finding is that the TRANSTABLE method, which is only capable of leveraging the zero-context, achieves good results on the Chinese-English task when the target language is English. However, when the target language is Chinese, the performance of TRANSTABLE drops significantly.

Main Results
6 Experimental Analysis

Effects on Different Translation Context
In this section, we presents more detailed results on the four translation contexts and analyze the features of GWLAN. These analyses can help us to better understand the task and propose effective approaches in the future.
Separate Training VS. Joint Training Compared with WPM-SEP, WPM-JOINT shows two advantages. On one hand, even there is only one model, WPM-JOINT yields better performances than WPM-SEP, enabling simpler deployment. This may be caused by that training on multiple related tasks can force the model learn more expressive representations, avoiding over-fitting. On the other hand, the variance of results on different translation contexts of WPM-JOINT is smaller, which can provide an more steady autocompletion service. From the viewpoint of joint training, the lower variance may be caused by that WPM-JOINT spends more efforts to minimize the one with maximal risk (i.e., zero-context), although sometimes it may slightly sacrifice the task with minimal risk (i.e., bi-context).
The results of WPM-SEP and WPM-JOINT also have some shared patterns. Firstly, the performances of the two methods on prefix and suffix translation contexts are nearly the same. Although the prefix and suffix may play different roles in the SVO language structure, they have little impact on the the autocompletion accuracy using our method. Moreover, among the results on four translation contexts, the performances on bi-context are better than prefix and suffix, and prefix and suffix are better than zero-context. This finding shows that more context information can help to reduce the uncertainty of human desired words.
Comparison with baselines The TRANS-PE method in previous works is more sensitive to the position of human input. The statistical results shows that the averaged distances in the original sentence between the prediction words and translation contexts are various for different translation contexts, which are 7.4, 6.5, 14.1, and 3.2 for prefix, suffix, zero-context, and bi-context, respectively. When the desired words are much closer to the context, TRANS-PE can achieve better performances. Moreover, TRANS-PE can achieve more than 80 accuracy scores when the prediction word is the next word of the given prefix, however, its performance drops significantly when the word is not necessarily conjunct to the prefix. We can also find that TRANS-NPE, which removes the position information of target words, achieves better overall performances compared with TRANS-PE.
In contrast, the performance of TRANSTABLE is less affected by the position of the prediction words, which is demonstrated by the low variances on both tasks in Table 2. The results of TRANSTA-BLE have also surprised us, which achieves more than 41 accuracy scores on the Zh⇒En task. This observation shows the importance of alignment and the potential of statistical models. Compared with the results on the Zh⇒En task, the overall accu- Figure 4: Robustness Analysis. The x-axis represents the percentage of words that have been replaced by noise tokens in NIST02. The model used for this analysis is the WPM-JOINT, which is trained on the Zh⇒En task without noisy translation context. racy on En⇒Zh task is much lower, likely due to that the number of valid words after filtered by the human input on Chinese is much more than that on English. Therefore, it is easier for TRANSTABLE to determine the human desired words in English.

Robustness on Noisy Contexts
In this work, the translation contexts are simulated using the references. However, in real-world scenarios, translation contexts may not be perfect, i.e., some words in the translation contexts may be incorrect. In this section, we evaluate the robustness of our model on noisy contexts. We first use the translation table constructed by TRANSTABLE to find some target words that share the same source words with the original target words, and then use those found words as noise tokens.
The robustness results are shown in Figure 4. For all the translation context types except for zerocontext, the performance drops slowly when the percentage of noise tokens increases. However, even with 80% words in the context, the performance of WPM-JOINT outperforms the case of zero-context, which shows that our WPM-JOINT method is noise tolerant.

Discussion
In this work, we formalize the task as a classification problem. However, the generation formalization also deserves to be explored in the future. For example, the generation may happen in two circumstances: word-level completion based on subwords, and phrase-level completion. In the first case, although the autocompletion service provided for human translators is word-level, in the internal system we can generate a sequence of subwords (Sennrich et al., 2015) that satisfy the human typed characters, and provide human translators with the merged subwords. This subword sequence generation can significantly alleviate the OOV issue in the word-level autocompletion. In the phrase-level autocompletion case, if we can predict more than one desired words, the translation efficiency and experience may be improved further. We would like to leave it as future work.
It is also worth noting that we did not conduct human studies in this work. We think evidences in previous work can already prove the effectiveness of word-level autocompletion when assisting human translators. For example, TransType (Langlais et al., 2000) is a simple rule-based tool that only considers the prefix context, but the majority of translators said that TransType improved their typing speed a lot. Huang et al. (2015) hired 12 professional translators and systematically evaluate their word autocompletion tool based on zero-context. Experiments show that the more keystrokes are reduced, the more time can be saved for translators. Since the prediction accuracy is highly correlated with the keystrokes, we think higher accuracy will make translators more productive. That is the main reason that we use accuracy to automatically evaluate the model performance. Besides, the automatic evaluation metric also makes the GWLAN task easier to follow.

Conclusion
We propose a General Word-Level Autocomple-tioN (GWLAN) task for computer-aided translation (CAT). In our setting, we relax the strict constraints on the translation contexts in previous work, and abstract four most general translation contexts used in real-world CAT scenarios. We propose two approaches to address the variety of context types and weak position information issues in GWLAN. To support automatic evaluation and to ensure a convenient and fair comparison among different methods, we construct a benchmark for the task. Experiments on this benchmark show that our method outperforms baseline methods by a large margin on four datasets. We believe that this benchmark to be released will push forward future research in CAT.