Quantifying the Influence of MT Output in the Translators’ Performance: A Case Study in Technical Translation

This paper presents experiments on the use of machine translation output for technical translation. MT output was used to produce translation memories that were used with a commercial CAT tool. Our experiments investigate the impact of different translation memories containing MT output on translation quality and speed, compared to the same task carried out without a translation memory. We evaluated the performance of 15 novice translators translating technical English texts into German. Results suggest that translators are on average over 28% faster when using a TM.


Introduction
Professional translators use a number of tools to increase the consistency, quality and speed of their work. These tools include spell checkers, text processing software, terminological databases and others. Among all the tools used by professional translators, the most important nowadays is translation memory (TM) software. TM software uses parallel corpora of previously translated examples to serve as models for new translations. Translators then validate or correct previously translated segments and translate new ones, increasing the size of the memory with each newly translated segment.
One of the major issues in working with TMs is producing the TM itself. This can be time consuming, and the memory should ideally contain a large number of translated segments to be considered useful and accurate. For this reason, many novice translators do not see the benefits of using a TM right at the beginning, although there is a consensus that in the long run the use of TMs increases the quality and speed of their work. To cope with this limitation, more and more TM software provides an interface to machine translation (MT) software. MT output can be used to suggest new segments that were not previously translated by a human translator but generated automatically by an MT system. But how helpful are these translations?
To answer this question, the experiments proposed in this paper focus on the translator's performance when using TMs produced from MT output within a commercial CAT tool interface. We evaluate the quality of the translation output as well as the time and effort taken to accomplish each task. The impact of MT and TM on translators' performance has been explored and quantified in different settings (Bowker, 2005; Guerberof, 2009; Guerberof, 2012; Morado Vazquez et al., 2013). We believe this paper constitutes another interesting contribution to the interface between the study of the performance of human translators, CAT tools and machine translation.

Related Work
CAT tools have become very popular in the last 20 years. They are used by freelance translators as well as by companies and language service providers to increase translation quality and speed (Somers and Diaz, 2004; Lagoudaki, 2008). The use of CAT tools is part of the core curriculum of most translation studies degrees, and a reasonable level of proficiency in the use of these tools is expected from all graduates. With the improvement of state-of-the-art MT software, a recent trend in CAT research is its integration with machine translation tools, as for example in the MateCat project (Cettolo et al., 2013).
A considerable number of studies on MT post-editing have been published in recent years (Specia, 2011; Green et al., 2013). Due to the scope of our paper (and space limitations) we deliberately do not discuss the findings of these experiments and instead focus on those that involve the use of translation memories. Post-editing tools are substantially different from commercial CAT tools (such as the one used here), and even though the TMs used in our experiments were produced using MT output, we believe that our experimental setting has more in common with studies that investigate TMs than with studies of MT post-editing.
The study by Bowker (2005) was one of the first to quantify the influence of TM on translators' work. The experiment divided translators into three groups: A, B and C. Translators in Group A did not use a TM, translators in Group B used an unmodified TM, and translators in Group C used a TM that had been deliberately modified with a number of translation errors. The study concluded that, when faced with time pressure, translators using TMs tend not to be critical enough of the suggestions presented by the software.
Another similar experiment (Guerberof, 2009) compared the productivity and quality of human translations using MT and TM output. The experiment started from the hypothesis that the time invested in post-editing one string of machine-translated text corresponds to the time invested in editing a fuzzy-matched string located in the 80-90 percent range. This study quantified the performance of 8 translators using a post-editing tool. According to the author, the results indicate that using a TM with 80 to 90 percent fuzzy matches produces more errors than using MT segments or human translation.
The aforementioned recent work by Morado Vazquez et al. (2013) investigates the performance of twelve human translators (students) using the ACCEPT post-editing tool. The researchers provided MT and TM output and compared time, quality and keystroke effort. The findings of this study indicate that the use of a specific MT system has a great impact on the translation activity in all three aspects. In the context of software localization, productivity was also tested by Plitt and Masselot (2010), combining MT output and a post-editing tool. Another study compared the performance of human translators in a scenario using TMs and a commercial CAT tool (Across) with a second scenario using post-editing (Läubli et al., 2013).
In our study we used a commercial CAT tool, SDL Trados Studio 2014, instead of a post-editing tool. A setting similar to ours was explored by Federico et al. (2012) using SDL Trados Studio integrated with commercial MT software. We decided to work with a commercial CAT tool for two reasons: first, because this is the real-world scenario faced by translators in most companies and language service providers, and second, because it allows us to explore a variable that the aforementioned studies did not substantially explore, namely MT output as TM segments.

Setting the Experiment
In our experiments we provided short texts from the domain of software development, containing up to 343 tokens each, to 15 beginner translators. The average length of these texts ranges from 210 tokens in experiment 1 to 264 tokens in experiment 3, divided into 15 to 17 segments on average (see table 2). Translators were given English texts and were asked to translate them into German, their mother tongue. One important remark is that none of the 15 participants was aware that the TMs we made available had been produced using MT output.
The 15 translators who participated in these experiments are all third-semester master's degree students who have completed a bachelor's degree in translation studies and are familiar with CAT tools. All of them attended at least 20 class hours on TM software and related technologies. All participants were proficient in English and had studied it as a foreign language at bachelor level.
As previously mentioned, the CAT tool used in these experiments is the most recent version of SDL Trados, the Studio 2014 version. Translators were given three different short texts to be translated in three different scenarios:
1. Using no translation memory.
2. Using a translation memory collected with modified MT examples.
3. Using a translation memory collected with unmodified MT examples.
In experiment number two we performed a number of modifications in the TM segments. As can be seen in table 1, these modifications were sufficient to alter the coverage of the TM, but did not introduce translation errors to the memory. 4 The alterations we performed, along with an example of each of them, can be summarized as follows:
• Deletion: 'To paste the text currently in the clipboard, use the Edit Paste menu item.' -> 'To paste the text, use the Edit Paste menu item.'
• Modification: 'Persistent Selection is disabled by default.' -> 'Persistent Selection is enabled by default.'
• Substitution: 'The editor is composed of the following components:' -> 'The editor is composed of the following elements:'
Three texts were available per scenario, each of them with different TM coverage scores (see table 1). Students were asked to translate the texts at their own pace without time limitation and were allowed to use external linguistic resources such as dictionaries, lexica, parallel concordancers, etc.
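To illustrate how small alterations of this kind lower the match score of a TM entry, the following minimal sketch scores the three example pairs above with a token-level similarity ratio. It is an assumption made for illustration only: the function name fuzzy_match is hypothetical, the ratio comes from Python's standard difflib module, and commercial CAT tools such as SDL Trados Studio use their own (proprietary) fuzzy match scoring, which will generally give different percentages.

```python
from difflib import SequenceMatcher


def fuzzy_match(segment: str, tm_source: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 plays the role of a 100% match.

    Illustrative stand-in for a CAT tool's fuzzy match score, not the
    scoring actually used by any commercial tool.
    """
    return SequenceMatcher(None, segment.split(), tm_source.split()).ratio()


# Example pairs quoted in the text: (original segment, modified TM entry).
examples = {
    "deletion": (
        "To paste the text currently in the clipboard, use the Edit Paste menu item.",
        "To paste the text, use the Edit Paste menu item.",
    ),
    "modification": (
        "Persistent Selection is disabled by default.",
        "Persistent Selection is enabled by default.",
    ),
    "substitution": (
        "The editor is composed of the following components:",
        "The editor is composed of the following elements:",
    ),
}

scores = {kind: fuzzy_match(orig, mod) for kind, (orig, mod) in examples.items()}

for kind, score in scores.items():
    print(f"{kind}: {score:.0%} match")
```

Under this simplified measure, each altered entry drops below a 100% match while remaining a usable fuzzy match, which is the effect the modified TM is meant to have on coverage.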

Corpus and TM
The corpus used for these experiments is the KDE corpus obtained from the Opus 5 repository (Tiedemann, 2012). The corpus contains texts from the domain of software engineering, hence the title: 'a case study in technical translation'. We are convinced that technical translation contains a substantial amount of fixed expressions and technical terms different from, for example, news texts. This makes technical translation, to our understanding, an interesting domain for the use of TM by professional translators and for experiments of this kind.
In scenarios 1, 2 and 3 we measured different aspects of translation such as time and edited segments. One known shortcoming of our experiment design is that, unlike in most post-editing software, the reports available in CAT tools are quite poor (e.g. no information about keystrokes is provided). Even so, we kept to our decision to use TM software and tried to compensate for this shortcoming with a careful qualitative and quantitative data analysis after the experiments.
4 Modifications were carried out in the source and target languages.
5 http://opus.lingfil.uu.se/

Results
We observed a performance gain when using either of the two TMs, as expected. The results varied according to the coverage of the TM. In experiment number 3, texts contained on average over 7 segments with 100% matches, whereas in experiment number 2 they contained only 2.68. This allowed translators to finish the task faster in experiment number 3. The average results obtained in the different experiments are presented in tabular form.
Apart from the expected performance gain when using a TM, we also found a considerable difference between the use of the modified and the unmodified TM. Translators completed segments in experiment number 3, on average, 33.77% faster than in experiment number 2. The difference in coverage between the two TMs was 4.93%, which suggests that a few percentage points of TM coverage can result in a much larger performance boost.
We also have to acknowledge that the experiments were carried out by translators in the same order in which they are presented in this paper. This may, of course, influence performance in all three experiments, as translators were more used to the task towards the end of the experiment. One hypothesis is that the poor performance in experiment 1 could be improved if this task were done last and, conversely, that the performance boost observed in experiment 3 could be somewhat lower if this experiment were done first. This variable was not explored in similar productivity studies such as those presented in section two and, to our understanding, inverting the order of tasks could be an interesting variable to test in future experiments.
As a general remark, although all translators had experience with the 2014 version of Trados Studio, we observed that at least half of the group had great difficulty performing simple tasks in Windows. Simple operations such as copying, renaming and moving files or creating folders in the file system were very time consuming. The Trados interface also posed difficulties for translators. For example, the generation of reports through batch tasks in a different window was confusing for most translators. These operations could be simplified, as they are in other CAT tools such as memoQ.
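The percentages reported in this section can be related to one another with simple arithmetic. The sketch below is illustrative only: it assumes that 'X% faster' means X% less time per segment, normalises the time per segment without a TM to 1.0, and uses the hypothetical helper relative_gain. It takes the gains over the no-TM baseline reported in this paper (28.87% with the modified TM and 52.82% with the unmodified TM) and derives the gain of the unmodified over the modified TM.

```python
def relative_gain(time_baseline: float, time_new: float) -> float:
    """Fraction of time saved when moving from the baseline to the new setting."""
    return 1.0 - time_new / time_baseline


# Assumption: time per segment without a TM is normalised to 1.0.
time_no_tm = 1.0
time_modified_tm = time_no_tm * (1 - 0.2887)    # 28.87% faster than no TM
time_unmodified_tm = time_no_tm * (1 - 0.5282)  # 52.82% faster than no TM

gain_3_vs_2 = relative_gain(time_modified_tm, time_unmodified_tm)
print(f"Unmodified vs. modified TM: {gain_3_vs_2:.2%} less time per segment")
```

Under this reading, the derived gain is roughly 33.7%, close to the 33.77% reported for experiment 3 relative to experiment 2; the small difference may stem from rounding or from how the averages were taken.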

A Glance at Quality Estimation
One of the future directions that this work will take is to investigate the quality of human translations. Our initial hypothesis is that it is possible to apply state-of-the-art metrics such as BLEU (Papineni et al., 2002) or METEOR (Denkowski and Lavie, 2011) to estimate the quality of these translations regardless of how they are produced.
The most frequently used metric is IBM BLEU (Papineni et al., 2002). It is easy to use, language-independent, fast and requires only the candidate and the reference translation. IBM BLEU is based on n-gram precision, matching the machine translation output against one or more reference translations. It accounts for adequacy through word (unigram) precision and for fluency through higher-order n-gram precision, combining the precisions in a geometric mean. Instead of recall, IBM BLEU uses a brevity penalty (BP).
Unlike IBM BLEU, METEOR evaluates a candidate translation by calculating precision and recall at the unigram level and combining them in a parametrized harmonic mean. The result of the harmonic mean is then scaled by a fragmentation penalty, which penalizes gaps and differences in word order.
For our investigation we applied METEOR to the human-translated text. Our intention is to test whether we can reproduce the observations from the experiments: is the setting of experiment 3 better than the setting of experiment 2? Therefore, METEOR is used here to investigate whether its scores correlate with our experimental results, not to evaluate the produced translations.
In experiment number 3 we have previously observed that the translators' performance was considerably better and that translators could translate each segment on average 33.77% faster than in experiment 2 and 52.82% faster than in experiment 1. The METEOR scores also show that experiment 3 achieved higher scores, which seems to indicate more suitable translations than in experiment number 2. Quality estimation is one of the aspects we would like to explore in future work.
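The description of BLEU above can be made concrete with a minimal sentence-level sketch. It is a simplification under stated assumptions: a single reference, whitespace tokenisation, uniform weights over 1- to 4-grams and no smoothing. The function names ngram_counts and simple_bleu are hypothetical, and the sketch is not the official BLEU implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Single-reference BLEU: geometric mean of clipped n-gram precisions
    multiplied by the brevity penalty. Returns 0.0 if any precision is zero."""
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty: only candidates not longer than the reference are penalised.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum)


print(simple_bleu("the editor is composed of the following elements",
                  "the editor is composed of the following components"))
```

The brevity penalty takes over the role of recall here: a candidate that is shorter than the reference is scaled down even if all of its n-grams are correct.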
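The following sketch illustrates the structure of a METEOR-style score as described above. It is a heavy simplification and not the official METEOR implementation: only exact word matches are aligned (no stemming, synonyms or paraphrases), the alignment is greedy, and the parameter values alpha, beta and gamma are illustrative assumptions rather than the tuned values of any METEOR release. The function names are hypothetical.

```python
def greedy_align(cand, ref):
    """Align candidate tokens to identical, unused reference tokens.

    Prefers the reference position directly after the previous match so that
    contiguous runs (chunks) are kept together where possible.
    """
    used, pairs, prev = set(), [], None
    for i, tok in enumerate(cand):
        nxt = prev + 1 if prev is not None else None
        if nxt is not None and nxt < len(ref) and nxt not in used and ref[nxt] == tok:
            j = nxt
        else:
            j = next((k for k, r in enumerate(ref) if r == tok and k not in used), None)
        if j is None:
            prev = None  # unmatched token breaks the current chunk
            continue
        used.add(j)
        pairs.append((i, j))
        prev = j
    return pairs


def simple_meteor(candidate: str, reference: str,
                  alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    """Parametrized harmonic mean of unigram precision and recall,
    scaled by a fragmentation penalty based on the number of chunks."""
    cand, ref = candidate.split(), reference.split()
    pairs = greedy_align(cand, ref)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision, recall = matches / len(cand), matches / len(ref)
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # A new chunk starts whenever two consecutive matches are not adjacent
    # in both the candidate and the reference.
    chunks = 1 + sum(
        1 for (i1, j1), (i2, j2) in zip(pairs, pairs[1:])
        if not (i2 == i1 + 1 and j2 == j1 + 1)
    )
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


print(simple_meteor("use the Edit Paste menu item", "use the menu item Edit Paste"))
```

With these settings, a candidate that contains the right words in a different order keeps a high harmonic mean but is scaled down by the fragmentation penalty, since its matches fall into more chunks.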

Conclusion
This paper is a first step towards the comparison of different TMs produced with MT output and their direct impact on human translation. Our study shows a substantial improvement in performance with the use of translation memories containing MT output used through commercial CAT software. To our knowledge this experimental setting has not been tested in similar studies, which makes our paper a new contribution to the study of translators' performance. Although the performance gain seems intuitive, the quantification of these aspects within a controlled experiment has not been substantially explored.
We opted for the use of a state-of-the-art commercial CAT tool as this is the real-world scenario that most translators face every day. In comparison to translating without a TM, translators were on average 28.87% faster using a modified TM and 52.82% faster using an unmodified one. Between the two TMs we observed that translators were on average 33.77% faster when using the unmodified TM. As previously mentioned, the order in which these tasks were carried out should also be taken into account. The performance boost of 33.77% when using a TM that is only 4.93% better is also an interesting outcome of our experiments that should be looked at in more detail.
Finally, in this paper we used METEOR scores to assess whether it is possible to correlate translations' speed, quality and TM coverage. The average score for experiment number 2 was 0.14 and for experiment number 3 was 0.41. Our initial analysis suggests that a relation between the two variables exists for our dataset. Whether this relation can be found in other scenarios is still an open question and we wish to investigate this variable more carefully in future work.

Future Work
We consider these experiments a pilot study that was carried out to provide us with a set of variables that we wish to investigate further. There are a number of aspects that we wish to look at in more detail in future work.
Future experiments include the aforementioned quality estimation analysis by applying state-of-the-art metrics used in machine translation. Using these metrics we would like to explore the extent to which it is possible to use automatic methods to study the interplay between quality and performance in computer-assisted translation. Furthermore, we would like to perform a qualitative analysis of the produced translations using human annotators and inter-annotator agreement (Carletta, 1996).
The performance boost observed between scenarios 2 and 3 should be looked at in more detail in future experiments. We would like to replicate these experiments using other TMs and explore this variable more carefully. Another aspect that we would like to explore in the future is the direct impact of the use of different CAT tools. Does the same TM combined with different CAT tools produce different results? When conducting these experiments, we observed that a simplified interface may speed up translators' work considerably.
Other directions that our work will take include controlling other variables not taken into account in this pilot study, such as the use of terminological databases, spelling correctors, etc. How and to what extent do they influence performance and quality? Finally, we would also like to use eye-tracking to analyse the focus of attention of translators, as was done in previous experiments (O'Brien, 2006).
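As an indication of how inter-annotator agreement could be quantified in such a qualitative analysis, the sketch below computes Cohen's kappa for two annotators, a chance-corrected agreement coefficient of the kind discussed by Carletta (1996). The choice of this particular coefficient, the function name cohen_kappa and the labels in the example are assumptions made for illustration; they are not data or decisions from this study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgements of four translated segments by two annotators.
annotator_1 = ["good", "bad", "good", "bad"]
annotator_2 = ["good", "bad", "bad", "bad"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A value of 1 indicates perfect agreement and a value of 0 indicates agreement no better than chance, which makes the coefficient easier to interpret than raw percentage agreement.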