Joint EM+/CNGL Workshop (2010)
Proceedings of the Second Joint EM+/CNGL Workshop: Bringing MT to the User: Research on Integrating MT in the Translation Industry
For a long time, machine translation and professional translation vendors have had a contentious relationship. However, new tools, computing platforms, and business models are changing the fundamentals of that relationship. I will review the main trends in the area, emphasizing both the past causes of failure and the main drivers of success.
The exploitation of large corpora to create and populate shared translation resources has been hampered in two areas: first, practical problems (“locked-in” data, ineffective exchange formats, client reservations); and second, ethical and legal problems. Recent developments, notably online collaborative translation environments (Desillets, 2007) and greater industry openness, might have been expected to highlight such issues. Yet the growing use of shared data is being addressed only gingerly. Good reasons lie behind the failure to broach the ethics of shared resources. The issues are challenging: confidentiality, ownership, copyright, authorial rights, attribution, the law, protectionism, costs, fairness, motivation, trust, quality, reliability. However, we argue that, though complex, these issues should not be swept under the carpet. The huge demand for translation cannot be met without intelligent sharing of resources (Kelly, 2009). Relevant ethical considerations have already been identified in translation and related domains, in such texts as Codes of Ethics, international conventions and declarations, and Codes of Professional Conduct; these can be useful here. We outline two case studies from current industry initiatives, highlighting their ethical implications. We identify questions which users and developers should be asking and relate these to existing debates and codes as a practical framework for their consideration.
The purpose of this work is to show how machine translation can be integrated into professional translation environments using two possible workflows. In the first workflow we demonstrate the real-time, sentence-by-sentence use of both rule-based and statistical machine translation systems with translation memory programs. In the second workflow we present a way of pre-translating entire translation projects with machine translation in advance. We also compare and discuss the efficiency of statistical and rule-based machine translation systems, and propose how these systems could be combined with translation memory technologies into a unified translation application.
We present two methods that merge ideas from statistical machine translation (SMT) and translation memories (TM). We use a TM to retrieve matches for source segments, and replace the mismatched parts with instructions to an SMT system to fill in the gap. We show that for fuzzy matches of over 70%, one method outperforms both SMT and TM baselines.
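The core idea above — retrieve a TM match, keep the parts it covers, and hand the mismatched spans to an SMT system — can be sketched on the source side with word-level sequence matching. This is a minimal illustration, not the paper's actual method: the segment strings, the `<SMT: …>` marker format, and the use of Python's `difflib` are all assumptions, and a real implementation would replace the corresponding spans in the TM *target* via word alignment.

```python
from difflib import SequenceMatcher

def fuzzy_score(src, tm_src):
    """Word-level similarity (0..1) between a new source segment and a TM source."""
    return SequenceMatcher(None, src.split(), tm_src.split()).ratio()

def gap_instructions(src, tm_src):
    """Keep the words the TM match covers; wrap uncovered spans as SMT requests."""
    a, b = src.split(), tm_src.split()
    out = []
    for tag, i1, i2, _, _ in SequenceMatcher(None, a, b).get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])        # covered by the TM match
        elif a[i1:i2]:                  # mismatched new-source words: SMT fills these
            out.append("<SMT: " + " ".join(a[i1:i2]) + ">")
    return " ".join(out)

# Toy segments (hypothetical, not from the paper):
src = "press the red button twice"
tm_src = "press the blue button twice"
if fuzzy_score(src, tm_src) >= 0.70:    # the fuzzy-match threshold reported above
    print(gap_instructions(src, tm_src))
```

Only matches scoring at or above the 70% threshold are worth patching; below that, the paper's baselines suggest falling back to full SMT.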
Although Machine Translation (MT) has been attracting more and more attention from the translation industry, the quality of current MT systems still requires human post-editing to ensure quality. Post-editing bad translations can take as long as, or longer than, translating without an MT system. It is well known, however, that the quality of an MT system is generally not homogeneous across all translated segments. In order to make MT more useful to the translation industry, it is therefore crucial to have a mechanism to judge MT quality at the segment level, so that bad translations can be kept out of the post-editing stage of the translation workflow. We describe an approach to estimate translation post-editing effort at the sentence level in terms of Human-targeted Translation Edit Rate (HTER), based on a number of features reflecting the difficulty of translating the source sentence and discrepancies between the source and translation sentences. HTER is a simple metric, and obtaining HTER-annotated data can be made part of the translation workflow. We show that this approach is more reliable at filtering out bad translations than other simple criteria commonly used in the translation industry, such as sentence length.
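HTER is indeed a simple metric: the number of edits needed to turn the MT output into its human post-edited version, normalised by the length of that post-edited reference. The sketch below uses plain word-level edit distance (insertions, deletions, substitutions); note that real TER/HTER also counts block shifts, which this simplification omits.

```python
def word_edits(hyp, ref):
    """Word-level Levenshtein distance between hypothesis and reference."""
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete
                          d[i][j - 1] + 1,       # insert
                          d[i - 1][j - 1] + cost)  # substitute / match
    return d[len(h)][len(r)]

def hter(mt_output, post_edited):
    """Edits to reach the human post-edition, per reference word (shifts ignored)."""
    return word_edits(mt_output, post_edited) / len(post_edited.split())
```

A segment whose predicted HTER exceeds some threshold would then be routed to translation from scratch rather than post-editing.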
This paper focuses on the relationship between source text characteristics (ambiguity, complexity and style compliance) and machine-translation post-editing effort (both temporal and technical). Post-editing data is collected in a traditional translation environment and subsequently plotted against textual scores produced by a range of systems. Our findings show strong correlations between ambiguity and complexity scores and technical post-editing effort, as well as a moderate correlation between one of the style guide compliance scores and temporal post-editing effort.
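The correlations reported above are standard Pearson coefficients between per-segment textual scores and effort measurements. As a minimal sketch — with entirely hypothetical numbers, not the paper's data — the computation looks like:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative per-segment values only (assumed, not from the study):
complexity_scores = [0.2, 0.5, 0.7, 0.9]   # source-side complexity score
keystrokes = [12, 30, 41, 55]              # technical post-editing effort
print(round(pearson(complexity_scores, keystrokes), 3))
```

A coefficient near 1 or -1 would count as a strong correlation, values around 0.3-0.5 as moderate, in line with the paper's characterisation of its results.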
This paper describes our work on building and employing Statistical Machine Translation systems for TV subtitles in Scandinavia. We have built translation systems for Danish, English, Norwegian and Swedish. They are used in daily subtitle production and translate large volumes. As an example, we report our evaluation results for three TV genres. We discuss lessons learned during system development, which shed interesting light on the practical use of Machine Translation technology.