Ondřej Bojar - ACL Anthology

Ondřej Bojar

Also published as: Ondrej Bojar

2026

Thesis proposal: Are We Losing Textual Diversity to Natural Language Processing?
Josef Jon | Ondřej Bojar
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

This thesis argues that the currently widely used Natural Language Processing algorithms possibly have various limitations related to the properties of the texts they handle and produce. With the wide adoption of these tools in rapid progress, we must ask what these limitations are and what are the possible implications of integrating such tools into our daily lives.As a testbed, we have chosen the task of Neural Machine Translation (NMT). Nevertheless, we aim for general insights and outcomes, applicable to current Large Language Models (LLMs). We ask whether the algorithms used in NMT have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of untypical texts, thereby contributing to a cycle of monotonous, repetitive language – whether generated by machines or humans.To explore this hypothesis, we define a set of measures to quantify text diversity based on its statistical properties, like uniformity or rhythmicity of word-level surprisal, on multiple scales (sentence, discourse, language). We conduct a series of experiments to investigate whether NMT systems struggle with maintaining the diversity of such texts, potentially reducing the richness of the generated language, compared to human translators.We further analyze potential origins of these limitations within existing training objectives and decoding strategies. Ultimately, our goal is to propose and validate alternative approaches (e.g., loss functions, decoding algorithms) that maintain the diversity and complexity of language and that allow for better global planning of the output generation, enabling the models to better reflect the ambiguities inherent in human communication.

Current state of LLMs for Arabic dialectal machine translation
Josef Jon | Rawan Bondok | Ondřej Bojar
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

This work presents an evaluation of large language models (LLMs) for English to dialectal Arabic machine translation on the MADAR dataset. We evaluate both translation directions (English to Arabic and vice-versa) on 16 Arabic dialects. Our experiments cover a diverse set of models, including specialized Arabic models (Jais, Nile), multilingual models (Gemma, Command-R, Mistral, Aya), and commercial APIs (GPT-4.1). We employ multiple evaluation metrics: BLEU, CHRF, COMET (both reference-based and reference-less variants) and GEMBA (LLM-as-a-judge), as well as a small-scale manual evaluation, to assess translation quality. We discuss the challenges of automatic MT evaluation, especially in the context of Arabic dialects. We also evaluate the ability of LLMs to classify the dialect used in a text. The study offers insights into the capabilities and limitations of current LLMs for dialectal Arabic machine translation, particularly highlighting the difficulty of handling dialectal diversity, although the results may be influenced by possible training data contamination, which is always a concern with LLMs.

Machine Translation for Low-Resource Languages through Monolingual Data and LLM: A Case Study of English-to-Basque
Nam Luu | Aitor Soroa | German Rigau | Ondřej Bojar
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Developing a machine translation (MT) system requires a considerable amount of high-quality parallel data, which is often limited for low-resource languages. This paper explores the use of synthetic data for training an LLM-based MT system in the English-to-Basque direction. Using Basque monolingual corpora as a starting point, we apply back-translation to generate parallel corpora, taking advantage of the fact that current LLMs do not translate well from English to Basque, but they yield an acceptable performance in the reverse direction. We conduct experiments in a multi-stage approach, from a simple Supervised Fine-tuning (SFT) step, to preference learning with the Direct Preference Optimization (DPO) technique. We then evaluate the approach with both automatic metrics and manual assessment. Experimental results suggest that for this task, SFT brings a clear improvement in translation quality, while DPO only yields marginal enhancement.

2025

Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn’t Help with MT Evaluation
Petra Barančíková | Ondřej Bojar
Proceedings of Machine Translation Summit XX: Volume 1

In this paper, we compare Czech-specific and multilingual sentence embedding models through intrinsic and extrinsic evaluation paradigms. For intrinsic evaluation, we employ Costra, a complex sentence transformation dataset, and several Semantic Textual Similarity (STS) benchmarks to assess the ability of the embeddings to capture linguistic phenomena such as semantic similarity, temporal aspects, and stylistic variations. In the extrinsic evaluation, we fine-tune each embedding model using COMET-based metrics for machine translation evaluation. Our experiments reveal an interesting disconnect: models that excel in intrinsic semantic similarity tests do not consistently yield superior performance on downstream translation evaluation tasks. Conversely, models with seemingly over-smoothed embedding spaces can, through fine-tuning, achieve excellent results. These findings highlight the complex relationship between semantic property probes and downstream task, emphasizing the need for more research into “operationalizable semantics” in sentence embeddings, or more in-depth downstream tasks datasets (here translation evaluation).

Finetuning LLMs for EvaCun 2025 token prediction shared task
Josef Jon | Ondřej Bojar
Proceedings of the Second Workshop on Ancient Language Processing

In this paper, we present our submission for the token prediction task of EvaCun 2025. Our sys-tems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only pos-sess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare 3 different approaches (based on 3 different prompts) of obtaining the predictions, and we evaluate them on a held-out part of the data.

How “Real” is Your Real-Time Simultaneous Speech-to-Text Translation System?
Sara Papi | Peter Polák | Dominik Macháček | Ondřej Bojar
Transactions of the Association for Computational Linguistics, Volume 13

Simultaneous speech-to-text translation (SimulST) translates source-language speech into target-language text concurrently with the speaker’s speech, ensuring low latency for better user comprehension. Despite its intended application to unbounded speech, most research has focused on human pre-segmented speech, simplifying the task and overlooking significant challenges. This narrow focus, coupled with widespread terminological inconsistencies, is limiting the applicability of research outcomes to real-world applications, ultimately hindering progress in the field. Our extensive literature review of 110 papers not only reveals these critical issues in current research but also serves as the foundation for our key contributions. We: 1) define the steps and core components of a SimulST system, proposing a standardized terminology and taxonomy; 2) conduct a thorough analysis of community trends; and 3) offer concrete recommendations and future directions to bridge the gaps in existing literature, from evaluation frameworks to system architectures, for advancing the field towards more realistic and effective SimulST solutions.

CUNI-NL@IWSLT 2025: End-to-end Offline Speech Translation and Instruction Following with LLMs
Nam Luu | Ondřej Bojar
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

This paper describes the CUNI-NL team’s submission to the IWSLT 2025 Offline Speech Translation and Instruction Following tasks, focusing on transcribing the English audio, and translating the English audio to German text. Our systems follow the end-to-end approach, where each system consists of a pretrained, frozen speech encoder, along with a medium-sized large language model fine-tuned with LoRA on three tasks: 1) transcribing the English audio; 2) directly translating the English audio to German text; and 3) a combination of the above two tasks, i.e. simultaneously transcribing the English audio and translating the English audio to German text.

CUNI and Phrase at WMT25 MT Evaluation Task
Miroslav Hrabal | Ondrej Glembek | Aleš Tamchyna | Almut Silja Hildebrand | Alan Eckhard | Miroslav Štola | Sergio Penkale | Zuzana Šimečková | Ondřej Bojar | Alon Lavie | Craig Stewart
Proceedings of the Tenth Conference on Machine Translation

This paper describes the joint effort of Phrase a.s. and Charles University’sInstitute of Formal and Applied Linguistics (CUNI/UFAL) on the WMT25Automated Translation Quality Evaluation Systems Shared Task. Both teamsparticipated both in a collaborative and competitive manner, i.e. they eachsubmitted a system of their own as well as a contrastive joint system ensemble.In Task~1, we show that such an ensembling—if chosen in a clever way—canlead to a performance boost. We present the analysis of various kinds ofsystems comprising both “traditional” NN-based approach, as well as differentflavours of LLMs—off-the-shelf commercial models, their fine-tuned versions,but also in-house, custom-trained alternative models. In Tasks~2 and~3 we showPhrase’s approach to tackling the tasks via various GPT models: Error SpanAnnotation via the complete MQM solution using non-reasoning models (includingfine-tuned versions) in Task~2, and using reasoning models in Task~3.

MockConf: A Student Interpretation Dataset: Analysis, Word- and Span-level Alignment and Baselines
Dávid Javorský | Ondřej Bojar | François Yvon
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In simultaneous interpreting, an interpreter renders the speech into another language with a very short lag, much sooner than sentences are finished. In order to understand and later reproduce this dynamic and complex task automatically, we need specialized datasets and tools for analysis, monitoring, and evaluation, such as parallel speech corpora, and tools for their automatic annotation. Existing parallel corpora of translated texts and associated alignment algorithms hardly fill this gap, as they fail to model long-range interactions between speech segments or specific types of divergences (e.g. shortening, simplification, functional generalization) between the original and interpreted speeches. In this work, we develop and explore MockConf, a student interpretation dataset that was collected from Mock Conferences run as part of the students’ curriculum. This dataset contains 7 hours of recordings in 5 European languages, transcribed and aligned at the level of spans and words. We further implement and release InterAlign, a modern web-based annotation tool for parallel word and span annotations on long inputs, suitable for aligning simultaneous interpreting. We propose metrics for the evaluation and a baseline for automatic alignment. Dataset and tools will be released to the community.

This paper presents the results of the General Machine Translation Task organized as part of the 2025 Conference on Machine Translation (WMT). Participants were invited to build systems for any of 30 language pairs. For half of these pairs, we conducted a human evaluation on test sets spanning four to five different domains.We evaluated 60 systems in total: 36 submitted by participants and 24 for which we collected translations from large language models (LLMs) and popular online translation providers.This year, we focused on creating challenging test sets by developing a difficulty sampling technique and using more complex source data. We evaluated system outputs with professional annotators using the Error Span Annotation (ESA) protocol, except for two language pairs, for which we used Multidimensional Quality Metrics (MQM) instead.We continued the trend of increasingly moving towards document-level translation, providing the source texts as whole documents containing multiple paragraphs.

Findings of the IWSLT 2025 Evaluation Campaign
Idris Abdulmumin | Victor Agostinelli | Tanel Alumäe | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Fethi Bougares | Roldano Cattoni | Mauro Cettolo | Lizhong Chen | William Chen | Raj Dabre | Yannick Estève | Marcello Federico | Mark Fishel | Marco Gaido | Dávid Javorský | Marek Kasztelnik | Fortuné Kponou | Mateusz Krubiński | Tsz Kin Lam | Danni Liu | Evgeny Matusov | Chandresh Kumar Maurya | John P. McCrae | Salima Mdhaffar | Yasmin Moslem | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Atul Kr. Ojha | John E. Ortega | Sara Papi | Pavel Pecina | Peter Polák | Piotr Połeć | Ashwin Sankar | Beatrice Savoldi | Nivedita Sethiya | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Brian Thompson | Marco Turchi | Alex Waibel | Patrick Wilken | Rodolfo Zevallos | Vilém Zouhar | Maike Züfle
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

This paper presents the outcomes of the shared tasks conducted at the 22nd International Workshop on Spoken Language Translation (IWSLT). The workshop addressed seven critical challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, model compression, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks garnered significant participation, with 32 teams submitting their runs. The field’s growing importance is reflected in the increasing diversity of shared task organizers and contributors to this overview paper, representing a balanced mix of industrial and academic institutions. This broad participation demonstrates the rising prominence of spoken language translation in both research and practical applications.

OVQA: A Dataset for Visual Question Answering and Multimodal Research in Odia Language
Shantipriya Parida | Shashikanta Sahoo | Sambit Sekhar | Kalyanamalini Sahoo | Ketan Kotwal | Sonal Khosla | Satya Ranjan Dash | Aneesh Bose | Guneet Singh Kohli | Smruti Smita Lenka | Ondřej Bojar
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages

This paper introduces OVQA, the first multimodal dataset designed for visual question-answering (VQA), visual question elicitation (VQE), and multimodal research for the low-resource Odia language. The dataset was created by manually translating 6,149 English question-answer pairs, each associated with 6,149 unique images from the Visual Genome dataset. This effort resulted in 27,809 English-Odia parallel sentences, ensuring a semantic match with the corresponding visual information. Several baseline experiments were conducted on the dataset, including visual question answering and visual question elicitation. The dataset is the first VQA dataset for the low-resource Odia language and will be released for multimodal research purposes and also help researchers extend for other low-resource languages.

LLM Compression: How Far Can We Go in Balancing Size and Performance?
Sahil Sk | Debashish Dhal | Sonal Khosla | Akash Dhaka | Shantipriya Parida | Sk Shahid | Sambit Shekhar | Dilip Prasad | Ondrej Bojar
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

Quantization is an essential and popular technique for improving the accessibility of large language models (LLMs) by reducing memory usage and computational costs while maintaining performance. In this study, we apply 4-bit Group Scaling Quantization (GSQ) and Generative Pretrained Transformer Quantization (GPTQ) to LLaMA 1B, Qwen 0.5B, and PHI 1.5B, evaluating their impact across multiple NLP tasks. We benchmark these models on MS MARCO (Information Retrieval), BoolQ (Boolean Question Answering), and GSM8K (Mathematical Reasoning) datasets, assessing both accuracy and efficiency accross various tasks. The study measures the trade-offs between model compression and task performance, analyzing key evaluation metrics namely: accuracy, inference latency, and throughput, providing insights into the suitability of low-bit quantization for real-world deployment and highlight the tradeoffs between memory, computing and latency in such settings, helping a user make suitable decisions

CUNI at WMT25 General Translation Task
Miroslav Hrabal | Josef Jon | Martin Popel | Ondřej Bojar
Proceedings of the Tenth Conference on Machine Translation

This paper describes the CUNI submissions to the WMT25 General Translation task, namely for the English to Czech, English to Serbian, Czech to German and Czech to Ukrainian language pairs. We worked in multiple teams, each with a different approach, spanning from traditional, smaller Transformer NMT models trained on both sentence and document level, to fine-tuning LLMs using LoRA and CPO. We show that these methods are effective in improving automatic MT evaluation scores compared to the base pretrained models.

Findings of WAT2025 English-to-Indic Multimodal Translation Task
Shantipriya Parida | Ondřej Bojar
Proceedings of the Twelfth Workshop on Asian Translation (WAT 2025)

This paper presents the findings of the English-to-Indic Multimodal Translation shared task from the Workshop on Asian Translation (WAT2025). The task featured three tracks: text-only translation, image captioning, and multimodal translation across four low-resource Indic languages: Hindi, Bengali, Malayalam, and Odia. Three teams participated, submitting systems that achieved competitive performance, with BLEU scores ranging from 40.1 to 64.3 across different language pairs and tracks.

Prompting LLMs: Length Control for Isometric Machine Translation
Dávid Javorský | Ondřej Bojar | François Yvon
Proceedings of the 22nd International Conference on Spoken Language Translation (IWSLT 2025)

In this study, we explore the effectiveness of isometric machine translation across multiple language pairs (EnoDe, EnoFr, and EnoEs) under the conditions of the IWSLT Isometric Shared Task 2022. Using eight open-source large language models (LLMs) of varying sizes, we investigate how different prompting strategies, varying numbers of few-shot examples, and demonstration selection influence translation quality and length control. We discover that the phrasing of instructions, when aligned with the properties of the provided demonstrations, plays a crucial role in controlling the output length. Our experiments show that LLMs tend to produce shorter translations only when presented with extreme examples, while isometric demonstrations often lead to the models disregarding length constraints. While few-shot prompting generally enhances translation quality, further improvements are marginal across 5, 10, and 20-shot settings. Finally, considering multiple outputs allows to notably improve overall tradeoff between the length and quality, yielding state-of-the-art performance for some language pairs.

2024

FINDINGS OF THE IWSLT 2024 EVALUATION CAMPAIGN
Ibrahim Said Ahmad | Antonios Anastasopoulos | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | William Chen | Qianqian Dong | Marcello Federico | Barry Haddow | Dávid Javorský | Mateusz Krubiński | Tsz Kin Lam | Xutai Ma | Prashant Mathur | Evgeny Matusov | Chandresh Maurya | John P. McCrae | Kenton Murray | Satoshi Nakamura | Matteo Negri | Jan Niehues | Xing Niu | Atul Kr. Ojha | John Ortega | Sara Papi | Peter Polák | Adam Pospíšil | Pavel Pecina | Elizabeth Salesky | Nivedita Sethiya | Balaram Sarkar | Jiatong Shi | Claytone Sikasote | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Brian Thompson | Alex Waibel | Shinji Watanabe | Patrick Wilken | Petr Zemánek | Rodolfo Zevallos
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

This paper reports on the shared tasks organized by the 21st IWSLT Conference. The shared tasks address 7 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, dialect and low-resource speech translation, and Indic languages. The shared tasks attracted 17 teams whose submissions are documented in 27 system papers. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

Unveiling Semantic Information in Sentence Embeddings
Leixin Zhang | David Burian | Vojtěch John | Ondřej Bojar
Proceedings of the Fifth International Workshop on Designing Meaning Representations @ LREC-COLING 2024

This study evaluates the extent to which semantic information is preserved within sentence embeddings generated from state-of-art sentence embedding models: SBERT and LaBSE. Specifically, we analyzed 13 semantic attributes in sentence embeddings. Our findings indicate that some semantic features (such as tense-related classes) can be decoded from the representation of sentence embeddings. Additionally, we discover the limitation of the current sentence embedding models: inferring meaning beyond the lexical level has proven to be difficult.

An Analysis of Surprisal Uniformity in Machine and Human Translations
Josef Jon | Ondřej Bojar
Proceedings of the 1st Workshop on Creative-text Translation and Technology

This study examines neural machine translation (NMT) and its performance on texts that diverege from typical standards, focusing on how information is organized within sentences. We analyze surprisal distributions in source texts, human translations, and machine translations across several datasets to determine if NMT systems naturally promote a uniform density of surprisal in their translations, even when the original texts do not adhere to this principle.The findings reveal that NMT tends to align more closely with source texts in terms of surprisal uniformity compared to human translations.We analyzed absolute values of the surprisal uniformity measures as well, expecting that human translations will be less uniform. In contradiction to our initial hypothesis, we did not find comprehensive evidence for this claim, with some results suggesting this might be the case for very diverse texts, like poetry.

Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation
Matthias Sperber | Ondřej Bojar | Barry Haddow | Dávid Javorský | Xutai Ma | Matteo Negri | Jan Niehues | Peter Polák | Elizabeth Salesky | Katsuhito Sudoh | Marco Turchi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take the first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and scores well-correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET as a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step systems. We release the collected human-annotated data in order to encourage further investigation.

Findings of WMT2024 English-to-Low Resource Multimodal Translation Task
Shantipriya Parida | Ondřej Bojar | Idris Abdulmumin | Shamsuddeen Hassan Muhammad | Ibrahim Said Ahmad
Proceedings of the Ninth Conference on Machine Translation

This paper presents the results of the English-to-Low Resource Multimodal Translation shared tasks from the Ninth Conference on Machine Translation (WMT2024). This year, 7 teams submitted their translation results for the automatic and human evaluation.

Quality and Quantity of Machine Translation References for Automatic Metrics
Vilém Zouhar | Ondřej Bojar
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024

Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average (or maximum) helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success. These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.

GAATME: A Genetic Algorithm for Adversarial Translation Metrics Evaluation
Josef Jon | Ondřej Bojar
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Building on a recent method for decoding translation candidates from a Machine Translation (MT) model via a genetic algorithm, we modify it to generate adversarial translations to test and challenge MT evaluation metrics. The produced translations score very well in an arbitrary MT evaluation metric selected beforehand, despite containing serious, deliberately introduced errors. The method can be used to create adversarial test sets to analyze the biases and shortcomings of the metrics. We publish various such test sets for the Czech to English language pair, as well as the code to convert any parallel data into a similar adversarial test set.

Human and Machine: Language Processing in Translation Tasks
Hening Wang | Leixin Zhang | Ondrej Bojar
Proceedings of the 7th International Conference on Natural Language and Speech Processing (ICNLSP 2024)

This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).

Khan Academy Corpus: A Multilingual Corpus of Khan Academy Lectures
Dominika Ďurišková | Daniela Jurášová | Matúš Žilinec | Eduard Šubert | Ondřej Bojar
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the Khan Academy Corpus totalling 10122 hours in 87394 recordings across 29 languages, where 43% of recordings (4252 hours) are equipped with human-written subtitles. The subtitle texts cover a total of 137 languages. The dataset was collected from open access Khan Academy lectures, benefiting from their manual transcripts and manual translations of the transcripts. The dataset can serve in creation or evaluation of multilingual speech recognition or translation systems, featuring a diverse set of subject domains.

CUNI at WMT24 General Translation Task: LLMs, (Q)LoRA, CPO and Model Merging
Miroslav Hrabal | Josef Jon | Martin Popel | Nam Luu | Danil Semin | Ondřej Bojar
Proceedings of the Ninth Conference on Machine Translation

This paper presents the contributions of Charles University teams to the WMT24 General Translation task (English to Czech, German and Russian, and Czech to Ukrainian), and the WMT24 Translation into Low-Resource Languages of Spain task.Our most elaborate submission, CUNI-MH for en2cs, is the result of fine-tuning Mistral 7B v0.1 for translation using a three-stage process: Supervised fine-tuning using QLoRA, Contrastive Preference Optimization, and merging of model checkpoints. We also describe the CUNI-GA, CUNI-Transformer and CUNI-DocTransformer submissions, which are based on our systems from the previous year.Our en2ru system CUNI-DS uses a similar first stage as CUNI-MH (QLoRA for en2cs) and follows with transferring to en2ru.For en2de (CUNI-NL), we experimented with a LLM-based speech translation system, to translate without the speech input.For the Translation into Low-Resource Languages of Spain task, we performed QLoRA fine-tuning of a large LLM on a small amount of synthetic (backtranslated) data.

2023

Assessing Word Importance Using Models Trained for Semantic Tasks
Dávid Javorský | Ondřej Bojar | François Yvon
Findings of the Association for Computational Linguistics: ACL 2023

Many NLP tasks require to automatically identify the most significant words in a text. In this work, we derive word significance from models trained to solve semantic task: Natural Language Inference and Paraphrase Identification. Using an attribution method aimed to explain the predictions of these models, we derive importance scores for each input token. We evaluate their relevance using a so-called cross-task evaluation: Analyzing the performance of one model on an input masked according to the other model’s weight, we show that our method is robust with respect to the choice of the initial task. Additionally, we investigate the scores from the syntax point of view and observe interesting patterns, e.g. words closer to the root of a syntactic tree receive higher importance scores. Altogether, these observations suggest that our method can be used to identify important words in sentences without any explicit word importance labeling in training.

Character-level NMT and language similarity
Josef Jon | Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

We explore the effectiveness of character-level neural machine translation using Transformer architecture for various levels of language similarity and size of the training dataset. We evaluate the models using automatic MT metrics and show that translation between similar languages benefits from character-level input segmentation, while for less related languages, character-level vanilla Transformer-base often lags behind subword-level segmentation. We confirm previous findings that it is possible to close the gap by finetuning the already trained subword-level models to character-level.

Overview of the 10th Workshop on Asian Translation
Toshiaki Nakazawa | Kazutaka Kinugawa | Hideya Mino | Isao Goto | Raj Dabre | Shohei Higashiyama | Shantipriya Parida | Makoto Morishita | Ondřej Bojar | Akiko Eriguchi | Yusuke Oda | Chenhui Chu | Sadao Kurohashi
Proceedings of the 10th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 10th workshop on Asian translation (WAT2023). For the WAT2023, 2 teams submitted their translation results for the human evaluation. We also accepted 1 research paper. About 40 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation
Josef Jon | Ondřej Bojar
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a genetic algorithm (GA) based method for modifying n-best lists produced by a machine translation (MT) system. Our method offers an innovative approach to improving MT quality and identifying weaknesses in evaluation metrics. Using common GA operations (mutation and crossover) on a list of hypotheses in combination with a fitness function (an arbitrary MT metric), we obtain novel and diverse outputs with high metric scores. With a combination of multiple MT metrics as the fitness function, the proposed method leads to an increase in translation quality as measured by other held-out automatic metrics.With a single metric (including popular ones such as COMET) as the fitness function, we find blind spots and flaws in the metric. This allows for an automated search for adversarial examples in an arbitrary metric, without prior assumptions on the form of such example. As a demonstration of the method, we create datasets of adversarial examples and use them to show that reference-free COMET is substantially less robust than the reference-based version.

FINDINGS OF THE IWSLT 2023 EVALUATION CAMPAIGN
Milind Agarwal | Sweta Agrawal | Antonios Anastasopoulos | Luisa Bentivogli | Ondřej Bojar | Claudia Borg | Marine Carpuat | Roldano Cattoni | Mauro Cettolo | Mingda Chen | William Chen | Khalid Choukri | Alexandra Chronopoulou | Anna Currey | Thierry Declerck | Qianqian Dong | Kevin Duh | Yannick Estève | Marcello Federico | Souhir Gahbiche | Barry Haddow | Benjamin Hsu | Phu Mon Htut | Hirofumi Inaguma | Dávid Javorský | John Judge | Yasumasa Kano | Tom Ko | Rishu Kumar | Pengwei Li | Xutai Ma | Prashant Mathur | Evgeny Matusov | Paul McNamee | John P. McCrae | Kenton Murray | Maria Nadejde | Satoshi Nakamura | Matteo Negri | Ha Nguyen | Jan Niehues | Xing Niu | Atul Kr. Ojha | John E. Ortega | Proyag Pal | Juan Pino | Lonneke van der Plas | Peter Polák | Elijah Rippeth | Elizabeth Salesky | Jiatong Shi | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Yun Tang | Brian Thompson | Kevin Tran | Marco Turchi | Alex Waibel | Mingxuan Wang | Shinji Watanabe | Rodolfo Zevallos
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

This paper reports on the shared tasks organized by the 20th IWSLT Conference. The shared tasks address 9 scientific challenges in spoken language translation: simultaneous and offline translation, automatic subtitling and dubbing, speech-to-speech translation, multilingual, dialect and low-resource speech translation, and formality control. The shared tasks attracted a total of 38 submissions by 31 teams. The growing interest towards spoken language translation is also witnessed by the constantly increasing number of shared task organizers and contributors to the overview paper, almost evenly distributed across industry and academia.

CUNI at WMT23 General Translation Task: MT and a Genetic Algorithm
Josef Jon | Martin Popel | Ondřej Bojar
Proceedings of the Eighth Conference on Machine Translation

This paper presents the contributions of Charles University teams to the WMT23 General translation task (English to Czech and Czech to Ukrainian translation directions). Our main submission, CUNI-GA, is a result of applying a novel n-best list reranking and modification method on translation candidates produced by the two other submitted systems, CUNI-Transformer and CUNI-DocTransformer (document-level translation only used for the en → cs direction). Our method uses a genetic algorithm and MBR decoding to search for optimal translation under a given metric (in our case, a weighted combination of ChrF, BLEU, COMET22-DA, and COMET22-QE-DA). Our submissions are first in the constrained track and show competitive performance against top-tier unconstrained systems across various automatic metrics.

This paper presents the results of the General Machine Translation Task organised as part of the 2023 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 8 language pairs (corresponding to 14 translation directions), to be evaluated on test sets consisting of up to four different domains. We evaluate system outputs with professional human annotators using a combination of source-based Direct Assessment and scalar quality metric (DA+SQM).

Boosting Unsupervised Machine Translation with Pseudo-Parallel Data
Ivana Kvapilíková | Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Even with the latest developments in deep learning and large-scale language modeling, the task of machine translation (MT) of low-resource languages remains a challenge. Neural MT systems can be trained in an unsupervised way without any translation resources but the quality lags behind, especially in truly low-resource conditions. We propose a training strategy that relies on pseudo-parallel sentence pairs mined from monolingual corpora in addition to synthetic sentence pairs back-translated from monolingual corpora. We experiment with different training schedules and reach an improvement of up to 14.5 BLEU points (English to Ukrainian) over a baseline trained on back-translated data only.

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks
Sunit Bhattacharya | Ondřej Bojar
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then combine the output from the ‘memories’ of the keys to generate predictions about the next token. This leads to an incremental process of prediction that gradually converges towards the final token choice near the output layers. This interesting perspective raises questions about how multilingual models might leverage this mechanism. Specifically, for autoregressive models trained on two or more languages, do all neurons (across layers) respond equally to all languages? No! Our hypothesis centers around the notion that during pre-training, certain model parameters learn strong language-specific features, while others learn more language-agnostic (shared across languages) features. To validate this, we conduct experiments utilizing parallel corpora of two languages that the model was initially pre-trained on. Our findings reveal that the layers closest to the network’s input or output tend to exhibit more language-specific behaviour compared to the layers in the middle.

Team Iterate @ AutoMin 2023 - Experiments with Iterative Minuting
František Kmječ | Ondřej Bojar
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

This report describes the development of our system for automatic minuting created for the AutoMin 2023 Task A. As a baseline, we utilize a system based on the BART encoder-decoder model paired with a preprocessing pipeline similar to the one introduced by the winning solutions at AutoMin 2021. We then further explore the possibilities for iterative summarization by constructing an iterative minuting dataset from the provided data, finetuning on it and feeding the model previously generated minutes. We also experiment with adding more context by utilizing the Longformer encoder-decoder model and finetuning it on the SAMSum dataset. Our submitted solution is of the baseline approach, since we were unable to match its performance with our iterative variants. With the baseline, we achieve a ROUGE-1 score of 0.368 on the ELITR minuting corpus development set. We finally explore the performance of Vicuna 13B quantized language model for summarization.

HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language
Shantipriya Parida | Idris Abdulmumin | Shamsuddeen Hassan Muhammad | Aneesh Bose | Guneet Singh Kohli | Ibrahim Said Ahmad | Ketan Kotwal | Sayan Deb Sarkar | Ondřej Bojar | Habeebah Kakudi
Findings of the Association for Computational Linguistics: ACL 2023

This paper presents “HaVQA”, the first multimodal dataset for visual question answering (VQA) tasks in the Hausa language. The dataset was created by manually translating 6,022 English question-answer pairs, which are associated with 1,555 unique images from the Visual Genome dataset. As a result, the dataset provides 12,044 gold standard English-Hausa parallel sentences that were translated in a fashion that guarantees their semantic match with the corresponding visual information. We conducted several baseline experiments on the dataset, including visual question answering, visual question elicitation, text-only and multimodal machine translation.

Proceedings of the 10th Workshop on Asian Translation
Toshiaki Nakazawa | Kazutaka Kinugawa | Hideya Mino | Isao Goto | Raj Dabre | Shohei Higashiyama | Shantipriya Parida | Makoto Morishita | Ondřej Bojar | Akiko Eriguchi | Yusuke Oda | Chenhui Chu | Sadao Kurohashi
Proceedings of the 10th Workshop on Asian Translation

Robustness of Multi-Source MT to Transcription Errors
Dominik Macháček | Peter Polák | Ondřej Bojar | Raj Dabre
Findings of the Association for Computational Linguistics: ACL 2023

Automatic speech translation is sensitive to speech recognition errors, but in a multilingual scenario, the same content may be available in various languages via simultaneous interpreting, dubbing or subtitling. In this paper, we hypothesize that leveraging multiple sources will improve translation quality if the sources complement one another in terms of correct information they contain. To this end, we first show that on a 10-hour ESIC corpus, the ASR errors in the original English speech and its simultaneous interpreting into German and Czech are mutually independent. We then use two sources, English and German, in a multi-source setting for translation into Czech to establish its robustness to ASR errors. Furthermore, we observe this robustness when translating both noisy sources together in a simultaneous translation setting. Our results show that multi-source neural machine translation has the potential to be useful in a real-time simultaneous translation setting, thereby motivating further investigation in this area.

Towards Efficient Simultaneous Speech Translation: CUNI-KIT System for Simultaneous Track at IWSLT 2023
Peter Polák | Danni Liu | Ngoc-Quan Pham | Jan Niehues | Alexander Waibel | Ondřej Bojar
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

In this paper, we describe our submission to the Simultaneous Track at IWSLT 2023. This year, we continue with the successful setup from the last year, however, we adopt the latest methods that further improve the translation quality. Additionally, we propose a novel online policy for attentional encoder-decoder models. The policy prevents the model to generate translation beyond the current speech input by using an auxiliary CTC output layer. We show that the proposed simultaneous policy can be applied to both streaming blockwise models and offline encoder-decoder models. We observe significant improvements in quality (up to 1.1 BLEU) and the computational footprint (up to 45% relative RTF).

Bad MT Systems are Good for Quality Estimation
Iryna Tryhubyshyn | Aleš Tamchyna | Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Quality estimation (QE) is the task of predicting quality of outputs produced by machine translation (MT) systems. Currently, the highest-performing QE systems are supervised and require training on data with golden quality scores. In this paper, we investigate the impact of the quality of the underlying MT outputs on the performance of QE systems. We find that QE models trained on datasets with lower-quality translations often outperform those trained on higher-quality data. We also demonstrate that good performance can be achieved by using a mix of data from different MT systems.

Overview of the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023
Tirthankar Ghosal | Ondřej Bojar | Marie Hledíková | Tom Kocmi | Anna Nedoluzhko
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

In this article, we report the findings of the second shared task on Automatic Minuting (AutoMin) held as a Generation Challenge at the 16th International Natural Language Generation (INLG) Conference 2023. The second Automatic Minuting shared task is a successor to the first AutoMin which took place in 2021. The primary objective of the AutoMin shared task is to garner participation of the speech and natural language processing and generation community to create automatic methods for generating minutes from multi-party meetings. Five teams from diverse backgrounds participated in the shared task this year. A lot has changed in the Generative AI landscape since the last AutoMin especially with the emergence and wide adoption of Large Language Models (LLMs) to different downstream tasks. Most of the contributions are based on some form of an LLM and we are also adding current outputs of GPT4 as a benchmark. Furthermore, we examine the applicability of GPT-4 for automatic scoring of minutes. Compared to the previous instance of AutoMin, we also add another domain, the minutes for EU Parliament sessions, and we experiment with a more fine-grained manual evaluation. More details on the event can be found at https://ufal.github.io/automin-2023/.

MT Metrics Correlate with Human Ratings of Simultaneous Speech Translation
Dominik Macháček | Ondřej Bojar | Raj Dabre
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

There have been several meta-evaluation studies on the correlation between human ratings and offline machine translation (MT) evaluation metrics such as BLEU, chrF2, BertScore and COMET. These metrics have been used to evaluate simultaneous speech translation (SST) but their correlations with human ratings of SST, which has been recently collected as Continuous Ratings (CR), are unclear. In this paper, we leverage the evaluations of candidate systems submitted to the English-German SST task at IWSLT 2022 and conduct an extensive correlation analysis of CR and the aforementioned metrics. Our study reveals that the offline metrics are well correlated with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. We conclude that given the current quality levels of SST, these metrics can be used as proxies for CR, alleviating the need for large scale human evaluation. Additionally, we observe that correlations of the metrics with translation as a reference is significantly higher than with simultaneous interpreting, and thus we recommend the former for reliable evaluation.

The Role of Compounds in Human vs. Machine Translation Quality
Kristyna Neumannova | Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

We focus on the production of German compounds in English-to-German manual and automatic translation. On the example of WMT21 news translation test set, we observe that even the best MT systems produce much fewer compounds compared to three independent manual translations. Despite this striking difference, we observe that this insufficiency is not apparent in manual evaluation methods that target the overall translation quality (DA and MQM). Simple automatic methods like BLEU somewhat surprisingly provide a better indication of this quality aspect. Our manual analysis of system outputs, including our freshly trained Transformer models, confirms that current deep neural systems operating at the level of subword units are capable of constructing novel words, including novel compounds. This effect however cannot be measured using static dictionaries of compounds such as GermaNet. German compounds thus pose an interesting challenge for future development of MT systems.

Low-Resource Machine Translation Systems for Indic Languages
Ivana Kvapilíková | Ondřej Bojar
Proceedings of the Eighth Conference on Machine Translation

We present our submission to the WMT23 shared task in translation between English and Assamese, Khasi, Mizo and Manipuri. All our systems were pretrained on the task of multilingual masked language modelling and denoising auto-encoding. Our primary systems for translation into English were further pretrained for multilingual MT in all four language directions and fine-tuned on the limited parallel data available for each language pair separately. We used online back-translation for data augmentation. The same systems were submitted as contrastive for translation out of English as the multilingual MT pretraining step seemed to harm the translation performance. Our primary systems for translation out of English were trained without the multilingual MT pretraining step. Other contrastive systems used additional pseudo-parallel data mined from monolingual corpora for pretraining.

Negative Lexical Constraints in Neural Machine Translation
Josef Jon | Dusan Varis | Michal Novák | João Paulo Aires | Ondřej Bojar
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

This paper explores negative lexical constraining in English to Czech neural machine translation. Negative lexical constraining is used to prohibit certain words or expressions in the translation produced by the NMT model. We compared various methods based on modifying either the decoding process or the training data. The comparison was performed on two tasks: paraphrasing and feedback-based translation refinement. We also studied how the methods “evade” the constraints, meaning that the disallowed expression is still present in the output, but in a changed form, most interestingly the case where a different surface form (for example different inflection) is produced. We propose a way to mitigate the issue through training with stemmed negative constraints, so that the ability of the model to induce different forms of a word might be used to prohibit the usage of all possible forms of the constraint. This helps to some extent, but the problem still persists in many cases.

Turning Whisper into Real-Time Transcription System
Dominik Macháček | Raj Dabre | Ondřej Bojar
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

2022

Automatic Minuting: A Pipeline Method for Generating Minutes from Multi-Party Meeting Proceedings
Kartik Shinde | Tirthankar Ghosal | Muskaan Singh | Ondrej Bojar
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

Multimodality for NLP-Centered Applications: Resources, Advances and Frontiers
Muskan Garg | Seema Wazarkar | Muskaan Singh | Ondřej Bojar
Proceedings of the Thirteenth Language Resources and Evaluation Conference

With the development of multimodal systems and natural language generation techniques, the resurgence of multimodal datasets has attracted significant research interests, which aims to provide new information to enrich the representation of textual data. However, there remains a lack of a comprehensive survey for this task. To this end, we take the first step and present a thorough review of this research field. This paper provides an overview of a publicly available dataset with different modalities according to the applications. Furthermore, we discuss the new frontier and give our thoughts. We hope this survey of multimodal datasets can provide the community with quick access and a general picture of the multimodal dataset for specific Natural Language Processing (NLP) applications and motivates future researches. In this context, we release the collection of all multimodal datasets easily accessible here: https://github.com/drmuskangarg/Multimodal-datasets

ALIGNMEET: A Comprehensive Tool for Meeting Annotation, Alignment, and Evaluation
Peter Polák | Muskaan Singh | Anna Nedoluzhko | Ondřej Bojar
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Summarization is a challenging problem, and even more challenging is to manually create, correct, and evaluate the summaries. The severity of the problem grows when the inputs are multi-party dialogues in a meeting setup. To facilitate the research in this area, we present ALIGNMEET, a comprehensive tool for meeting annotation, alignment, and evaluation. The tool aims to provide an efficient and clear interface for fast annotation while mitigating the risk of introducing errors. Moreover, we add an evaluation mode that enables a comprehensive quality evaluation of meeting minutes. To the best of our knowledge, there is no such tool available. We release the tool as open source. It is also directly installable from PyPI.

The Second Automatic Minuting (AutoMin) Challenge: Generating and Evaluating Minutes from Multi-Party Meetings
Tirthankar Ghosal | Marie Hledíková | Muskaan Singh | Anna Nedoluzhko | Ondřej Bojar
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

We would host the AutoMin generation chal- lenge at INLG 2023 as a follow-up of the first AutoMin shared task at Interspeech 2021. Our shared task primarily concerns the automated generation of meeting minutes from multi-party meeting transcripts. In our first venture, we ob- served the difficulty of the task and highlighted a number of open problems for the community to discuss, attempt, and solve. Hence, we invite the Natural Language Generation (NLG) com- munity to take part in the second iteration of AutoMin. Like the first, the second AutoMin will feature both English and Czech meetings and the core task of summarizing the manually- revised transcripts into bulleted minutes. A new challenge we are introducing this year is to devise efficient metrics for evaluating the quality of minutes. We will also host an optional track to generate minutes for European parliamentary sessions. We carefully curated the datasets for the above tasks. Our ELITR Minuting Corpus has been recently accepted to LREC 2022 and publicly released. We are already preparing a new test set for evaluating the new shared tasks. We hope to carry forward the learning from the first AutoMin and instigate more community attention and interest in this timely yet chal- lenging problem. INLG, the premier forum for the NLG community, would be an appropriate venue to discuss the challenges and future of Automatic Minuting. The main objective of the AutoMin GenChal at INLG 2023 would be to come up with efficient methods to auto- matically generate meeting minutes and design evaluation metrics to measure the quality of the minutes.

CUNI Submission to the BUCC 2022 Shared Task on Bilingual Term Alignment
Borek Požár | Klára Tauchmanová | Kristýna Neumannová | Ivana Kvapilíková | Ondřej Bojar
Proceedings of the BUCC Workshop within LREC 2022

We present our submission to the BUCC Shared Task on bilingual term alignment in comparable specialized corpora. We devised three approaches using static embeddings with post-hoc alignment, the Monoses pipeline for unsupervised phrase-based machine translation, and contextualized multilingual embeddings. We show that contextualized embeddings from pretrained multilingual models lead to similar results as static embeddings but further improvement can be achieved by task-specific fine-tuning. Retrieving term pairs from the running phrase tables of the Monoses systems can match this enhanced performance and leads to an average precision of 0.88 on the train set.

Sentence Ambiguity, Grammaticality and Complexity Probes
Sunit Bhattacharya | Vilém Zouhar | Ondrej Bojar
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

It is unclear whether, how and where large pre-trained language models capture subtle linguistic traits like ambiguity, grammaticality and sentence complexity. We present results of automatic classification of these traits and compare their viability and patterns across representation types. We demonstrate that template-based datasets with surface-level artifacts should not be used for probing, careful comparisons with baselines should be done and that t-SNE plots should not be used to determine the presence of a feature among dense vectors representations. We also show how features might be highly localized in the layers for these models and get lost in the upper layers.

CUNI-KIT System for Simultaneous Speech Translation Task at IWSLT 2022
Peter Polák | Ngoc-Quan Pham | Tuan Nam Nguyen | Danni Liu | Carlos Mullov | Jan Niehues | Ondřej Bojar | Alexander Waibel
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

In this paper, we describe our submission to the Simultaneous Speech Translation at IWSLT 2022. We explore strategies to utilize an offline model in a simultaneous setting without the need to modify the original model. In our experiments, we show that our onlinization algorithm is almost on par with the offline setting while being 3x faster than offline in terms of latency on the test set. We also show that the onlinized offline model outperforms the best IWSLT2021 simultaneous system in medium and high latency regimes and is almost on par in the low latency regime. We make our system publicly available.

CUNI Submission to MT4All Shared Task
Ivana Kvapilíková | Ondrej Bojar
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

This paper describes our submission to the MT4All Shared Task in unsupervised machine translation from English to Ukrainian, Kazakh and Georgian in the legal domain. In addition to the standard pipeline for unsupervised training (pretraining followed by denoising and back-translation), we used supervised training on a pseudo-parallel corpus retrieved from the provided mono-lingual corpora. Our system scored significantly higher than the baseline hybrid unsupervised MT system.

Findings of the IWSLT 2022 Evaluation Campaign
Antonios Anastasopoulos | Loïc Barrault | Luisa Bentivogli | Marcely Zanon Boito | Ondřej Bojar | Roldano Cattoni | Anna Currey | Georgiana Dinu | Kevin Duh | Maha Elbayad | Clara Emmanuel | Yannick Estève | Marcello Federico | Christian Federmann | Souhir Gahbiche | Hongyu Gong | Roman Grundkiewicz | Barry Haddow | Benjamin Hsu | Dávid Javorský | Vĕra Kloudová | Surafel Lakew | Xutai Ma | Prashant Mathur | Paul McNamee | Kenton Murray | Maria Nǎdejde | Satoshi Nakamura | Matteo Negri | Jan Niehues | Xing Niu | John Ortega | Juan Pino | Elizabeth Salesky | Jiatong Shi | Matthias Sperber | Sebastian Stüker | Katsuhito Sudoh | Marco Turchi | Yogesh Virkar | Alexander Waibel | Changhan Wang | Shinji Watanabe
Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)

The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech to speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, (viii) Isometric speech translation. A total of 27 teams participated in at least one of the shared tasks. This paper details, for each shared task, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received and the results that were achieved.

This paper presents the results of the General Machine Translation Task organised as part of the Conference on Machine Translation (WMT) 2022. In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of four different domains. We evaluate system outputs with human annotators using two different techniques: reference-based direct assessment and (DA) and a combination of DA and scalar quality metric (DA+SQM).

CUNI-Bergamot Submission at WMT22 General Translation Task
Josef Jon | Martin Popel | Ondřej Bojar
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the CUNI-Bergamot submission for the WMT22 General translation task. We compete in English-Czech direction. Our submission further explores block backtranslation techniques. Compared to the previous work, we measure performance in terms of COMET score and named entities translation accuracy. We evaluate performance of MBR decoding compared to traditional mixed backtranslation training and we show a possible synergy when using both of the techniques simultaneously. The results show that both approaches are effective means of improving translation quality and they yield even better results when combined.

Continuous Rating as Reliable Human Evaluation of Simultaneous Speech Translation
Dávid Javorský | Dominik Macháček | Ondřej Bojar
Proceedings of the Seventh Conference on Machine Translation (WMT)

Simultaneous speech translation (SST) can be evaluated on simulated online events where human evaluators watch subtitled videos and continuously express their satisfaction by pressing buttons (so called Continuous Rating). Continuous Rating is easy to collect, but little is known about its reliability, or relation to comprehension of foreign language document by SST users. In this paper, we contrast Continuous Rating with factual questionnaires on judges with different levels of source language knowledge. Our results show that Continuous Rating is easy and reliable SST quality assessment if the judges have at least limited knowledge of the source language. Our study indicates users’ preferences on subtitle layout and presentation style and, most importantly, provides a significant evidence that users with advanced source language knowledge prefer low latency over fewer re-translations.

ELITR Minuting Corpus: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech
Anna Nedoluzhko | Muskaan Singh | Marie Hledíková | Tirthankar Ghosal | Ondřej Bojar
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Taking minutes is an essential component of every meeting, although the goals, style, and procedure of this activity (“minuting” for short) can vary. Minuting is a rather unstructured writing activity and is affected by who is taking the minutes and for whom the intended minutes are. With the rise of online meetings, automatic minuting would be an important benefit for the meeting participants as well as for those who might have missed the meeting. However, automatically generating meeting minutes is a challenging problem due to a variety of factors including the quality of automatic speech recorders (ASRs), availability of public meeting data, subjective knowledge of the minuter, etc. In this work, we present the first of its kind dataset on Automatic Minuting. We develop a dataset of English and Czech technical project meetings which consists of transcripts generated from ASRs, manually corrected, and minuted by several annotators. Our dataset, AutoMin, consists of 113 (English) and 53 (Czech) meetings, covering more than 160 hours of meeting content. Upon acceptance, we will publicly release (aaa.bbb.ccc) the dataset as a set of meeting transcripts and minutes, excluding the recordings for privacy reasons. A unique feature of our dataset is that most meetings are equipped with more than one minute, each created independently. Our corpus thus allows studying differences in what people find important while taking the minutes. We also provide baseline experiments for the community to explore this novel problem further. To the best of our knowledge AutoMin is probably the first resource on minuting in English and also in a language other than English (Czech).

Automated Evaluation Metric for Terminology Consistency in MT
Kirill Semenov | Ondřej Bojar
Proceedings of the Seventh Conference on Machine Translation (WMT)

The most widely used metrics for machine translation tackle sentence-level evaluation. However, at least for professional domains such as legal texts, it is crucial to measure the consistency of the translation of the terms throughout the whole text. This paper introduces an automated metric for the term consistency evaluation in machine translation (MT). To demonstrate the metric’s performance, we used the Czech-to-English translated texts from the ELITR 2021 agreement corpus and the outputs of the MT systems that took part in WMT21 News Task. We show different modes of our evaluation algorithm and try to interpret the differences in the ranking of the translation systems based on sentence-level metrics and our approach. We also demonstrate that the proposed metric scores significantly differ from the widespread automated metric scores, and correlate with the human assessment.

Genre Transfer in NMT:Creating Synthetic Spoken Parallel Sentences using Written Parallel Data
Nalin Kumar | Ondrej Bojar
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Text style transfer (TST) aims to control attributes in a given text without changing the content. The matter gets complicated when the boundary separating two styles gets blurred. We can notice similar difficulties in the case of parallel datasets in spoken and written genres. Genuine spoken features like filler words and repetitions in the existing spoken genre parallel datasets are often cleaned during transcription and translation, making the texts closer to written datasets. This poses several problems for spoken genre-specific tasks like simultaneous speech translation. This paper seeks to address the challenge of improving spoken language translations. We start by creating a genre classifier for individual sentences and then try two approaches for data augmentation using written examples:(1) a novel method that involves assembling and disassembling spoken and written neural machine translation (NMT) models, and (2) a rule-based method to inject spoken features. Though the observed results for (1) are not promising, we get some interesting insights into the solution. The model proposed in (1) fine-tuned on the synthesized data from (2) produces naturally looking spoken translations for written-to-spoken genre transfer in En-Hi translation systems. We use this system to produce a second-stage En-Hi synthetic corpus, which however lacks appropriate alignments of explicit spoken features across the languages. For the final evaluation, we fine-tune Hi-En spoken translation systems on the synthesized parallel corpora. We observe that the parallel corpus synthesized using our rule-based method produces the best results.

Overview of the 9th Workshop on Asian Translation
Toshiaki Nakazawa | Hideya Mino | Isao Goto | Raj Dabre | Shohei Higashiyama | Shantipriya Parida | Anoop Kunchukuttan | Makoto Morishita | Ondřej Bojar | Chenhui Chu | Akiko Eriguchi | Kaori Abe | Yusuke Oda | Sadao Kurohashi
Proceedings of the 9th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 9th workshop on Asian translation (WAT2022). For the WAT2022, 8 teams submitted their translation results for the human evaluation. We also accepted 4 research papers. About 300 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Team ÚFAL at CMCL 2022 Shared Task: Figuring out the correct recipe for predicting Eye-Tracking features using Pretrained Language Models
Sunit Bhattacharya | Rishu Kumar | Ondrej Bojar
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Eye-Tracking data is a very useful source of information to study cognition and especially language comprehension in humans. In this paper, we describe our systems for the CMCL 2022 shared task on predicting eye-tracking information. We describe our experiments withpretrained models like BERT and XLM and the different ways in which we used those representations to predict four eye-tracking features. Along with analysing the effect of using two different kinds of pretrained multilingual language models and different ways of pooling the token-level representations, we also explore how contextual information affects the performance of the systems. Finally, we also explore if factors like augmenting linguistic information affect the predictions. Our submissions achieved an average MAE of 5.72 and ranked 5th in the shared task. The average MAE showed further reduction to 5.25 in post task evaluation.

Hausa Visual Genome: A Dataset for Multi-Modal English to Hausa Machine Translation
Idris Abdulmumin | Satya Ranjan Dash | Musa Abdullahi Dawud | Shantipriya Parida | Shamsuddeen Muhammad | Ibrahim Sa’id Ahmad | Subhadarshi Panda | Ondřej Bojar | Bashir Shehu Galadanci | Bello Shehu Bello
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Multi-modal Machine Translation (MMT) enables the use of visual information to enhance the quality of translations, especially where the full context is not available to enable the unambiguous translation in standard machine translation. Despite the increasing popularity of such technique, it lacks sufficient and qualitative datasets to maximize the full extent of its potential. Hausa, a Chadic language, is a member of the Afro-Asiatic language family. It is estimated that about 100 to 150 million people speak the language, with more than 80 million indigenous speakers. This is more than any of the other Chadic languages. Despite the large number of speakers, the Hausa language is considered as a low resource language in natural language processing (NLP). This is due to the absence of enough resources to implement most of the tasks in NLP. While some datasets exist, they are either scarce, machine-generated or in the religious domain. Therefore, there is the need to create training and evaluation data for implementing machine learning tasks and bridging the research gap in the language. This work presents the Hausa Visual Genome (HaVG), a dataset that contains the description of an image or a section within the image in Hausa and its equivalent in English. The dataset was prepared by automatically translating the English description of the images in the Hindi Visual Genome (HVG). The synthetic Hausa data was then carefully postedited, taking into cognizance the respective images. The data is made of 32,923 images and their descriptions that are divided into training, development, test, and challenge test set. The Hausa Visual Genome is the first dataset of its kind and can be used for Hausa-English machine translation, multi-modal research, image description, among various other natural language processing and generation tasks.

2021

Sequence Length is a Domain: Length-based Overfitting in Transformer Models
Dusan Varis | Ondřej Bojar
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2-regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, the neural-based systems perform worse on very long sequences when compared to the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results which suggest that the issue might also be in the mismatch between the length distributions of the training and validation data combined with the aforementioned tendency of the neural networks to overfit to the training data. We demonstrate on a simple string editing tasks and a machine translation task that the Transformer model performance drops significantly when facing sequences of length diverging from the length distribution in the training data. Additionally, we show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.

Operating a Complex SLT System with Speakers and Human Interpreters
Ondřej Bojar | Vojtěch Srdečný | Rishu Kumar | Otakar Smrž | Felix Schneider | Barry Haddow | Phil Williams | Chiara Canton
Proceedings of the 1st Workshop on Automatic Spoken Language Translation in Real-World Settings (ASLTRW)

We describe our experience with providing automatic simultaneous spoken language translation for an event with human interpreters. We provide a detailed overview of the systems we use, focusing on their interconnection and the issues it brings. We present our tools to monitor the pipeline and a web application to present the results of our SLT pipeline to the end users. Finally, we discuss various challenges we encountered, their possible solutions and we suggest improvements for future deployments.

Constrained Decoding for Technical Term Retention in English-Hindi MT
Niyati Bafna | Martin Vastl | Ondřej Bojar
Proceedings of the 18th International Conference on Natural Language Processing (ICON)

Technical terms may require special handling when the target audience is bilingual, depending on the cultural and educational norms of the society in question. In particular, certain translation scenarios may require “term retention” i.e. preserving of the source language technical terms in the target language output to produce a fluent and comprehensible code-switched sentence. We show that a standard transformer-based machine translation model can be adapted easily to perform this task with little or no damage to the general quality of its output. We present an English-to-Hindi model that is trained to obey a “retain” signal, i.e. it can perform the required code-mixing on a list of terms, possibly unseen, provided at runtime. We perform automatic evaluation using BLEU as well as F1 metrics on the list of retained terms; we also collect manual judgments on the quality of the output sentences.

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2021) featured this year four shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Multilingual speech translation, (iv) Low-resource speech translation. A total of 22 teams participated in at least one of the tasks. This paper describes each shared task, data and evaluation metrics, and reports results of the received submissions.

Detecting Post-Edited References and Their Effect on Human Evaluation
Věra Kloudová | Ondřej Bojar | Martin Popel
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

This paper provides a quick overview of possible methods how to detect that reference translations were actually created by post-editing an MT system. Two methods based on automatic metrics are presented: BLEU difference between the suspected MT and some other good MT and BLEU difference using additional references. These two methods revealed a suspicion that the WMT 2020 Czech reference is based on MT. The suspicion was confirmed in a manual analysis by finding concrete proofs of the post-editing procedure in particular sentences. Finally, a typology of post-editing changes is presented where typical errors or changes made by the post-editor or errors adopted from the MT are classified.

CUNI Systems for WMT21: Terminology Translation Shared Task
Josef Jon | Michal Novák | João Paulo Aires | Dusan Varis | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation

This paper describes Charles University sub-mission for Terminology translation Shared Task at WMT21. The objective of this task is to design a system which translates certain terms based on a provided terminology database, while preserving high overall translation quality. We competed in English-French language pair. Our approach is based on providing the desired translations alongside the input sentence and training the model to use these provided terms. We lemmatize the terms both during the training and inference, to allow the model to learn how to produce correct surface forms of the words, when they differ from the forms provided in the terminology database. Our submission ranked second in Exact Match metric which evaluates the ability of the model to produce desired terms in the translation.

An Empirical Performance Analysis of State-of-the-Art Summarization Models for Automatic Minuting
Muskaan Singh | Tirthankar Ghosal | Ondrej Bojar
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages
Josef Jon | João Paulo Aires | Dusan Varis | Ondřej Bojar
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Lexically constrained machine translation allows the user to manipulate the output sentence by enforcing the presence or absence of certain words and phrases. Although current approaches can enforce terms to appear in the translation, they often struggle to make the constraint word form agree with the rest of the generated output. Our manual analysis shows that 46% of the errors in the output of a baseline constrained model for English to Czech translation are related to agreement. We investigate mechanisms to allow neural machine translation to infer the correct word inflection given lemmatized constraints. In particular, we focus on methods based on training the model with constraints provided as part of the input sequence. Our experiments on English-Czech language pair show that this approach improves translation of constrained terms in both automatic and manual evaluation by reducing errors in agreement. Our approach thus eliminates inflection errors, without introducing new errors or decreasing overall quality of the translation.

Explainable Quality Estimation: CUNI Eval4NLP Submission
Peter Polák | Muskaan Singh | Ondřej Bojar
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

This paper describes our participating system in the shared task Explainable quality estimation of 2nd Workshop on Evaluation & Comparison of NLP Systems. The task of quality estimation (QE, a.k.a. reference-free evaluation) is to predict the quality of MT output at inference time without access to reference translations. In this proposed work, we first build a word-level quality estimation model, then we finetune this model for sentence-level QE. Our proposed models achieve near state-of-the-art results. In the word-level QE, we place 2nd and 3rd on the supervised Ro-En and Et-En test sets. In the sentence-level QE, we achieve a relative improvement of 8.86% (Ro-En) and 10.6% (Et-En) in terms of the Pearson correlation coefficient over the baseline model.

This paper presents an automatic speech translation system aimed at live subtitling of conference presentations. We describe the overall architecture and key processing components. More importantly, we explain our strategy for building a complex system for end-users from numerous individual components, each of which has been tested only in laboratory conditions. The system is a working prototype that is routinely tested in recognizing English, Czech, and German speech and presenting it translated simultaneously into 42 target languages.

Backtranslation Feedback Improves User Confidence in MT, Not Quality
Vilém Zouhar | Michal Novák | Matúš Žilinec | Ondřej Bojar | Mateo Obregón | Robin L. Hill | Frédéric Blain | Marina Fomicheva | Lucia Specia | Lisa Yankovskaya
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influence these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.

SLTEV: Comprehensive Evaluation of Spoken Language Translation
Ebrahim Ansari | Ondřej Bojar | Barry Haddow | Mohammad Mahmoudi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

Automatic evaluation of Machine Translation (MT) quality has been investigated over several decades. Spoken Language Translation (SLT), esp. when simultaneous, needs to consider additional criteria and does not have a standard evaluation procedure and a widely used toolkit. To fill the gap, we develop SLTev, an open-source tool for assessing SLT in a comprehensive way. SLTev reports the quality, latency, and stability of an SLT candidate output based on the time-stamped transcript and reference translation into a target language. For quality, we rely on sacreBLEU which provides MT evaluation measures such as chrF or BLEU. For latency, we propose two new scoring techniques. For stability, we extend the previously defined measures with a normalized Flicker in our work. We also propose a new averaging of older measures. A preliminary version of SLTev was used in the IWSLT 2020 shared task. Moreover, a growing collection of test datasets directly accessible by SLTev are provided for system evaluation comparable across papers.

Neural Machine Translation Quality and Post-Editing Performance
Vilém Zouhar | Martin Popel | Ondřej Bojar | Aleš Tamchyna
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We test the natural expectation that using MT in professional translation saves human processing time. The last such study was carried out by Sanchez-Torron and Koehn (2016) with phrase-based MT, artificially reducing the translation quality. In contrast, we focus on neural MT (NMT) of high quality, which has become the state-of-the-art approach since then and also got adopted by most translation companies. Through an experimental study involving over 30 professional translators for English -> Czech translation, we examine the relationship between NMT performance and post-editing time and quality. Across all models, we found that better MT systems indeed lead to fewer changes in the sentences in this industry setting. The relation between system quality and post-editing time is however not straightforward and, contrary to the results on phrase-based MT, BLEU is definitely not a stable predictor of the time or final output quality.

A Fine-Grained Analysis of BERTScore
Michael Hanna | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation

BERTScore, a recently proposed automatic metric for machine translation quality, uses BERT, a large pre-trained language model to evaluate candidate translations with respect to a gold translation. Taking advantage of BERT’s semantic and syntactic abilities, BERTScore seeks to avoid the flaws of earlier approaches like BLEU, instead scoring candidate translations based on their semantic similarity to the gold sentence. However, BERT is not infallible; while its performance on NLP tasks set a new state of the art in general, studies of specific syntactic and semantic phenomena have shown where BERT’s performance deviates from that of humans more generally. This naturally raises the questions we address in this paper: what are the strengths and weaknesses of BERTScore? Do they relate to known weaknesses on the part of BERT? We find that while BERTScore can detect when a candidate differs from a reference in important content words, it is less sensitive to smaller errors, especially if the candidate is lexically or stylistically similar to the reference.

This paper presents the results of the shared tasks from the 8th workshop on Asian translation (WAT2021). For the WAT2021, 28 teams participated in the shared tasks and 24 teams submitted their translation results for the human evaluation. We also accepted 5 research papers. About 2,100 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu Lo | Craig Stewart | George Foster | Alon Lavie | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation

This paper presents the results of the WMT21 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT21 News Translation Task with automatic metrics on two different domains: news and TED talks. All metrics were evaluated on how well they correlate at the system- and segment-level with human ratings. Contrary to previous years’ editions, this year we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation has been shown to be more reliable, (ii) we were able to evaluate all metrics on two different domains using translations of the same MT systems, (iii) we added 5 additional translations coming from the same system during system development. In addition, we designed three challenge sets that evaluate the robustness of all automatic metrics. We present an extensive analysis on how well metrics perform on three language pairs: English to German, English to Russian and Chinese to English. We further show the impact of different reference translations on reference-based metrics and compare our expert-based MQM annotation with the DA scores acquired by WMT.

This paper presents the results of the newstranslation task, the multilingual low-resourcetranslation for Indo-European languages, thetriangular translation task, and the automaticpost-editing task organised as part of the Con-ference on Machine Translation (WMT) 2021.In the news task, participants were asked tobuild machine translation systems for any of10 language pairs, to be evaluated on test setsconsisting mainly of news stories. The taskwas also opened up to additional test suites toprobe specific aspects of translation.

NLPHut’s Participation at WAT2021
Shantipriya Parida | Subhadarshi Panda | Ketan Kotwal | Amulya Ratna Dash | Satya Ranjan Dash | Yashvardhan Sharma | Petr Motlicek | Ondřej Bojar
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This paper provides the description of shared tasks to the WAT 2021 by our team “NLPHut”. We have participated in the English→Hindi Multimodal translation task, English→Malayalam Multimodal translation task, and Indic Multi-lingual translation task. We have used the state-of-the-art Transformer model with language tags in different settings for the translation task and proposed a novel “region-specific” caption generation approach using a combination of image CNN and LSTM for the Hindi and Malayalam image captioning. Our submission tops in English→Malayalam Multimodal translation task (text-only translation, and Malayalam caption), and ranks second-best in English→Hindi Multimodal translation task (text-only translation, and Hindi caption). Our submissions have also performed well in the Indic Multilingual translation tasks.

CUNI systems for WMT21: Multilingual Low-Resource Translation for Indo-European Languages Shared Task
Josef Jon | Michal Novák | João Paulo Aires | Dusan Varis | Ondřej Bojar
Proceedings of the Sixth Conference on Machine Translation

This paper describes Charles University sub-mission for Terminology translation shared task at WMT21. The objective of this task is to design a system which translates certain terms based on a provided terminology database, while preserving high overall translation quality. We competed in English-French language pair. Our approach is based on providing the desired translations alongside the input sentence and training the model to use these provided terms. We lemmatize the terms both during the training and inference, to allow the model to learn how to produce correct surface forms of the words, when they differ from the forms provided in the terminology database.

CUNI Systems in WMT21: Revisiting Backtranslation Techniques for English-Czech NMT
Petr Gebauer | Ondřej Bojar | Vojtěch Švandelík | Martin Popel
Proceedings of the Sixth Conference on Machine Translation

We describe our two NMT systems submitted to the WMT2021 shared task in English-Czech news translation: CUNI-DocTransformer (document-level CUBBITT) and CUNI-Marian-Baselines. We improve the former with a better sentence-segmentation pre-processing and a post-processing for fixing errors in numbers and units. We use the latter for experiments with various backtranslation techniques.

2020

ELITR (European Live Translator) project aims to create a speech translation system for simultaneous subtitling of conferences and online meetings targetting up to 43 languages. The technology is tested by the Supreme Audit Office of the Czech Republic and by alfaview®, a German online conferencing system. Other project goals are to advance document-level and multilingual machine translation, automatic speech recognition, and automatic minuting.

Results of the WMT20 Metrics Shared Task
Nitika Mathur | Johnny Wei | Markus Freitag | Qingsong Ma | Ondřej Bojar
Proceedings of the Fifth Conference on Machine Translation

This paper presents the results of the WMT20 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics. Ten research groups submitted 27 metrics, four of which are reference-less “metrics”. In addition, we computed five baseline metrics, including sentBLEU, BLEU, TER and using the SacreBLEU scorer. All metrics were evaluated on how well they correlate at the system-, document- and segment-level with the WMT20 official human scores. We present an extensive analysis on influence of different reference translations on metric reliability, how well automatic metrics score human translations, and we also flag major discrepancies between metric and human scores when evaluating MT systems. Finally, we investigate whether we can use automatic metrics to flag incorrect human ratings.

CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20
Ivana Kvapilíková | Tom Kocmi | Ondřej Bojar
Proceedings of the Fifth Conference on Machine Translation

This paper presents a description of CUNI systems submitted to the WMT20 task on unsupervised and very low-resource supervised machine translation between German and Upper Sorbian. We experimented with training on synthetic data and pre-training on a related language pair. In the fully unsupervised scenario, we achieved 25.5 and 23.7 BLEU translating from and into Upper Sorbian, respectively. Our low-resource systems relied on transfer learning from German-Czech parallel data and achieved 57.4 BLEU and 56.1 BLEU, which is an improvement of 10 BLEU points over the baseline trained only on the available small German-Upper Sorbian parallel corpus.

Overview of the 7th Workshop on Asian Translation
Toshiaki Nakazawa | Hideki Nakayama | Chenchen Ding | Raj Dabre | Shohei Higashiyama | Hideya Mino | Isao Goto | Win Pa Pa | Anoop Kunchukuttan | Shantipriya Parida | Ondřej Bojar | Sadao Kurohashi
Proceedings of the 7th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 7th workshop on Asian translation (WAT2020). For the WAT2020, 20 teams participated in the shared tasks and 14 teams submitted their translation results for the human evaluation. We also received 12 research paper submissions out of which 7 were accepted. About 500 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

ODIANLP’s Participation in WAT2020
Shantipriya Parida | Petr Motlicek | Amulya Ratna Dash | Satya Ranjan Dash | Debasish Kumar Mallick | Satya Prakash Biswal | Priyanka Pattnaik | Biranchi Narayan Nayak | Ondřej Bojar
Proceedings of the 7th Workshop on Asian Translation

This paper describes the ODIANLP submission to WAT 2020. We have participated in the English-Hindi Multimodal task and Indic task. We have used the state-of-the-art Transformer model for the translation task and InceptionResNetV2 for the Hindi Image Captioning task. Our submission tops in English->Hindi Multimodal task in its track and Odia<->English translation tasks. Also, our submissions performed well in the Indic Multilingual tasks.

Large Corpus of Czech Parliament Plenary Hearings
Jonáš Kratochvil | Peter Polák | Ondřej Bojar
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a large corpus of Czech parliament plenary sessions. The corpus consists of approximately 1200 hours of speech data and corresponding text transcriptions. The whole corpus has been segmented to short audio segments making it suitable for both training and evaluation of automatic speech recognition (ASR) systems. The source language of the corpus is Czech, which makes it a valuable resource for future research as only a few public datasets are available in the Czech language. We complement the data release with experiments of two baseline ASR systems trained on the presented data: the more traditional approach implemented in the Kaldi ASRtoolkit which combines hidden Markov models and deep neural networks (NN) and a modern ASR architecture implemented in Jaspertoolkit which uses deep NNs in an end-to-end fashion.

CUNI Neural ASR with Phoneme-Level Intermediate Step for~Non-Native~SLT at IWSLT 2020
Peter Polák | Sangeet Sagar | Dominik Macháček | Ondřej Bojar
Proceedings of the 17th International Conference on Spoken Language Translation

In this paper, we present our submission to the Non-Native Speech Translation Task for IWSLT 2020. Our main contribution is a proposed speech recognition pipeline that consists of an acoustic model and a phoneme-to-grapheme model. As an intermediate representation, we utilize phonemes. We demonstrate that the proposed pipeline surpasses commercially used automatic speech recognition (ASR) and submit it into the ASR track. We complement this ASR with off-the-shelf MT systems to take part also in the speech translation track.

ELITR Non-Native Speech Translation at IWSLT 2020
Dominik Macháček | Jonáš Kratochvíl | Sangeet Sagar | Matúš Žilinec | Ondřej Bojar | Thai-Son Nguyen | Felix Schneider | Philip Williams | Yuekun Yao
Proceedings of the 17th International Conference on Spoken Language Translation

This paper is an ELITR system submission for the non-native speech translation task at IWSLT 2020. We describe systems for offline ASR, real-time ASR, and our cascaded approach to offline SLT and real-time SLT. We select our primary candidates from a pool of pre-existing systems, develop a new end-to-end general ASR system, and a hybrid ASR trained on non-native speech. The provided small validation set prevents us from carrying out a complex validation, but we submit all the unselected candidates for contrastive evaluation on the test set.

The evaluation campaign of the International Conference on Spoken Language Translation (IWSLT 2020) featured this year six challenge tracks: (i) Simultaneous speech translation, (ii) Video speech translation, (iii) Offline speech translation, (iv) Conversational speech translation, (v) Open domain translation, and (vi) Non-native speech translation. A total of teams participated in at least one of the tracks. This paper introduces each track’s goal, data and evaluation metrics, and reports the results of the received submissions.

This paper presents our progress towards deploying a versatile communication platform in the task of highly multilingual live speech translation for conferences and remote meetings live subtitling. The platform has been designed with a focus on very low latency and high flexibility while allowing research prototypes of speech and text processing tools to be easily connected, regardless of where they physically run. We outline our architecture solution and also briefly compare it with the ELG platform. Technical details are provided on the most important components and we summarize the test deployment events we ran so far.

This paper presents the results of the news translation task and the similar language translation task, both organised alongside the Conference on Machine Translation (WMT) 2020. In the news task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting mainly of news stories. The task was also opened up to additional test suites to probe specific aspects of translation. In the similar language translation task, participants built machine translation systems for translating between closely related pairs of languages.

OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation
Shantipriya Parida | Satya Ranjan Dash | Ondřej Bojar | Petr Motlicek | Priyanka Pattnaik | Debasish Kumar Mallick
Proceedings of the WILDRE5– 5th Workshop on Indian Language Data: Resources and Evaluation

The preparation of parallel corpora is a challenging task, particularly for languages that suffer from under-representation in the digital world. In a multi-lingual country like India, the need for such parallel corpora is stringent for several low-resource languages. In this work, we provide an extended English-Odia parallel corpus, OdiEnCorp 2.0, aiming particularly at Neural Machine Translation (NMT) systems which will help translate English↔Odia. OdiEnCorp 2.0 includes existing English-Odia corpora and we extended the collection by several other methods of data acquisition: parallel data scraping from many websites, including Odia Wikipedia, but also optical character recognition (OCR) to extract parallel data from scanned images. Our OCR-based data extraction approach for building a parallel corpus is suitable for other low resource languages that lack in online content. The resulting OdiEnCorp 2.0 contains 98,302 sentences and 1.69 million English and 1.47 million Odia tokens. To the best of our knowledge, OdiEnCorp 2.0 is the largest Odia-English parallel corpus covering different domains and available freely for non-commercial and research purposes.

COSTRA 1.0: A Dataset of Complex Sentence Transformations
Petra Barancikova | Ondřej Bojar
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present COSTRA 1.0, a dataset of complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. This first version of the dataset is limited to sentences in Czech but the construction method is universal and we plan to use it also for other languages. The dataset consist of 4,262 unique sentences with average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation. The hope is that with this dataset, we should be able to test semantic properties of sentence embeddings and perhaps even to find some topologically interesting “skeleton” in the sentence embedding space. A preliminary analysis using LASER, multi-purpose multi-lingual sentence embeddings suggests that the LASER space does not exhibit the desired properties.

Proceedings of the 7th Workshop on Asian Translation
Toshiaki Nakazawa | Hideki Nakayama | Chenchen Ding | Raj Dabre | Anoop Kunchukuttan | Win Pa Pa | Ondřej Bojar | Shantipriya Parida | Isao Goto | Hidaya Mino | Hiroshi Manabe | Katsuhito Sudoh | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 7th Workshop on Asian Translation

Two Huge Title and Keyword Generation Corpora of Research Articles
Erion Çano | Ondřej Bojar
Proceedings of the Twelfth Language Resources and Evaluation Conference

Recent developments in sequence-to-sequence learning with neural networks have considerably improved the quality of automatically generated text summaries and document keywords, stipulating the need for even bigger training corpora. Metadata of research articles are usually easy to find online and can be used to perform research on various tasks. In this paper, we introduce two huge datasets for text summarization (OAGSX) and keyword generation (OAGKX) research, containing 34 million and 23 million records, respectively. The data were retrieved from the Open Academic Graph which is a network of research profiles and publications. We carefully processed each record and also tried several extractive and abstractive methods of both tasks to create performance baselines for other researchers. We further illustrate the performance of those methods previewing their outputs. In the near future, we would like to apply topic modeling on the two sets to derive subsets of research articles from more specific disciplines.

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining
Ivana Kvapilíková | Mikel Artetxe | Gorka Labaka | Eneko Agirre | Ondřej Bojar
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.

WMT20 Document-Level Markable Error Exploration
Vilém Zouhar | Tereza Vojtěchová | Ondřej Bojar
Proceedings of the Fifth Conference on Machine Translation

Even though sentence-centric metrics are used widely in machine translation evaluation, document-level performance is at least equally important for professional usage. In this paper, we bring attention to detailed document-level evaluation focused on markables (expressions bearing most of the document meaning) and the negative impact of various markable error phenomena on the translation. For an annotation experiment of two phases, we chose Czech and English documents translated by systems submitted to WMT20 News Translation Task. These documents are from the News, Audit and Lease domains. We show that the quality and also the kind of errors varies significantly among the domains. This systematic variance is in contrast to the automatic evaluation results. We inspect which specific markables are problematic for MT systems and conclude with an analysis of the effect of markable error types on the MT performance measured by humans and automatic evaluation tools.

Efficiently Reusing Old Models Across Languages via Transfer Learning
Tom Kocmi | Ondřej Bojar
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Recent progress in neural machine translation (NMT) is directed towards larger neural networks trained on an increasing amount of hardware resources. As a result, NMT models are costly to train, both financially, due to the electricity and hardware cost, and environmentally, due to the carbon footprint. It is especially true in transfer learning for its additional cost of training the “parent” model before transferring knowledge and training the desired “child” model. In this paper, we propose a simple method of re-using an already trained model for different language pairs where there is no need for modifications in model architecture. Our approach does not need a separate parent model for each investigated language pair, as it is typical in NMT transfer learning. To show the applicability of our method, we recycle a Transformer model trained by different researchers and use it to seed models for different language pairs. We achieve better translation quality and shorter convergence times than when training from random initialization.

Outbound Translation User Interface Ptakopět: A Pilot Study
Vilém Zouhar | Ondřej Bojar
Proceedings of the Twelfth Language Resources and Evaluation Conference

It is not uncommon for Internet users to have to produce a text in a foreign language they have very little knowledge of and are unable to verify the translation quality. We call the task “outbound translation” and explore it by introducing an open-source modular system Ptakopět. Its main purpose is to inspect human interaction with MT systems enhanced with additional subsystems, such as backward translation and quality estimation. We follow up with an experiment on (Czech) human annotators tasked to produce questions in a language they do not speak (German), with the help of Ptakopět. We focus on three real-world use cases (communication with IT support, describing administrative issues and asking encyclopedic questions) from which we gain insight into different strategies users take when faced with outbound translation tasks. Round trip translation is known to be unreliable for evaluating MT systems but our experimental evaluation documents that it works very well for users, at least on MT systems of mid-range quality.

2019

Idiap NMT System for WAT 2019 Multimodal Translation Task
Shantipriya Parida | Ondřej Bojar | Petr Motlicek
Proceedings of the 6th Workshop on Asian Translation

This paper describes the Idiap submission to WAT 2019 for the English-Hindi Multi-Modal Translation Task. We have used the state-of-the-art Transformer model and utilized the IITB English-Hindi parallel corpus as an additional data source. Among the different tracks of the multi-modal task, we have participated in the “Text-Only” track for the evaluation and challenge test sets. Our submission tops in its track among the competitors in terms of both automatic and manual evaluation. Based on automatic scores, our text-only submission also outperforms systems that consider visual information in the “multi-modal translation” task.

SAO WMT19 Test Suite: Machine Translation of Audit Reports
Tereza Vojtěchová | Michal Novák | Miloš Klouček | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes a machine translation test set of documents from the auditing domain and its use as one of the “test suites” in the WMT19 News Translation Task for translation directions involving Czech, English and German. Our evaluation suggests that current MT systems optimized for the general news domain can perform quite well even in the particular domain of audit reports. The detailed manual evaluation however indicates that deep factual knowledge of the domain is necessary. For the naked eye of a non-expert, translations by many systems seem almost perfect and automatic MT evaluation with one reference is practically useless for considering these details. Furthermore, we show on a sample document from the domain of agreements that even the best systems completely fail in preserving the semantics of the agreement, namely the identity of the parties.

Efficiency Metrics for Data-Driven Models: A Text Summarization Case Study
Erion Çano | Ondřej Bojar
Proceedings of the 12th International Conference on Natural Language Generation

Using data-driven models for solving text summarization or similar tasks has become very common in the last years. Yet most of the studies report basic accuracy scores only, and nothing is known about the ability of the proposed models to improve when trained on more data. In this paper, we define and propose three data efficiency metrics: data score efficiency, data time deficiency and overall data efficiency. We also propose a simple scheme that uses those metrics and apply it for a more comprehensive evaluation of popular methods on text summarization and title generation tasks. For the latter task, we process and release a huge collection of 35 million abstract-title pairs from scientific articles. Our results reveal that among the tested models, the Transformer is the most efficient on both tasks.

Unsupervised Pretraining for Neural Machine Translation Using Elastic Weight Consolidation
Dušan Variš | Ondřej Bojar
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

This work presents our ongoing research of unsupervised pretraining in neural machine translation (NMT). In our method, we initialize the weights of the encoder and decoder with two language models that are trained with monolingual data and then fine-tune the model on parallel data using Elastic Weight Consolidation (EWC) to avoid forgetting of the original language modeling task. We compare the regularization by EWC with the previous work that focuses on regularization by language modeling objectives. The positive result is that using EWC with the decoder achieves BLEU scores similar to the previous work. However, the model converges 2-3 times faster and does not require the original unlabeled training data during the fine-tuning stage. In contrast, the regularization using EWC is less effective if the original and new tasks are not closely related. We show that initializing the bidirectional NMT encoder with a left-to-right language model and forcing the model to remember the original left-to-right language modeling task limits the learning capacity of the encoder for the whole bidirectional context.

CUNI Submission for Low-Resource Languages in WMT News 2019
Tom Kocmi | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the CUNI submission to the WMT 2019 News Translation Shared Task for the low-resource languages: Gujarati-English and Kazakh-English. We participated in both language pairs in both translation directions. Our system combines transfer learning from a different high-resource language pair followed by training on backtranslated monolingual data. Thanks to the simultaneous training in both directions, we can iterate the backtranslation process. We are using the Transformer model in a constrained submission.

English-Czech Systems in WMT19: Document-Level Transformer
Martin Popel | Dominik Macháček | Michal Auersperger | Ondřej Bojar | Pavel Pecina
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe our NMT systems submitted to the WMT19 shared task in English→Czech news translation. Our systems are based on the Transformer model implemented in either Tensor2Tensor (T2T) or Marian framework. We aimed at improving the adequacy and coherence of translated documents by enlarging the context of the source and target. Instead of translating each sentence independently, we split the document into possibly overlapping multi-sentence segments. In case of the T2T implementation, this “document-level”-trained system achieves a +0.6 BLEU improvement (p < 0.05) relative to the same system applied on isolated sentences. To assess the potential effect document-level models might have on lexical coherence, we performed a semi-automatic analysis, which revealed only a few sentences improved in this aspect. Thus, we cannot draw any conclusions from this week evidence.

Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges
Qingsong Ma | Johnny Wei | Ondřej Bojar | Yvette Graham
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translations systems competing in the WMT19 News Translation Task with automatic metrics. 13 research groups submitted 24 metrics, 10 of which are reference-less “metrics” and constitute submissions to the joint task with WMT19 Quality Estimation Task, “QE as a Metric”. In addition, we computed 11 baseline metrics, with 8 commonly applied baselines (BLEU, SentBLEU, NIST, WER, PER, TER, CDER, and chrF) and 3 reimplementations (chrF+, sacreBLEU-BLEU, and sacreBLEU-chrF). Metrics were evaluated on the system level, how well a given metric correlates with the WMT19 official manual ranking, and segment level, how well the metric correlates with human judgements of segment quality. This year, we use direct assessment (DA) as our only form of manual evaluation.

Overview of the 6th Workshop on Asian Translation
Toshiaki Nakazawa | Nobushige Doi | Shohei Higashiyama | Chenchen Ding | Raj Dabre | Hideya Mino | Isao Goto | Win Pa Pa | Anoop Kunchukuttan | Yusuke Oda | Shantipriya Parida | Ondřej Bojar | Sadao Kurohashi
Proceedings of the 6th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 6th workshop on Asian translation (WAT2019) including Ja↔En, Ja↔Zh scientific paper translation subtasks, Ja↔En, Ja↔Ko, Ja↔En patent translation subtasks, Hi↔En, My↔En, Km↔En, Ta↔En mixed domain subtasks and Ru↔Ja news commentary translation task. For the WAT2019, 25 teams participated in the shared tasks. We also received 10 research paper submissions out of which 61 were accepted. About 400 translation results were submitted to the automatic evaluation server, and selected submis- sions were manually evaluated.

CUNI Systems for the Unsupervised News Translation Task in WMT 2019
Ivana Kvapilíková | Dominik Macháček | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper we describe the CUNI translation system used for the unsupervised news shared task of the ACL 2019 Fourth Conference on Machine Translation (WMT19). We follow the strategy of Artetxe ae at. (2018b), creating a seed phrase-based system where the phrase table is initialized from cross-lingual embedding mappings trained on monolingual data, followed by a neural machine translation system trained on synthetic parallel data. The synthetic corpus was produced from a monolingual corpus by a tuned PBMT model refined through iterative back-translation. We further focus on the handling of named entities, i.e. the part of vocabulary where the cross-lingual embedding mapping suffers most. Our system reaches a BLEU score of 15.3 on the German-Czech WMT19 shared task.

A Test Suite and Manual Evaluation of Document-Level NMT at WMT19
Kateřina Rysová | Magdaléna Rysová | Tomáš Musil | Lucie Poláková | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

As the quality of machine translation rises and neural machine translation (NMT) is moving from sentence to document level translations, it is becoming increasingly difficult to evaluate the output of translation systems. We provide a test suite for WMT19 aimed at assessing discourse phenomena of MT systems participating in the News Translation Task. We have manually checked the outputs and identified types of translation errors that are relevant to document-level translation.

Keyphrase Generation: A Text Summarization Struggle
Erion Çano | Ondřej Bojar
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Authors’ keyphrases assigned to scientific articles are essential for recognizing content and topic aspects. Most of the proposed supervised and unsupervised methods for keyphrase generation are unable to produce terms that are valuable but do not appear in the text. In this paper, we explore the possibility of considering the keyphrase string as an abstractive summary of the title and the abstract. First, we collect, process and release a large dataset of scientific paper metadata that contains 2.2 million records. Then we experiment with popular text summarization neural architectures. Despite using advanced deep learning models, large quantities of data and many days of computation, our systematic evaluation on four test datasets reveals that the explored text summarization methods could not produce better keyphrases than the simpler unsupervised methods, or the existing supervised ones.

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019. Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.

Proceedings of the 6th Workshop on Asian Translation
Toshiaki Nakazawa | Chenchen Ding | Raj Dabre | Anoop Kunchukuttan | Nobushige Doi | Yusuke Oda | Ondřej Bojar | Shantipriya Parida | Isao Goto | Hidaya Mino
Proceedings of the 6th Workshop on Asian Translation

2018

Translating Short Segments with NMT: A Case Study in English-to-Hindi
Shantipriya Parida | Ondřej Bojar
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

This paper presents a case study in translating short image captions of the Visual Genome dataset from English into Hindi using out-of-domain data sets of varying size. We experiment with three NMT models: the shallow and deep sequence-tosequence and the Transformer model as implemented in Marian toolkit. Phrase-based Moses serves as the baseline. The results indicate that the Transformer model outperforms others in the large data setting in a number of automatic metrics and manual evaluation, and it also produces the fewest truncated sentences. Transformer training is however very sensitive to the hyperparameters, so it requires more experimenting. The deep sequence-to-sequence model produced more flawless outputs in the small data setting and it was generally more stable, at the cost of more training iterations.

Are BLEU and Meaning Representation in Opposition?
Ondřej Cífka | Ondřej Bojar
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

One of possible ways of obtaining continuous-space sentence representations is by training neural machine translation (NMT) systems. The recent attention mechanism however removes the single point in the neural network from which the source sentence representation can be extracted. We propose several variations of the attentive NMT architecture bringing this meeting point back. Empirical evaluation suggests that the better the translation quality, the worse the learned sentence representations serve in a wide range of classification and similarity tasks.

EvalD Reference-Less Discourse Evaluation for WMT18
Ondřej Bojar | Jiří Mírovský | Kateřina Rysová | Magdaléna Rysová
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present the results of automatic evaluation of discourse in machine translation (MT) outputs using the EVALD tool. EVALD was originally designed and trained to assess the quality of human writing, for native speakers and foreign-language learners. MT has seen a tremendous leap in translation quality at the level of sentences and it is thus interesting to see if the human-level evaluation is becoming relevant.

CUNI Submissions in WMT18
Tom Kocmi | Roman Sudarikov | Ondřej Bojar
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We participated in the WMT 2018 shared news translation task in three language pairs: English-Estonian, English-Finnish, and English-Czech. Our main focus was the low-resource language pair of Estonian and English for which we utilized Finnish parallel data in a simple method. We first train a “parent model” for the high-resource language pair followed by adaptation on the related low-resource language pair. This approach brings a substantial performance boost over the baseline system trained only on Estonian-English parallel data. Our systems are based on the Transformer architecture. For the English to Czech translation, we have evaluated our last year models of hybrid phrase-based approach and neural machine translation mainly for comparison purposes.

CUNI Basque-to-English Submission in IWSLT18
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 15th International Conference on Spoken Language Translation

We present our submission to the IWSLT18 Low Resource task focused on the translation from Basque-to-English. Our submission is based on the current state-of-the-art self-attentive neural network architecture, Transformer. We further improve this strong baseline by exploiting available monolingual data using the back-translation technique. We also present further improvements gained by a transfer learning, a technique that trains a model using a high-resource language pair (Czech-English) and then fine-tunes the model using the target low-resource language pair (Basque-English).

Neural Monkey: The Current State and Beyond
Jindřich Helcl | Jindřich Libovický | Tom Kocmi | Tomáš Musil | Ondřej Cífka | Dušan Variš | Ondřej Bojar
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Trivial Transfer Learning for Low-Resource Neural Machine Translation
Tom Kocmi | Ondřej Bojar
Proceedings of the Third Conference on Machine Translation: Research Papers

Transfer learning has been proven as an effective technique for neural machine translation under low-resource conditions. Existing methods require a common target language, language relatedness, or specific training tricks and regimes. We present a simple transfer learning method, where we first train a “parent” model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus. This “child” model performs significantly better than the baseline trained for low-resource pair only. We are the first to show this for targeting different languages, and we observe the improvements even for unrelated languages with different alphabets.

CUNI NMT System for WAT 2018 Translation Tasks
Tom Kocmi | Shantipriya Parida | Ondřej Bojar
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation: 5th Workshop on Asian Translation

Results of the WMT18 Metrics Shared Task: Both characters and embeddings achieve good performance
Qingsong Ma | Ondřej Bojar | Yvette Graham
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents the results of the WMT18 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT18 News Translation Task with automatic metrics. We collected scores of 10 metrics and 8 research groups. In addition to that, we computed scores of 8 standard metrics (BLEU, SentBLEU, chrF, NIST, WER, PER, TER and CDER) as baselines. The collected scores were evaluated in terms of system-level correlation (how well each metric’s scores correlate with WMT18 official manual ranking of systems) and in terms of segment-level correlation (how often a metric agrees with humans in judging the quality of a particular sentence relative to alternate outputs). This year, we employ a single kind of manual evaluation: direct assessment (DA).

The WMT’18 Morpheval test suites for English-Czech, English-German, English-Finnish and Turkish-English
Franck Burlot | Yves Scherrer | Vinit Ravishankar | Ondřej Bojar | Stig-Arne Grönroos | Maarit Koponen | Tommi Nieminen | François Yvon
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

Progress in the quality of machine translation output calls for new automatic evaluation procedures and metrics. In this paper, we extend the Morpheval protocol introduced by Burlot and Yvon (2017) for the English-to-Czech and English-to-Latvian translation directions to three additional language pairs, and report its use to analyze the results of WMT 2018’s participants for these language pairs. Considering additional, typologically varied source and target languages also enables us to draw some generalizations regarding this morphology-oriented evaluation procedure.

Findings of the 2018 Conference on Machine Translation (WMT18)
Ondřej Bojar | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Philipp Koehn | Christof Monz
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2018. Participants were asked to build machine translation systems for any of 7 language pairs in both directions, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. This year, we also opened up the task to additional test sets to probe specific aspects of translation.

Testsuite on Czech–English Grammatical Contrasts
Silvie Cinková | Ondřej Bojar
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present a pilot study of machine translation of selected grammatical contrasts between Czech and English in WMT18 News Translation Task. For each phenomenon, we run a dedicated test which checks if the candidate translation expresses the phenomenon as expected or not. The proposed type of analysis is not an evaluation in the strict sense because the phenomenon can be correctly translated in various ways and we anticipate only one. What is nevertheless interesting are the differences between various MT systems and the single reference translation in their general tendency in handling the given phenomenon.

2017

CUNI Experiments for WMT17 Metrics Task
David Mareček | Ondřej Bojar | Ondřej Hübsch | Rudolf Rosa | Dušan Variš
Proceedings of the Second Conference on Machine Translation

Curriculum Learning and Minibatch Bucketing in Neural Machine Translation
Tom Kocmi | Ondřej Bojar
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

We examine the effects of particular orderings of sentence pairs on the on-line training of neural machine translation (NMT). We focus on two types of such orderings: (1) ensuring that each minibatch contains sentences similar in some aspect and (2) gradual inclusion of some sentence types as the training progresses (so called “curriculum learning”). In our English-to-Czech experiments, the internal homogeneity of minibatches has no effect on the training but some of our “curricula” achieve a small improvement over the baseline.

Results of the WMT17 Neural MT Training Task
Ondřej Bojar | Jindřich Helcl | Tom Kocmi | Jindřich Libovický | Tomáš Musil
Proceedings of the Second Conference on Machine Translation

Variable Mini-Batch Sizing and Pre-Trained Embeddings
Mostafa Abdou | Vladan Glončák | Ondřej Bojar
Proceedings of the Second Conference on Machine Translation

Producing Unseen Morphological Variants in Statistical Machine Translation
Matthias Huck | Aleš Tamchyna | Ondřej Bojar | Alexander Fraser
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Translating into morphologically rich languages is difficult. Although the coverage of lemmas may be reasonable, many morphological variants cannot be learned from the training data. We present a statistical translation system that is able to produce these inflected word forms. Different from most previous work, we do not separate morphological prediction from lexical choice into two consecutive steps. Our approach is novel in that it is integrated in decoding and takes advantage of context information from both the source language and the target language sides.

LanideNN: Multilingual Language Identification on Character Window
Tom Kocmi | Ondřej Bojar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In language identification, a common first step in natural language processing, we want to automatically determine the language of some input text. Monolingual language identification assumes that the given document is written in one language. In multilingual language identification, the document is usually in two or three languages and we just want their names. We aim one step further and propose a method for textual language identification where languages can change arbitrarily and the goal is to identify the spans of each of the languages. Our method is based on Bidirectional Recurrent Neural Networks and it performs well in monolingual and multilingual language identification tasks on six datasets covering 131 languages. The method keeps the accuracy also for short documents and across domains, so it is ideal for off-the-shelf use without preparation of training data.

CUNI submission in WMT17: Chimera goes neural
Roman Sudarikov | David Mareček | Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the Second Conference on Machine Translation

An Exploration of Word Embedding Initialization in Deep-Learning Tasks
Tom Kocmi | Ondřej Bojar
Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017)

Results of the WMT17 Metrics Shared Task
Ondřej Bojar | Yvette Graham | Amir Kamran
Proceedings of the Second Conference on Machine Translation

CUNI NMT System for WAT 2017 Translation Tasks
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

The paper presents this year’s CUNI submissions to the WAT 2017 Translation Task focusing on the Japanese-English translation, namely Scientific papers subtask, Patents subtask and Newswire subtask. We compare two neural network architectures, the standard sequence-to-sequence with attention (Seq2Seq) and an architecture using convolutional sentence encoder (FBConv2Seq), both implemented in the NMT framework Neural Monkey that we currently participate in developing. We also compare various types of preprocessing of the source Japanese sentences and their impact on the overall results. Furthermore, we include the results of our experiments with out-of-domain data obtained by combining the corpora provided for each subtask.

Proceedings of the Second Conference on Machine Translation
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Julia Kreutzer
Proceedings of the Second Conference on Machine Translation

Paying Attention to Multi-Word Expressions in Neural Machine Translation
Matīss Rikters | Ondřej Bojar
Proceedings of Machine Translation Summit XVI: Research Track

CUNI System for WMT17 Automatic Post-Editing Task
Dušan Variš | Ondřej Bojar
Proceedings of the Second Conference on Machine Translation

2016

Moses & Treex Hybrid MT Systems Bestiary
Rudolf Rosa | Martin Popel | Ondřej Bojar | David Mareček | Ondřej Dušek
Proceedings of the 2nd Deep Machine Translation Workshop

Results of the WMT16 Metrics Shared Task
Ondřej Bojar | Yvette Graham | Amir Kamran | Miloš Stanojević
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Pivoting Methods and Data for Czech-Vietnamese Translation via English
Duc Tam Hoang | Ondrej Bojar
Proceedings of the 19th Annual Conference of the European Association for Machine Translation

UFAL Submissions to the IWSLT 2016 MT Track
Ondřej Bojar | Ondřej Cífka | Jindřich Helcl | Tom Kocmi | Roman Sudarikov
Proceedings of the 13th International Conference on Spoken Language Translation

We present our submissions to the IWSLT 2016 machine translation task, as our first attempt to translate subtitles and one of our early experiments with neural machine translation (NMT). We focus primarily on English→Czech translation direction but perform also basic adaptation experiments for NMT with German and also the reverse direction. Three MT systems are tested: (1) our Chimera, a tight combination of phrase-based MT and deep linguistic processing, (2) Neural Monkey, our implementation of a NMT system in TensorFlow and (3) Nematus, an established NMT system.

Edinburgh’s Statistical Machine Translation Systems for WMT16
Philip Williams | Rico Sennrich | Maria Nădejde | Matthias Huck | Barry Haddow | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Target-Side Context for Discriminative Models in Statistical Machine Translation
Aleš Tamchyna | Alexander Fraser | Ondřej Bojar | Marcin Junczys-Dowmunt
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Using Term Position Similarity and Language Modeling for Bilingual Document Alignment
Thanh C. Le | Hoa Trong Vu | Jonathan Oberländer | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

TectoMT – a deep linguistic core of the combined Cimera MT system
Martin Popel | Roman Sudarikov | Ondřej Bojar | Rudolf Rosa | Jan Hajič
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products

CUNI-LMU Submissions in WMT2016: Chimera Constrained and Beaten
Aleš Tamchyna | Roman Sudarikov | Ondřej Bojar | Alexander Fraser
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

HUME: Human UCCA-Based Evaluation of Machine Translation
Alexandra Birch | Omri Abend | Ondřej Bojar | Barry Haddow
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Results of the WMT16 Tuning Shared Task
Bushra Jawaid | Amir Kamran | Miloš Stanojević | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Enriching Source for English-to-Urdu Machine Translation
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

This paper focuses on the generation of case markers for free word order languages that use case markers as phrasal clitics for marking the relationship between the dependent-noun and its head. The generation of such clitics becomes essential task especially when translating from fixed word order languages where syntactic relations are identified by the positions of the dependent-nouns. To address the problem of missing markers on source-side, artificial markers are added in source to improve alignments with its target counterparts. Up to 1 BLEU point increase is observed over the baseline on different test sets for English-to-Urdu.

Particle Swarm Optimization Submission for WMT16 Tuning Task
Viktor Kocur | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Verb sense disambiguation in Machine Translation
Roman Sudarikov | Ondřej Dušek | Martin Holub | Ondřej Bojar | Vincent Kríž
Proceedings of the Sixth Workshop on Hybrid Approaches to Translation (HyTra6)

We describe experiments in Machine Translation using word sense disambiguation (WSD) information. This work focuses on WSD in verbs, based on two different approaches – verbal patterns based on corpus pattern analysis and verbal word senses from valency frames. We evaluate several options of using verb senses in the source-language sentences as an additional factor for the Moses statistical machine translation system. Our results show a statistically significant translation quality improvement in terms of the BLEU metric for the valency frames approach, but in manual evaluation, both WSD methods bring improvements.

Bilingual Embeddings and Word Alignments for Translation Quality Estimation
Amal Abdelsalam | Ondřej Bojar | Samhaa El-Beltagy
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks
Jindřich Libovický | Jindřich Helcl | Marek Tlustý | Ondřej Bojar | Pavel Pecina
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

Dictionary-based Domain Adaptation of MT Systems without Retraining
Rudolf Rosa | Roman Sudarikov | Michal Novák | Martin Popel | Ondřej Bojar
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

2015

Findings of the 2015 Workshop on Statistical Machine Translation
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Barry Haddow | Matthias Huck | Chris Hokamp | Philipp Koehn | Varvara Logacheva | Christof Monz | Matteo Negri | Matt Post | Carolina Scarton | Lucia Specia | Marco Turchi
Proceedings of the Tenth Workshop on Statistical Machine Translation

Results of the WMT15 Metrics Shared Task
Miloš Stanojević | Amir Kamran | Philipp Koehn | Ondřej Bojar
Proceedings of the Tenth Workshop on Statistical Machine Translation

What a Transfer-Based System Brings to the Combination with PBMT
Aleš Tamchyna | Ondřej Bojar
Proceedings of the Fourth Workshop on Hybrid Approaches to Translation (HyTra)

CUNI in WMT15: Chimera Strikes Again
Ondřej Bojar | Aleš Tamchyna
Proceedings of the Tenth Workshop on Statistical Machine Translation

Results of the WMT15 Tuning Shared Task
Miloš Stanojević | Amir Kamran | Ondřej Bojar
Proceedings of the Tenth Workshop on Statistical Machine Translation

Proceedings of the Tenth Workshop on Statistical Machine Translation
Ondřej Bojar | Rajan Chatterjee | Christian Federmann | Barry Haddow | Chris Hokamp | Matthias Huck | Varvara Logacheva | Pavel Pecina
Proceedings of the Tenth Workshop on Statistical Machine Translation

TeamUFAL: WSD+EL as Document Retrieval
Petr Fanta | Roman Sudarikov | Ondřej Bojar
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

Not an Interlingua, But Close: Comparison of English AMRs to Chinese and Czech
Nianwen Xue | Ondřej Bojar | Jan Hajič | Martha Palmer | Zdeňka Urešová | Xiuhong Zhang
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Abstract Meaning Representations (AMRs) are rooted, directional and labeled graphs that abstract away from morpho-syntactic idiosyncrasies such as word category (verbs and nouns), word order, and function words (determiners, some prepositions). Because these syntactic idiosyncrasies account for many of the cross-lingual differences, it would be interesting to see if this representation can serve, e.g., as a useful, minimally divergent transfer layer in machine translation. To answer this question, we have translated 100 English sentences that have existing AMRs into Chinese and Czech to create AMRs for them. A cross-linguistic comparison of English to Chinese and Czech AMRs reveals both cases where the AMRs for the language pairs align well structurally and cases of linguistic divergence. We found that the level of compatibility of AMR between English and Chinese is higher than between English and Czech. We believe this kind of comparison is beneficial to further refining the annotation standards for each of the three languages and will lead to more compatible annotation guidelines between the languages.

Comparing Czech and English AMRs
Zdeňka Urešová | Jan Hajič | Ondřej Bojar
Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing

Twitter Crowd Translation – design and objectives
Eduard Šubert | Ondřej Bojar
Proceedings of Translating and the Computer 36

Two-Step Machine Translation with Lattices
Bushra Jawaid | Ondřej Bojar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The idea of two-step machine translation was introduced to divide the complexity of the search space into two independent steps: (1) lexical translation and reordering, and (2) conjugation and declination in the target language. In this paper, we extend the two-step machine translation structure by replacing state-of-the-art phrase-based machine translation with the hierarchical machine translation in the 1st step. We further extend the fixed string-based input format of the 2nd step with word lattices (Dyer et al., 2008); this provides the 2nd step with the opportunity to choose among a sample of possible reorderings instead of relying on the single best one as produced by the 1st step.

English to Urdu Statistical Machine Translation: Establishing a Baseline
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing

Proceedings of the Ninth Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Christian Federmann | Barry Haddow | Philipp Koehn | Christof Monz | Matt Post | Lucia Specia
Proceedings of the Ninth Workshop on Statistical Machine Translation

CUNI in WMT14: Chimera Still Awaits Bellerophon
Aleš Tamchyna | Martin Popel | Rudolf Rosa | Ondřej Bojar
Proceedings of the Ninth Workshop on Statistical Machine Translation

Findings of the 2014 Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Christian Federmann | Barry Haddow | Philipp Koehn | Johannes Leveling | Christof Monz | Pavel Pecina | Matt Post | Herve Saint-Amand | Radu Soricut | Lucia Specia | Aleš Tamchyna
Proceedings of the Ninth Workshop on Statistical Machine Translation

A Tagged Corpus and a Tagger for Urdu
Bushra Jawaid | Amir Kamran | Ondřej Bojar
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe a release of a sizeable monolingual Urdu corpus automatically tagged with part-of-speech tags. We extend the work of Jawaid and Bojar (2012) who use three different taggers and then apply a voting scheme to disambiguate among the different choices suggested by each tagger. We run this complex ensemble on a large monolingual corpus and release the tagged corpus. Additionally, we use this data to train a single standalone tagger which will hopefully significantly simplify Urdu processing. The standalone tagger obtains the accuracy of 88.74% on test data.

HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation
Ondřej Bojar | Vojtěch Diatka | Pavel Rychlý | Pavel Straňák | Vít Suchomel | Aleš Tamchyna | Daniel Zeman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both the corpora are freely available for non-commercial research and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.

Results of the WMT14 Metrics Shared Task
Matouš Macháček | Ondřej Bojar
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

PhraseFix: Statistical Post-Editing of TectoMT
Petra Galuščáková | Martin Popel | Ondřej Bojar
Proceedings of the Eighth Workshop on Statistical Machine Translation

Proceedings of the Eighth Workshop on Statistical Machine Translation
Ondrej Bojar | Christian Buck | Chris Callison-Burch | Barry Haddow | Philipp Koehn | Christof Monz | Matt Post | Herve Saint-Amand | Radu Soricut | Lucia Specia
Proceedings of the Eighth Workshop on Statistical Machine Translation

Results of the WMT13 Metrics Shared Task
Matouš Macháček | Ondřej Bojar
Proceedings of the Eighth Workshop on Statistical Machine Translation

Twenty Flavors of One Text
Daniel Zeman | Ondřej Bojar
Proceedings of the Workshop on Twenty Years of Bitext

Findings of the 2013 Workshop on Statistical Machine Translation
Ondřej Bojar | Christian Buck | Chris Callison-Burch | Christian Federmann | Barry Haddow | Philipp Koehn | Christof Monz | Matt Post | Radu Soricut | Lucia Specia
Proceedings of the Eighth Workshop on Statistical Machine Translation

Chimera – Three Heads for English-to-Czech Translation
Ondřej Bojar | Rudolf Rosa | Aleš Tamchyna
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

The Joy of Parallelism with CzEng 1.0
Ondřej Bojar | Zdeněk Žabokrtský | Ondřej Dušek | Petra Galuščáková | Martin Majliš | David Mareček | Jiří Maršík | Michal Novák | Martin Popel | Aleš Tamchyna
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the amount of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (incl. co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.

TerrorCat: a Translation Error Categorization-based MT Quality Metric
Mark Fishel | Rico Sennrich | Maja Popović | Ondřej Bojar
Proceedings of the Seventh Workshop on Statistical Machine Translation

Morphological Processing for English-Tamil Statistical Machine Translation
Loganathan Ramasamy | Ondřej Bojar | Zdeněk Žabokrtský
Proceedings of the Workshop on Machine Translation and Parsing in Indian Languages

Towards a Predicate-Argument Evaluation for MT
Ondřej Bojar | Dekai Wu
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

Tagger Voting for Urdu
Bushra Jawaid | Ondřej Bojar
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

We introduce a substantial update of the Prague Czech-English Dependency Treebank, a parallel corpus manually annotated at the deep syntactic layer of linguistic representation. The English part consists of the Wall Street Journal (WSJ) section of the Penn Treebank. The Czech part was translated from the English source sentence by sentence. This paper gives a high level overview of the underlying linguistic theory (the so-called tectogrammatical annotation) with some details of the most important features like valency annotation, ellipsis reconstruction or coreference.

Selecting Data for English-to-Czech Machine Translation
Aleš Tamchyna | Petra Galuščáková | Amir Kamran | Miloš Stanojević | Ondřej Bojar
Proceedings of the Seventh Workshop on Statistical Machine Translation

Probes in a Taxonomy of Factored Phrase-Based Models
Ondřej Bojar | Bushra Jawaid | Amir Kamran
Proceedings of the Seventh Workshop on Statistical Machine Translation

Automatic MT Error Analysis: Hjerson Helping Addicter
Jan Berka | Ondřej Bojar | Mark Fishel | Maja Popović | Daniel Zeman
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a complex, open source tool for detailed machine translation error analysis providing the user with automatic error detection and classification, several monolingual alignment algorithms as well as with training and test corpus browsing. The tool is the result of a merge of automatic error detection and classification of Hjerson (Popović, 2011) and Addicter (Zeman et al., 2011) into the pipeline and web visualization of Addicter. It classifies errors into categories similar to those of Vilar et al. (2006), such as: morphological, reordering, missing words, extra words and lexical errors. The graphical user interface shows alignments in both training corpus and test data; the different classes of errors are colored. Also, the summary of errors can be displayed to provide an overall view of the MT system's weaknesses. The tool was developed in Linux, but it was tested on Windows too.

Terra: a Collection of Translation Error-Annotated Corpora
Mark Fishel | Ondřej Bojar | Maja Popović
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Recently the first methods of automatic diagnostics of machine translation have emerged; since this area of research is relatively young, the efforts are not coordinated. We present a collection of translation error-annotated corpora, consisting of automatically produced translations and their detailed manual translation error analysis. Using the collected corpora we evaluate the available state-of-the-art methods of MT diagnostics and assess, how well the methods perform, how they compare to each other and whether they can be useful in practice.

2011

Two-step translation with grammatical post-processing
David Mareček | Rudolf Rosa | Petra Galuščáková | Ondřej Bojar
Proceedings of the Sixth Workshop on Statistical Machine Translation

Approximating a Deep-Syntactic Metric for MT Evaluation and Tuning
Matouš Macháček | Ondřej Bojar
Proceedings of the Sixth Workshop on Statistical Machine Translation

A Grain of Salt for the WMT Manual Evaluation
Ondřej Bojar | Miloš Ercegovčević | Martin Popel | Omar Zaidan
Proceedings of the Sixth Workshop on Statistical Machine Translation

Improving Translation Model by Monolingual Data
Ondřej Bojar | Aleš Tamchyna
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

2010 Failures in English-Czech Phrase-Based MT
Ondřej Bojar | Kamil Kos
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

Building a Bilingual ValLex Using Treebank Token Alignment: First Observations
Jana Šindlerová | Ondřej Bojar
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We explore the potential and limitations of a concept of building a bilingual valency lexicon based on the alignment of nodes in a parallel treebank. Our aim is to build an electronic Czech->English Valency Lexicon by collecting equivalences from bilingual treebank data and storing them in two already existing electronic valency lexicons, PDT-VALLEX and Engvallex. For this task a special annotation interface has been built upon the TrEd editor, allowing quick and easy collecting of frame equivalences in either of the source lexicons. The issues encountered so far include limitations of technical character, theory-dependent limitations and limitations concerning the achievable degree of quality of human annotation. The issues of special interest for both linguists and MT specialists involved in the project include linguistically motivated non-balance between the frame equivalents, either in number or in type of valency participants. The first phases of annotation so far attest the assumption that there is a unique correspondence between the functors of the translation-equivalent frames. Also, hardly any linguistically significant non-balance between the frames has been found, which is partly promising considering the linguistic theory used and partly caused by little stylistic variety of the annotated corpus texts.

Data Issues in English-to-Hindi Machine Translation
Ondřej Bojar | Pavel Straňák | Daniel Zeman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Statistical machine translation to morphologically richer languages is a challenging task and more so if the source and target languages differ in word order. Current state-of-the-art MT systems thus deliver mediocre results. Adding more parallel data often helps improve the results; if it doesn't, it may be caused by various problems such as different domains, bad alignment or noise in the new data. In this paper we evaluate the English-to-Hindi MT task from this data perspective. We discuss several available parallel data sources and provide cross-evaluation results on their combinations using two freely available statistical MT systems. We demonstrate various problems encountered in the data and describe automatic methods of data cleaning and normalization. We also show that the contents of two independently distributed data sets can unexpectedly overlap, which negatively affects translation quality. Together with the error analysis, we also present a new tool for viewing aligned corpora, which makes it easier to detect difficult parts in the data even for a developer not speaking the target language.

Evaluating Utility of Data Sources in a Large Parallel Czech-English Corpus CzEng 0.9
Ondřej Bojar | Adam Liška | Zdeněk Žabokrtský
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

CzEng 0.9 is the third release of a large parallel corpus of Czech and English. For the current release, CzEng was extended by significant amount of texts from various types of sources, including parallel web pages, electronically available books and subtitles. This paper describes and evaluates filtering techniques employed in the process in order to avoid misaligned or otherwise damaged parallel sentences in the collection. We estimate the precision and recall of two sets of filters. The first set was used to process the data before their inclusion into CzEng. The filters from the second set were newly created to improve the filtering process for future releases of CzEng. Given the overall amount and variance of sources of the data, our experiments illustrate the utility of parallel data sources with respect to extractable parallel segments. As a similar behaviour can be expected for other language pairs, our results can be interpreted as guidelines indicating which sources should other researchers exploit first.

Tackling Sparse Data Issue in Machine Translation Evaluation
Ondřej Bojar | Kamil Kos | David Mareček
Proceedings of the ACL 2010 Conference Short Papers

2009

English-Czech MT in 2008
Ondřej Bojar | David Mareček | Václav Novák | Martin Popel | Jan Ptáček | Jan Rouš | Zdeněk Žabokrtský
Proceedings of the Fourth Workshop on Statistical Machine Translation

Computer-aided translation backed by machine translation
Ondřej Odcházal | Ondřej Bojar
Proceedings of Translating and the Computer 31

2008

Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation
Ondřej Bojar | Jan Hajič
Proceedings of the Third Workshop on Statistical Machine Translation

CzEng 0.7: Parallel Corpus with Community-Supplied Translations
Ondřej Bojar | Miroslav Janíček | Zdeněk Žabokrtský | Pavel Češka | Peter Beňa
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes CzEng 0.7, a new release of Czech-English parallel corpus freely available for research and educational purposes. We provide basic statistics of the corpus and focus on data produced by a community of volunteers. Anonymous contributors manually correct the output of a machine translation (MT) system, generating on average 2000 sentences a month, 70% of which are indeed correct translations. We compare the utility of community-supplied and of professionally translated training data for a baseline English-to-Czech MT system.

2007

English-to-Czech Factored Machine Translation
Ondřej Bojar
Proceedings of the Second Workshop on Statistical Machine Translation

Moses: Open Source Toolkit for Statistical Machine Translation
Philipp Koehn | Hieu Hoang | Alexandra Birch | Chris Callison-Burch | Marcello Federico | Nicola Bertoldi | Brooke Cowan | Wade Shen | Christine Moran | Richard Zens | Chris Dyer | Ondřej Bojar | Alexandra Constantin | Evan Herbst
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

Czech-English Word Alignment
Ondřej Bojar | Magdelena Prokopová
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe an experiment with Czech-English word alignment. Half a thousand sentences were manually annotated by two annotators in parallel and the most frequent reasons for disagreement are described. We evaluate the accuracy of GIZA++ alignment toolkit on the data and identify that lemmatization of the Czech part can reduce alignment error to a half. Furthermore we document that about 38% of tokens difficult for GIZA++ were difficult for humans already.

2005

An MT System Recycled
Ondřej Bojar | Petr Homola | Vladislav Kuboň
Proceedings of Machine Translation Summit X: Posters

This paper describes an attempt to recycle parts of the Czech-to-Russian machine translation system (MT) in the new Czech-to-English MT system. The paper describes the overall architecture of the new system and the details of the modules which have been added. A special attention is paid to the problem of named entity recognition and to the method of automatic acquisition of lexico-syntactic information for the bilingual dictionary of the system.

Problems of Reusing an Existing MT System
Ondřej Bojar | Petr Homola | Vladislav Kuboň
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

Co-authors

Martin Popel 24

Matthias Huck 22

Shantipriya Parida 21

Yvette Graham 20

Matteo Negri 20

Marco Turchi 18

Rajen Chatterjee 16

Aleš Tamchyna 16

Toshiaki Nakazawa 15

Lucia Specia 14

Antonio Jimeno Yepes 13

Peter Polák 13

Dominik Macháček 12

Makoto Morishita 10

Aurelie Neveol 10

Mariana Neves 10

Roman Grundkiewicz 9

Dávid Javorský 9

Karin Verspoor 9

Vilém Zouhar 9

Markus Freitag 8

Sadao Kurohashi 8

Masaaki Nagata 8

Michal Novák 8

Sebastian Stüker 8

Roman Sudarikov 8

Katsuhito Sudoh 8

Christian Buck 7

Marcello Federico 7

Alexander Fraser 7

Shohei Higashiyama 7

Bushra Jawaid 7

Anoop Kunchukuttan 7

Ivana Kvapilíková 7

David Mareček 7

Kenton Murray 7

Maja Popović 7

Muskaan Singh 7

Loic Barrault 6

Roldano Cattoni 6

Marta R. Costa-jussà 6

Chenchen Ding 6

André F. T. Martins 6

Elizabeth Salesky 6

Rico Sennrich 6

Philip Williams 6

François Yvon 6

Zdeněk Žabokrtský 6

Antonios Anastasopoulos 5

Satya Ranjan Dash 5

Akiko Eriguchi 5

Tirthankar Ghosal 5

Satoshi Nakamura 5

Sangeet Sagar 5

Felix Schneider 5

Matthias Sperber 5

Miloš Stanojević 5

Marcos Zampieri 5

Idris Abdulmumin 4

João Paulo Aires 4

Ebrahim Ansari 4

Rachel Bawden 4

Fethi Bougares 4

Chiara Canton 4

Anton Dvorkovich 4

Petra Galuščáková 4

Jindřich Helcl 4

Jonáš Kratochvíl 4

Varvara Logacheva 4

Petr Motlicek 4

Hideki Nakayama 4

Anna Nedoluzhko 4

Thai-Son Nguyen 4

Ibrahim Said Ahmad 3

Eleftherios Avramidis 3

Luisa Bentivogli 3

Sunit Bhattacharya 3

Frédéric Blain 3

Franck Burlot 3

Chris Callison-Burch 3

Mauro Cettolo 3

Ondřej Cífka 3

Ondřej Dušek 3

Yannick Estève 3

Dario Franceschini 3

Marie Hledíková 3

Miroslav Hrabal 3

Jindřich Libovický 3

Matouš Macháček 3

Prashant Mathur 3

Evgeny Matusov 3

John Philip McCrae 3

Shamsuddeen Hassan Muhammad 3

Tomáš Musil 3

Maria Nadejde 3

Atul Kr. Ojha 3

Ngoc-Quan Pham 3

Mariya Shmatova 3

Ivan Simonini 3

Brian Thompson 3

Zdenka Uresova 3

Changhan Wang 3

Shinji Watanabe 3

Rodolfo Zevallos 3

Matúš Žilinec 3

Petra Barancikova 2

Pushpak Bhattacharyya 2

Magdalena Biesialska 2

Alexandra Birch 2

Marine Carpuat 2

Silvie Cinková 2

Amulya Ratna Dash 2

Nobushige Doi 2

Qianqian Dong 2

Souhir Gahbiche 2

Liane Guillou 2

Marzena Karpinska 2

Kazutaka Kinugawa 2

Věra Kloudová 2

Guneet Singh Kohli 2

Mateusz Krubiński 2

Vladislav Kubon 2

Mohammad Mahmoudi 2

Debasish Kumar Mallick 2

Hiroshi Manabe 2

Benjamin Marie 2

Nitika Mathur 2

Kristýna Neumannová 2

John E. Ortega 2

Subhadarshi Panda 2

Priyanka Pattnaik 2

Jan-Thorsten Peter 2

Mārcis Pinnis 2

Raphael Rubino 2

Kateřina Rysová 2

Magdaléna Rysová 2

Herve Saint-Amand 2

Carolina Scarton 2

Nivedita Sethiya 2

Claytone Sikasote 2

Steinþór Steingrímsson 2

Craig Stewart 2

Pavel Straňák 2

Jörg Tiedemann 2

Tereza Vojtěchová 2

Patrick Wilken 2

Lisa Yankovskaya 2

Jana Šindlerová 2

Eduard Šubert 2

Amal Abdelsalam 1

Mostafa Abdou 1

Milind Agarwal 1

Victor Agostinelli 1

Sweta Agrawal 1

Ibrahim Sa’id Ahmad 1

Farhad Akhbardeh 1

Tamer Alkhouli 1

Alexandre Allauzen 1

Tanel Alumäe 1

Kwabena Amponsah-Kaakyire 1

Arkady Arkhangorodsky 1

Ekaterina Artemova 1

Mikel Artetxe 1

Michal Auersperger 1

Lauriane Aufrant 1

Amittai Axelrod 1

Jasmijn Bastings 1

Bello Shehu Bello 1

Nicola Bertoldi 1

Satya Prakash Biswal 1

Fabienne Braune 1

Jacob Bremerman 1

Vishrav Chaudhary 1

Khalid Choukri 1

Alexandra Chronopoulou 1

Alexandra Constantin 1

Musa Abdullahi Dawud 1

Sayan Deb Sarkar 1

Thierry Declerck 1

Debashish Dhal 1

Vojtěch Diatka 1

Georgiana Dinu 1

Konstantin Dranch 1

Sergey Dukanov 1

Nadir Durrani 1

Samhaa R. El-Beltagy 1

Clara Emmanuel 1

Miloš Ercegovčević 1

Cristina España-Bonet 1

Marina Fomicheva 1

George Foster 1

Bashir Shehu Galadanci 1

Ondrej Glembek 1

Vladan Glončák 1

Adelheid Glott 1

Cristian Grozea 1

Stig-Arne Grönroos 1

Michael Hanna 1

Leonie Harter 1

Kenneth Heafield 1

Almut Silja Hildebrand 1

Robin L. Hill 1

Duc Tam Hoang 1

Christopher Homan 1

Shujian Huang (书剑黄) 1

Ondřej Hübsch 1

Hirofumi Inaguma 1

Miroslav Janíček 1

Vojtěch John 1

Marcin Junczys-Dowmunt 1

Daniela Jurášová 1

Habeebah Kakudi 1

Yasumasa Kano 1

Marek Kasztelnik 1

Daniel Khashabi 1

Madeleine Kittner 1

Miloš Klouček 1

František Kmječ 1

Rebecca Knowles 1

Elena Knyazeva 1

Maarit Koponen 1

Fortuné Kponou 1

Julia Kreutzer 1

Vincent Kríž 1

Surafel Lakew 1

Howard Lakougna 1

Thomas Lavergne 1

Smruti Smita Lenka 1

Johannes Leveling 1

Yvonne Lichtblau 1

Nikola Ljubešić 1

Nicholas Lourie 1

Jessica Lundin 1

Martin Majliš 1

Shervin Malmasi 1

Jiří Maršík 1

Chandresh Maurya 1

Chandresh Kumar Maurya 1

Salima Mdhaffar 1

Marie Mikulová 1

Christine Moran 1

Yasmin Moslem 1

Carlos Mullov 1

Jiří Mírovský 1

Mathias Müller 1

Biranchi Narayan Nayak 1

Tuan-Nam Nguyen 1

Tommi Nieminen 1

Václav Novák 1

Jon Oberlander 1

Mateo Obregón 1

Ondřej Odcházal 1

Martha Palmer 1

Jarmila Panevová 1

Sergio Penkale 1

Stefano Perrella 1

Lucie Poláková 1

Adam Pospíšil 1

Piotr Połeć 1

Borek Požár 1

Lorenzo Proietti 1

Magdelena Prokopová 1

Loganathan Ramasamy 1

Vinit Ravishankar 1

Matīss Rikters 1

Elijah Rippeth 1

Roland Roller 1

Pavel Rychlý 1

Shashikanta Sahoo 1

Kalyanamalini Sahoo 1

Ashwin Sankar 1

Balaram Sarkar 1

Beatrice Savoldi 1

Yves Scherrer 1

Armin Schweinfurth 1

Sambit Sekhar 1

Jiří Semecký 1

Kirill Semenov 1

Yashvardhan Sharma 1

Sambit Shekhar 1

Kartik Shinde 1

Vojtěch Srdečný 1

Sebastian Stücker 1

Allahsera Auguste Tapo 1

Klára Tauchmanová 1

Philippe Thomas 1

Marek Tlustý 1

Saskia Trescher 1

Iryna Tryhubyshyn 1

Yogesh Virkar 1

Valentin Vydrin 1

Mingxuan Wang 1

Seema Wazarkar 1

Matthew Wiesner 1

Phil Williams 1

Marcely Zanon Boito 1

Petr Zemánek 1

Xiuhong Zhang 1

Lonneke van der Plas 1

Pavel Češka 1

Dominika Ďurišková 1

Valters Šics 1

Zuzana Šimečková 1

Miroslav Štola 1

Jan Štěpánek 1

Vojtěch Švandelík 1

Venues