Jindřich Helcl - ACL Anthology

Jindřich Helcl

Also published as: Jindrich Helcl

2026

Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors
Adnan Al Ali | Jindřich Helcl | Jindřich Libovický
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.

OpenLID-v3: Improving the Precision of Closely Related Language Identification – An Experience Report
Mariia Fedorova | Nikolay Arefyev | Maja Buljan | Jindřich Helcl | Stephan Oepen | Egil Rønningstad | Yves Scherrer
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects

Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During the development we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages.

2025

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

Atyaephyra at SemEval-2025 Task 4: Low-Rank Negative Preference Optimization
Jan Bronec | Jindřich Helcl
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We present a submission to the SemEval 2025 shared task on unlearning sensitive content from LLMs. Our approach employs negative preference optimization using low-rank adaptation. We show that we can utilize this combination to cheaply compute additional regularization terms, which help with unlearning stabilization. The results of our approach significantly exceed the shared task baselines.

We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The very high number of submissions highlights the interest of the community in hallucination detection. We present the results of the participating systems and provide an empirical analysis in order to better understand the factors that can lead to strong performance in this task. We also underscore current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.

We describe the progress of the High Performance Language Technologies (HPLT) project, a 3-year EU-funded project that started in September 2022. We focus on the up-to-date results on the release of free text datasets derived from web crawls, one of the central objectives of the project. The second release used a revised processing pipeline, and an enlarged set of input crawls. From 4.5 petabytes of web crawls we extracted 7.6T tokens of monolingual text in 193 languages, plus 380 million parallel sentences in 51 language pairs. We also release MultiHPLT, a cross-combination of the parallel data, which produces 1,275 pairs, as well as releasing the containing documents for all parallel sentences in order to enable research in document-level MT. We report changes in the pipeline, analysis and evaluation results for the second parallel data release based on machine translation systems. All datasets are released under a permissive CC0 licence.

2024

Lexically Grounded Subword Segmentation
Jindřich Libovický | Jindřich Helcl
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure considers lexical meaning. Third, we introduce an efficient segmentation algorithm based on a subword bigram model that can be initialized with the lexically aware segmentation method to avoid using Morfessor and large embedding tables at inference time. We evaluate the proposed approaches using two intrinsic metrics and measure their performance on two downstream tasks: part-of-speech tagging and machine translation. Our experiments show significant improvements in the morphological plausibility of the segmentation when evaluated using segmentation precision on morpheme boundaries and improved Rényi efficiency in 8 languages. Although the proposed tokenization methods do not have a large impact on automatic translation quality, we observe consistent performance gains in the arguably more morphological task of part-of-speech tagging.

Charles Translator: A Machine Translation System between Ukrainian and Czech
Martin Popel | Lucie Polakova | Michal Novák | Jindřich Helcl | Jindřich Libovický | Pavel Straňák | Tomas Krabac | Jaroslava Hlavacova | Mariia Anisimova | Tereza Chlanova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present Charles Translator, a machine translation system between Ukrainian and Czech, developed as part of a society-wide effort to mitigate the impact of the Russian-Ukrainian war on individuals and society. The system was developed in the spring of 2022 with the help of many language data providers in order to quickly meet the demand for such a service, which was not available at the time in the required quality. The translator was later implemented as an online web interface and as an Android app with speech input, both featuring Cyrillic-Latin script transliteration. The system translates directly, in contrast to other available systems that use English as a pivot, and thus takes advantage of the typological similarity of the two languages. It uses the block back-translation method, which allows for efficient use of monolingual training data. The paper describes the development process, including data collection, implementation, and evaluation, mentions several use cases, and outlines possibilities for further development of the system for educational purposes.

HPLT’s First Release of Data and Models
Nikolay Arefyev | Mikko Aulamo | Pinzhen Chen | Ona de Gibert | Barry Haddow | Jindřich Helcl | Bhavitvya Malik | Gema Ramírez-Sánchez | Pavel Stepachev | Jörg Tiedemann | Dušan Variš | Jaume Zaragoza
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 2)

The High Performance Language Technologies (HPLT) project is a 3-year EU-funded project that started in September 2022. It aims to deliver free, sustainable, and reusable datasets, models, and workflows at scale using high-performance computing. We describe the first results of the project. The data release includes monolingual data in 75 languages at 5.6T tokens and parallel data in 18 language pairs at 96M pairs, derived from 1.8 petabytes of web crawls. Building upon automated and transparent pipelines, the first machine translation (MT) models as well as large language models (LLMs) have been trained and released. Multiple data processing tools and pipelines have also been made public.

Teaching LLMs at Charles University: Assignments and Activities
Jindřich Helcl | Zdeněk Kasner | Ondřej Dušek | Tomasz Limisiewicz | Dominik Macháček | Tomáš Musil | Jindřich Libovický
Proceedings of the Sixth Workshop on Teaching NLP

This paper presents teaching materials, particularly assignments and ideas for classroom activities, from a new course on large language models. The assignments include experiments with LLM inference for weather report generation and machine translation. The classroom activities include class quizzes, focused research on downstream tasks and datasets, and an interactive “best paper” session aimed at reading and comprehension of research papers.

CUNI and LMU Submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval
Katharina Hämmerl | Andrei-Alexandru Manea | Gianluca Vico | Jindřich Helcl | Jindřich Libovický
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

We present the joint CUNI and LMU submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval. The shared task objective was to explore how we can deploy modern methods in NLP in multi-lingual low-resource settings, tested on two sub-tasks: Named-entity recognition and question answering. Our solutions to the subtasks are based on data acquisition and model adaptation. We compare the performance of our submitted systems with the translate-test approach, which proved to be the most useful in the previous edition of the shared task. Our results show that using more data as well as fine-tuning recent multilingual pre-trained models leads to considerable improvements over the translate-test baseline. Our code is available at https://github.com/ufal/mrl2024-multilingual-ir-shared-task.

2023

CUNI Submission to MRL 2023 Shared Task on Multi-lingual Multi-task Information Retrieval
Jindřich Helcl | Jindřich Libovický
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)

2022

Non-Autoregressive Machine Translation: It’s Not as Fast as it Seems
Jindřich Helcl | Barry Haddow | Alexandra Birch
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Efficient machine translation models are commercially important as they can increase inference speeds, and reduce costs and carbon emissions. Recently, there has been much interest in non-autoregressive (NAR) models, which promise faster translation. In parallel to the research on NAR models, there have been successful attempts to create optimized autoregressive models as part of the WMT shared task on efficient translation. In this paper, we point out flaws in the evaluation methodology present in the literature on NAR models and we provide a fair comparison between a state-of-the-art NAR model and the autoregressive submissions to the shared task. We make the case for consistent evaluation of NAR models, and also for the importance of comparing NAR models with other widely used methods for improving efficiency. We run experiments with a connectionist-temporal-classification-based (CTC) NAR model implemented in C++ and compare it with AR models using wall clock times. Our results show that, although NAR models are faster on GPUs, with small batch sizes, they are almost always slower under more realistic usage conditions. We call for more realistic and extensive evaluation of NAR models in future work.

Survey of Low-Resource Machine Translation
Barry Haddow | Rachel Bawden | Antonio Valerio Miceli Barone | Jindřich Helcl | Alexandra Birch
Computational Linguistics, Volume 48, Issue 3 - September 2022

We present a survey covering the state of the art in low-resource machine translation (MT) research. There are currently around 7,000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models. There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available. We present a summary of this topical research field and provide a description of the techniques evaluated by researchers in several recent shared tasks in low-resource MT.

CUNI Non-Autoregressive System for the WMT 22 Efficient Translation Shared Task
Jindřich Helcl
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present a non-autoregressive system submission to the WMT 22 Efficient Translation Shared Task. Our system was used by Helcl et al. (2022) in an attempt to provide a fair comparison between non-autoregressive and autoregressive models. This submission is an effort to establish solid baselines along with sound evaluation methodology, particularly in terms of measuring the decoding speed. The model itself is a 12-layer Transformer model trained with connectionist temporal classification on a knowledge-distilled dataset produced by a strong autoregressive teacher model.

CUNI Systems for the WMT 22 Czech-Ukrainian Translation Task
Martin Popel | Jindřich Libovický | Jindřich Helcl
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present Charles University submissions to the WMT 22 General Translation Shared Task on Czech-Ukrainian and Ukrainian-Czech machine translation. We present two constrained submissions based on block back-translation and tagged back-translation and experiment with rule-based romanization of Ukrainian. Our results show that the romanization only has a minor effect on the translation quality. Further, we describe Charles Translator, a system that was developed in March 2022 as a response to the migration from Ukraine to the Czech Republic. Compared to our constrained systems, it did not use the romanization and used some proprietary data sources.

2021

In the media industry, the focus of global reporting can shift overnight. There is a compelling need to be able to develop new machine translation systems in a short period of time, in order to more efficiently cover quickly developing stories. As part of the EU project GoURMET, which focusses on low-resource machine translation, our media partners selected a surprise language for which a machine translation system had to be built and evaluated in two months (February and March 2021). The language selected was Pashto, an Indo-Iranian language spoken in Afghanistan, Pakistan and India. In this period we completed the full pipeline of development of a neural machine translation system: data crawling, cleaning, aligning, creating test sets, developing and testing models, and delivering them to the user partners. In this paper we describe rapid data creation and experiments with transfer learning and pretraining for this low-resource language pair. We find that starting from an existing large model pre-trained on 50 languages leads to far better BLEU scores than pretraining on one high-resource language pair with a smaller model. We also present human evaluation of our systems, which indicates that the resulting systems perform better than a freely available commercial system when translating from English into Pashto direction, and similarly when translating from Pashto into English.

The University of Edinburgh’s English-German and English-Hausa Submissions to the WMT21 News Translation Task
Pinzhen Chen | Jindřich Helcl | Ulrich Germann | Laurie Burchell | Nikolay Bogoychev | Antonio Valerio Miceli Barone | Jonas Waldendorf | Alexandra Birch | Kenneth Heafield
Proceedings of the Sixth Conference on Machine Translation

This paper presents the University of Edinburgh’s constrained submissions of English-German and English-Hausa systems to the WMT 2021 shared task on news translation. We build En-De systems in three stages: corpus filtering, back-translation, and fine-tuning. For En-Ha we use an iterative back-translation approach on top of pre-trained En-De models and investigate vocabulary embedding mapping.

2020

Expand and Filter: CUNI and LMU Systems for the WNGT 2020 Duolingo Shared Task
Jindřich Libovický | Zdeněk Kasner | Jindřich Helcl | Ondřej Dušek
Proceedings of the Fourth Workshop on Neural Generation and Translation

We present our submission to the Simultaneous Translation And Paraphrase for Language Education (STAPLE) challenge. We used a standard Transformer model for translation, with a crosslingual classifier predicting correct translations on the output n-best list. To increase the diversity of the outputs, we used additional data to train the translation model, and we trained a paraphrasing model based on the Levenshtein Transformer architecture to generate further synonymous translations. The paraphrasing results were again filtered using our classifier. While the use of additional data and our classifier filter were able to improve results, the paraphrasing model produced too many invalid outputs to further improve the output quality. Our model without the paraphrasing component finished in the middle of the field for the shared task, improving over the best baseline by a margin of 10-22 % weighted F1 absolute.

2019

CUNI System for the WMT19 Robustness Task
Jindřich Helcl | Jindřich Libovický | Martin Popel
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We present our submission to the WMT19 Robustness Task. Our baseline system is the Charles University (CUNI) Transformer system trained for the WMT18 shared task on News Translation. Quantitative results show that the CUNI Transformer system is already far more robust to noisy input than the LSTM-based baseline provided by the task organizers. We further improved the performance of our model by fine-tuning on the in-domain noisy data without influencing the translation quality on the news domain.

2018

CUNI System for the WMT18 Multimodal Translation Task
Jindřich Helcl | Jindřich Libovický | Dušan Variš
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We present our submission to the WMT18 Multimodal Translation Task. The main feature of our submission is applying a self-attentive network instead of a recurrent neural network. We evaluate two methods of incorporating the visual features in the model: first, we include the image representation as another input to the network; second, we train the model to predict the visual features and use it as an auxiliary objective. For our submission, we acquired both textual and multimodal additional data. Both of the proposed methods yield significant improvements over recurrent networks and self-attentive textual baselines.

End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification
Jindřich Libovický | Jindřich Helcl
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Autoregressive decoding is the only part of sequence-to-sequence models that prevents them from massive parallelization at inference time. Non-autoregressive models enable the decoder to generate all output symbols independently in parallel. We present a novel non-autoregressive architecture based on connectionist temporal classification and evaluate it on the task of neural machine translation. Unlike other non-autoregressive methods which operate in several steps, our model can be trained end-to-end. We conduct experiments on the WMT English-Romanian and English-German datasets. Our models achieve a significant speedup over the autoregressive models, keeping the translation quality comparable to other non-autoregressive models.

Neural Monkey: The Current State and Beyond
Jindřich Helcl | Jindřich Libovický | Tom Kocmi | Tomáš Musil | Ondřej Cífka | Dušan Variš | Ondřej Bojar
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

Input Combination Strategies for Multi-Source Transformer Decoder
Jindřich Libovický | Jindřich Helcl | David Mareček
Proceedings of the Third Conference on Machine Translation: Research Papers

In multi-source sequence-to-sequence tasks, the attention mechanism can be modeled in several ways. This topic has been thoroughly studied on recurrent architectures. In this paper, we extend the previous work to the encoder-decoder attention in the Transformer architecture. We propose four different input combination strategies for the encoder-decoder attention: serial, parallel, flat, and hierarchical. We evaluate our methods on tasks of multimodal translation and translation with multiple source languages. The experiments show that the models are able to use multiple sources and improve over single source baselines.

2017

Results of the WMT17 Neural MT Training Task
Ondřej Bojar | Jindřich Helcl | Tom Kocmi | Jindřich Libovický | Tomáš Musil
Proceedings of the Second Conference on Machine Translation

Deep architectures for Neural Machine Translation
Antonio Valerio Miceli Barone | Jindřich Helcl | Rico Sennrich | Barry Haddow | Alexandra Birch
Proceedings of the Second Conference on Machine Translation

Attention Strategies for Multi-Source Sequence-to-Sequence Learning
Jindřich Libovický | Jindřich Helcl
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Modeling attention in neural multi-source sequence-to-sequence learning remains a relatively unexplored area, despite its usefulness in tasks that incorporate multiple source languages or modalities. We propose two novel approaches to combine the outputs of attention mechanisms over each source sequence, flat and hierarchical. We compare the proposed methods with existing techniques and present results of systematic evaluation of those methods on the WMT16 Multimodal Translation and Automatic Post-editing tasks. We show that the proposed methods achieve competitive results on both tasks.

CUNI System for the WMT17 Multimodal Translation Task
Jindřich Helcl | Jindřich Libovický
Proceedings of the Second Conference on Machine Translation

2016

Deeper Machine Translation and Evaluation for German
Eleftherios Avramidis | Vivien Macketanz | Aljoscha Burchardt | Jindrich Helcl | Hans Uszkoreit
Proceedings of the 2nd Deep Machine Translation Workshop

CUNI System for WMT16 Automatic Post-Editing and Multimodal Translation Tasks
Jindřich Libovický | Jindřich Helcl | Marek Tlustý | Ondřej Bojar | Pavel Pecina
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

UFAL Submissions to the IWSLT 2016 MT Track
Ondřej Bojar | Ondřej Cífka | Jindřich Helcl | Tom Kocmi | Roman Sudarikov
Proceedings of the 13th International Conference on Spoken Language Translation

We present our submissions to the IWSLT 2016 machine translation task, as our first attempt to translate subtitles and one of our early experiments with neural machine translation (NMT). We focus primarily on the English→Czech translation direction but also perform basic adaptation experiments for NMT with German and also the reverse direction. Three MT systems are tested: (1) our Chimera, a tight combination of phrase-based MT and deep linguistic processing, (2) Neural Monkey, our implementation of an NMT system in TensorFlow and (3) Nematus, an established NMT system.

Co-authors

Ondřej Bojar 4

Antonio Valerio Miceli-Barone 4

Jörg Tiedemann 4

Ona de Gibert 4

Laurie Burchell 3

Mariia Fedorova 3

Liane Guillou 3

Bhavitvya Malik 3

Tomáš Musil 3

Stephan Oepen 3

Gema Ramírez-Sánchez 3

Pavel Stepachev 3

Marta Bañón 2

Ondřej Cífka 2

Ondřej Dušek 2

Erik Henriksson 2

Zdeněk Kasner 2

Andrey Kutuzov 2

Veronika Laippala 2

Farrokh Mehryary 2

Vladislav Mikhailov 2

Amanda Myntti 2

Dayyán O’Brien 2

Sampo Pyysalo 2

Jonas Waldendorf 2

Jaume Zaragoza-Bernabeu 2

Mariia Anisimova 1

Marianna Apidianaki 1

Joseph Attieh 1

Eleftherios Avramidis 1

Rachel Bawden 1

Jaione Bengoetxea 1

Nikolay Bogoychev 1

Aljoscha Burchardt 1

Tereza Chlanova 1

Miquel Esplà-Gomis 1

Mikel L. Forcada 1

Ulrich Germann 1

Kenneth Heafield 1

Jaroslava Hlaváčová 1

Katharina Hämmerl 1

Jussi Karlgren 1

Mateusz Klimaszewski 1

Ville Komulainen 1

Joona Kytöniemi 1

Tomasz Limisiewicz 1

Dominik Macháček 1

Vivien Macketanz 1

Kay Macquarrie 1

Andrei-Alexandru Manea 1

David Mareček 1

Timothee Mickus 1

Petter Mæhlum 1

Michal Novák 1

Lucie Poláková 1

Juan Antonio Pérez-Ortiz 1

Alessandro Raganato 1

Egil Rønningstad 1

Fernando Sanchez-Vega 1

Sevi Sariisik 1

Yves Scherrer 1

Vincent Segonne 1

Rico Sennrich 1

Pavel Straňák 1

Roman Sudarikov 1

Víctor Sánchez-Cartagena 1

Felipe Sánchez-Martínez 1

Marek Tlustý 1

Hans Uszkoreit 1

Teemu Vahtola 1

Gianluca Vico 1

Tereza Vojtěchová 1

Raúl Vázquez 1

Jaume Zaragoza 1

Peggy van der Kreeft 1
