Vit Suchomel

Also published as: Vít Suchomel


2022

pdf bib
MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Marta Bañón | Miquel Esplà-Gomis | Mikel L. Forcada | Cristian García-Romero | Taja Kuzman | Nikola Ljubešić | Rik van Noord | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Peter Rupnik | Vít Suchomel | Antonio Toral | Tobias van der Werff | Jaume Zaragoza
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

We introduce the project “MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages”, funded by the Connecting Europe Facility, which is aimed at building monolingual and parallel corpora for under-resourced European languages. The approach followed consists of crawling large amounts of textual data from carefully selected top-level domains of the Internet, and then applying a curation and enrichment pipeline. In addition to corpora, the project will release successive versions of the free/open-source web crawling and curation software used.

2020

pdf bib
Current Challenges in Web Corpus Building
Miloš Jakubíček | Vojtěch Kovář | Pavel Rychlý | Vit Suchomel
Proceedings of the 12th Web as Corpus Workshop

In this paper we discuss some of the current challenges in web corpus building that we faced in the recent years when expanding the corpora in Sketch Engine. The purpose of the paper is to provide an overview and raise discussion on possible solutions, rather than bringing ready solutions to the readers. For every issue we try to assess its severity and briefly discuss possible mitigation options.

2016

pdf bib
DSL Shared Task 2016: Perfect Is The Enemy of Good Language Discrimination Through Expectation–Maximization and Chunk-based Language Model
Ondřej Herman | Vít Suchomel | Vít Baisa | Pavel Rychlý
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

In this paper we investigate two approaches to discrimination of similar languages: Expectation–maximization algorithm for estimating conditional probability P(word|language) and byte level language models similar to compression-based language modelling methods. The accuracy of these methods reached respectively 86.6% and 88.3% on set A of the DSL Shared task 2016 competition.

2014

pdf bib
HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation
Ondřej Bojar | Vojtěch Diatka | Pavel Rychlý | Pavel Straňák | Vít Suchomel | Aleš Tamchyna | Daniel Zeman
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both the corpora are freely available for non-commercial research and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.

pdf bib
Finding Terms in Corpora for Many Languages with the Sketch Engine
Miloš Jakubíček | Adam Kilgarriff | Vojtěch Kovář | Pavel Rychlý | Vít Suchomel
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics