2024
Two Approaches to Diachronic Normalization of Polish Texts
Kacper Dudzic | Filip Gralinski | Krzysztof Jassem | Marek Kubis | Piotr Wierzchon
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
This paper discusses two approaches to the diachronic normalization of Polish texts: a rule-based solution that relies on a set of handcrafted patterns, and a neural normalization model based on the text-to-text transfer transformer architecture. The training and evaluation data prepared for the task are discussed in detail, along with experiments conducted to compare the proposed normalization solutions. A quantitative and qualitative analysis is made. It is shown that at the current stage of inquiry into the problem, the rule-based solution outperforms the neural one on 3 out of 4 variants of the prepared dataset, although in practice both approaches have distinct advantages and disadvantages.
kubapok@LT-EDI 2024: Evaluating Transformer Models for Hate Speech Detection in Tamil
Jakub Pokrywka | Krzysztof Jassem
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion
We describe the second-place submission for the shared task organized at the Fourth Workshop on Language Technology for Equality, Diversity, and Inclusion (LT-EDI-2024). The task focuses on detecting caste/migration hate speech in Tamil. The included texts involve the Tamil language in both Tamil script and transliterated into Latin script, with some texts also in English. Considering different scripts, we examined the performance of 12 transformer language models on the dev set. Our analysis revealed that for the whole dataset, the model google/muril-large-cased performs the best. We used an ensemble of several models for the final challenge submission, achieving 0.81 for the test dataset.
2022
nEYron: Implementation and Deployment of an MT System for a Large Audit & Consulting Corporation
Artur Nowakowski | Krzysztof Jassem | Maciej Lison | Rafał Jaworski | Tomasz Dwojak | Karolina Wiater | Olga Posesor
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
This paper reports on the implementation and deployment of an MT system in the Polish branch of EY Global Limited. The system supports standard CAT and MT functionalities such as translation memory fuzzy search, document translation and post-editing, and meets less common, customer-specific expectations. The deployment began in August 2018 with a Proof of Concept, and ended with the signing of the Final Version acceptance certificate in October 2021. We present the challenges that were faced during the deployment, particularly in relation to the security check and installation processes in the production environment.
POLENG MT: An Adaptive MT Platform
Artur Nowakowski | Krzysztof Jassem | Maciej Lison | Kamil Guttmann | Mikołaj Pokrywka
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation
We introduce POLENG MT, an MT platform that may be used as a cloud web application or as an on-site solution. The platform is capable of providing accurate document translation, including the transfer of document formatting between the input document and the output document. The main feature of the on-site version is dedicated customer adaptation, which consists of training on specialized texts and applying forced terminology translation according to the user’s needs.
Challenging America: Modeling language in longer time scales
Jakub Pokrywka | Filip Graliński | Krzysztof Jassem | Karol Kaczmarek | Krzysztof Jurkiewicz | Piotr Wierzchon
Findings of the Association for Computational Linguistics: NAACL 2022
The aim of the paper is to apply, for historical texts, the methodology used commonly to solve various NLP tasks defined for contemporary data, i.e. pre-train and fine-tune large Transformer models. This paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin, and paired with their textual contents retrieved by an OCR tool. Three, publicly available, ML tasks are defined in the challenge: to determine the article date, to detect the location of the issue, and to deduce a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch from the historical texts. We also discuss the issues of discrimination and hate-speech present in the historical American texts.
2021
Neural Machine Translation with Inflected Lexicon
Artur Nowakowski | Krzysztof Jassem
Proceedings of Machine Translation Summit XVIII: Research Track
The paper presents experiments in neural machine translation with lexical constraints into a morphologically rich language. In particular, we introduce a method, based on constrained decoding, which handles the inflected forms of lexical entries and does not require any modification to the training data or model architecture. To evaluate its effectiveness, we carry out experiments in two different scenarios: general and domain-specific. We compare our method with baseline translation, i.e. translation without lexical constraints, in terms of translation speed and translation quality. To evaluate how well the method handles the constraints, we propose new evaluation metrics which take into account the presence, placement, duplication and inflectional correctness of lexical terms in the output sentence.
Neural Translator Designed to Protect the Eastern Border of the European Union
Artur Nowakowski | Krzysztof Jassem
Proceedings of Machine Translation Summit XVIII: Users and Providers Track
This paper reports on a translation engine designed for the needs of the Polish State Border Guard. The engine is a component of the AI Searcher system, whose aim is to search for Internet texts, written in Polish, Russian, Ukrainian or Belarusian, which may lead to criminal acts at the eastern border of the European Union. The system is intended for Polish users, and the translation engine should serve to assist understanding of non-Polish documents. The engine was trained on general-domain texts. The adaptation for the criminal domain consisted in the appropriate translation of criminal terms and proper names, such as forenames, surnames and geographical objects. The translation process needs to take into account the rich inflection found in all of the languages of interest. To this end, a method based on constrained decoding that incorporates an inflected lexicon into a neural translation process was applied in the engine.
2009
An Environment for Named Entity Recognition and Translation
Filip Graliński | Krzysztof Jassem | Michał Marcińczuk
Proceedings of the 13th Annual Conference of the European Association for Machine Translation
2004
Applying Oxford-PWN English-Polish dictionary to machine translation
Krzysztof Jassem
Proceedings of the 9th EAMT Workshop: Broadening horizons of machine translation and its applications
2000
POLENG–Adjusting a Rule-Based Polish–English Machine Translation System by Means of Corpus Analysis
Krzysztof Jassem | Filip Graliński | Grzegorz Krynicki
5th EAMT Workshop: Harvesting Existing Resources
1997
A Polish-to-English Text-to-text Translation System Based on an Electronic Dictionary
Krzysztof Jassem
Spoken Language Translation