2024
Advancing Digital Language Equality in Europe: A Market Study and Open-Source Solutions for Multilingual Websites
Andrejs Vasiljevs
|
Rinalds Vīksna
|
Neil Vacheva
|
Andis Lagzdiņš
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
The paper presents findings from a comprehensive market study commissioned by the European Commission, aimed at analysing the multilinguality of European websites and automated website translation services across various sectors. The findings show that the majority of websites offer content in one or two languages, while fewer than 25% of European websites provide content in three or more languages. Additionally, we introduce Web-T, a collection of open-source solutions facilitating automated website translation with the help of eTranslation, a free MT service provided by the European Commission, and with the possibility of integrating other MT providers. Web-T solutions include local plug-ins for Content Management Systems, universal plug-ins, and an MT API Integrator, thus contributing to the broader goal of digital language equality in Europe.
Annotations for Exploring Food Tweets from Multiple Aspects
Matiss Rikters
|
Rinalds Vīksna
|
Edison Marrese-Taylor
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This research builds upon the Latvian Twitter Eater Corpus (LTEC), which is focused on the narrow domain of tweets related to food, drinks, eating and drinking. LTEC has been collected for more than 12 years and has reached almost 3 million tweets, with basic information as well as extended automatically and manually annotated metadata. In this paper we supplement the LTEC with manually annotated subsets of evaluation data for machine translation, named entity recognition, timeline-balanced sentiment analysis, and text-image relation classification. We experiment with each of the data sets using baseline models and highlight future challenges for various modelling approaches.
MultiLeg: Dataset for Text Sanitisation in Less-resourced Languages
Rinalds Vīksna
|
Inguna Skadiņa
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Text sanitization is the task of detecting and removing personal information from text. While it has been well studied in monolingual settings, there is now also a need for multilingual text sanitization. In this paper, we introduce MultiLeg: a parallel, multilingual named entity (NE) dataset consisting of documents from the Court of Justice of the European Union annotated with semantic categories suitable for text sanitization. The dataset is available in 8 languages, and it contains 3082 parallel text segments for each language. We also show that the pseudonymized dataset remains useful for downstream tasks.
2023
Large Language Models for Multilingual Slavic Named Entity Linking
Rinalds Vīksna
|
Inguna Skadiņa
|
Daiga Deksne
|
Roberts Rozis
Proceedings of the 9th Workshop on Slavic Natural Language Processing 2023 (SlavicNLP 2023)
This paper describes our submission for the 4th Shared Task on SlavNER on three Slavic languages: Czech, Polish and Russian. We use the pre-trained multilingual XLM-R language model (Conneau et al., 2020) and fine-tune it for the three Slavic languages using datasets provided by the organizers. Our multilingual NER model achieves an F-score of 0.896 on all corpora, with the best result for Czech (0.914) and the worst for Russian (0.880). Our cross-language entity linking module achieves an F-score of 0.669 in the official SlavNER 2023 evaluation.
2022
Assessing Multilinguality of Publicly Accessible Websites
Rinalds Vīksna
|
Inguna Skadiņa
|
Raivis Skadiņš
|
Andrejs Vasiļjevs
|
Roberts Rozis
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Although information on the Internet can be shared in many languages, the language presence on the World Wide Web is very disproportionate. The problem of multilingualism on the Web, in particular access, availability and quality of information in the world’s languages, has been the subject of UNESCO focus for several decades. Making European websites more multilingual is also one of the focal targets of the Connecting Europe Facility Automated Translation (CEF AT) digital service infrastructure. In order to monitor this goal, alongside other possible solutions, CEF AT needs a methodology and an easy-to-use tool to assess the degree of multilingualism of a given website. In this paper we investigate methods and tools that automatically analyse the language diversity of the Web and propose indicators and a methodology for measuring the multilingualism of European websites. We also introduce a prototype tool based on open-source software that helps to assess multilingualism of the Web and can be independently run at set intervals. We also present initial results obtained with our tool, which allow us to conclude that multilingualism on the Web is still a problem not only at the world level, but also at the European and regional levels.
2021
Multilingual Slavic Named Entity Recognition
Rinalds Vīksna
|
Inguna Skadina
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Named entity recognition, in particular for morphologically rich languages, is a challenging task due to the richness of inflected forms and ambiguity. This challenge is being addressed by the SlavNER Shared Task. In this paper we describe the system submitted to this task. Our system uses a pre-trained multilingual BERT language model and is fine-tuned for the six Slavic languages of this task on texts distributed by the organizers. In our experiments this multilingual NER model achieved an F1 score of 96 on in-domain data and an F1 score of 83 on out-of-domain data. The entity coreference module achieved an F1 score of 47.6 as evaluated by the bsnlp2021 organizers.