2024
Enhancing Scientific Discourse: Machine Translation for the Scientific Domain
Dimitris Roussis | Sokratis Sofianopoulos | Stelios Piperidis
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora from the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the research domains of: Energy Research, Neuroscience, Cancer and Transportation. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.
2022
Constructing Parallel Corpora from COVID-19 News using MediSys Metadata
Dimitrios Roussis | Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents a collection of parallel corpora generated by exploiting the COVID-19 related dataset of metadata created with the Europe Media Monitor (EMM) / Medical Information System (MediSys) processing chain of news articles. We describe how we constructed comparable monolingual corpora of news articles related to the current pandemic and used them to mine about 11.2 million segment alignments in 26 EN-X language pairs, covering most official EU languages plus Albanian, Arabic, Icelandic, Macedonian, and Norwegian. Subsets of this collection have been used in shared tasks (e.g. Multilingual Semantic Search, Machine Translation) aimed at accelerating the creation of resources and tools needed to facilitate access to information in the COVID-19 emergency situation.
Welocalize-ARC/NKUA’s Submission to the WMT 2022 Quality Estimation Shared Task
Eirini Zafeiridou | Sokratis Sofianopoulos
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper presents our submission to the WMT 2022 quality estimation shared task and more specifically to the quality prediction sentence-level direct assessment (DA) subtask. We build a multilingual system based on the predictor–estimator architecture by using the XLM-RoBERTa transformer for feature extraction and a regression head on top of the final model to estimate the z-standardized DA labels. Furthermore, we use pretrained models to extract useful knowledge that reflects various criteria of quality assessment and demonstrates good correlation with human judgements. We optimize the performance of our model by incorporating this information as additional external features in the input data and by applying Monte Carlo dropout during both training and inference.
2018
The ILSP/ARC submission to the WMT 2018 Parallel Corpus Filtering Shared Task
Vassilis Papavassiliou | Sokratis Sofianopoulos | Prokopis Prokopidis | Stelios Piperidis
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
This paper describes the submission of the Institute for Language and Speech Processing/Athena Research and Innovation Center (ILSP/ARC) for the WMT 2018 Parallel Corpus Filtering shared task. We describe several properties of sentences and sentence pairs that our system explored in the context of the task with the purpose of clustering sentence pairs according to their appropriateness for training MT systems. We also discuss alternative methods for ranking the sentence pairs of the most appropriate clusters with the aim of generating the two datasets (of 10 and 100 million words, as required in the task) that were evaluated. According to the results of several experiments carried out by the organizers during the evaluation phase, our submission achieved an average BLEU score of 26.41, even though it does not make use of any language-specific resources like bilingual lexica, monolingual corpora, or MT output; the average score of the best participant system was 27.91.
2015
A Data Sharing and Annotation Service Infrastructure
Stelios Piperidis | Dimitrios Galanis | Juli Bakagianni | Sokratis Sofianopoulos
Proceedings of ACL-IJCNLP 2015 System Demonstrations
2014
Expanding the Language model in a low-resource hybrid MT system
George Tambouratzis | Sokratis Sofianopoulos | Marina Vassiliou
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation
2013
A Review of the PRESEMT project
George Tambouratzis | Marina Vassiliou | Sokratis Sofianopoulos
Proceedings of Machine Translation Summit XIV: European projects
Language-independent hybrid MT with PRESEMT
George Tambouratzis | Sokratis Sofianopoulos | Marina Vassiliou
Proceedings of the Second Workshop on Hybrid Approaches to Translation
2012
Evaluating the Translation Accuracy of a Novel Language-Independent MT Methodology
George Tambouratzis | Sokratis Sofianopoulos | Marina Vassiliou
Proceedings of COLING 2012
PRESEMT: Pattern Recognition-based Statistically Enhanced MT
George Tambouratzis | Marina Vassiliou | Sokratis Sofianopoulos
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)
Implementing a Language-Independent MT Methodology
Sokratis Sofianopoulos | Marina Vassiliou | George Tambouratzis
Proceedings of the First Workshop on Multilingual Modeling
2011
A resource-light phrase scheme for language-portable MT
George Tambouratzis | Fotini Simistira | Sokratis Sofianopoulos | Nikos Tsimboukakis | Marina Vassiliou
Proceedings of the 15th Annual Conference of the European Association for Machine Translation
2008
Evaluation of a Machine Translation System for Low Resource Languages: METIS-II
Vincent Vandeghinste | Peter Dirix | Ineke Schuurman | Stella Markantonatou | Sokratis Sofianopoulos | Marina Vassiliou | Olga Yannoutsou | Toni Badia | Maite Melero | Gemma Boleda | Michael Carl | Paul Schmidt
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we describe the METIS-II system and its evaluation on each of the language pairs: Dutch, German, Greek, and Spanish to English. The METIS-II system envisaged developing a data-driven approach in which no parallel corpus is required, and in which no full parser or extensive rule sets are needed. We describe evaluation on a development test set and on a test set coming from Europarl, and compare our results with SYSTRAN. We also provide some further analysis, researching the impact of the number and source of the reference translations and analysing the results according to test text type. The results are expectably lower for the METIS system, but not at an unattainable distance from a mature system like SYSTRAN.
2007
Demonstration of the Greek to English METIS-II system
Sokratis Sofianopoulos | Vassiliki Spilioti | Marina Vassiliou | Olga Yannoutsou | Stella Markantonatou
Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages: Papers
2006
Using Patterns for Machine Translation
Stella Markantonatou | Sokratis Sofianopoulos | Vassiliki Spilioti | George Tambouratzis | Marina Vassiliou | Olga Yannoutsou
Proceedings of the 11th Annual Conference of the European Association for Machine Translation
2005
Monolingual Corpus-based MT Using Chunks
Stella Markantonatou | Sokratis Sofianopoulos | Vassiliki Spilioti | Yiorgos Tambouratzis | Marina Vassiliou | Olga Yannoutsou | Nikos Ioannou
Workshop on example-based machine translation
In the present article, a hybrid approach is proposed for implementing a machine translation system using a large monolingual corpus coupled with a bilingual lexicon and basic NLP tools. In the first phase of the METIS system, a source language (SL) sentence, after being tagged, lemmatised and translated by a flat lemma-to-lemma lexicon, was matched against a tagged and lemmatised target language (TL) corpus using a pattern matching algorithm. In the second phase, translations are generated by combining sub-sentential structures. In this paper, the main features of the second phase are discussed while the system architecture and the corresponding translation approach are presented. The proposed methodology is illustrated with examples of the translation process.