2024
Training an NMT system for legal texts of a low-resource language variety South Tyrolean German - Italian
Antoni Oliver
|
Sergi Alvarez-Vidal
|
Egon Stemle
|
Elena Chiocchetti
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)
This paper illustrates the process of training and evaluating NMT systems for a language pair that includes a low-resource language variety. A parallel corpus of legal texts for Italian and South Tyrolean German has been compiled, with South Tyrolean German being the low-resource language variety. As the compiled corpus is too small for training, we have combined it with several parallel corpora using data weighting at sentence level. We then evaluated each combination, as well as two popular commercial systems.
2023
The MT@BZ corpus: machine translation & legal language
Flavia De Camillis
|
Egon W. Stemle
|
Elena Chiocchetti
|
Francesco Fernicola
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
The paper reports on the creation, annotation and curation of the MT@BZ corpus, a bilingual (Italian–South Tyrolean German) corpus of machine-translated legal texts from the officially multilingual Province of Bolzano, Italy. It is the first human error-annotated corpus (using an adapted SCATE taxonomy) of machine-translated legal texts in this language combination that includes a lesser-used standard variety. The data of the project will be made available on GitHub and another repository. The output of the customized engine achieved notably better BLEU, TER and chrF2 scores than the baseline. Thanks to customization, over 50% of the segments needed no human revision. The most frequent error categories were mistranslations and bilingual (legal) terminology errors. Our contribution brings fine-grained insights to machine translation evaluation research, as it concerns a less common language combination, a lesser-used language variety and a societally relevant specialized domain. Such results are necessary to implement and inform the use of MT in institutional contexts of smaller language communities.
2020
Proceedings of the 12th Web as Corpus Workshop
Adrien Barbaresi
|
Felix Bildhauer
|
Roland Schäfer
|
Egon Stemle
Proceedings of the 12th Web as Corpus Workshop
A Report on the 2020 VUA and TOEFL Metaphor Detection Shared Task
Chee Wee (Ben) Leong
|
Beata Beigman Klebanov
|
Chris Hamill
|
Egon Stemle
|
Rutuja Ubale
|
Xianyang Chen
Proceedings of the Second Workshop on Figurative Language Processing
In this paper, we report on the shared task on metaphor identification on the VU Amsterdam Metaphor Corpus and on a subset of the TOEFL Native Language Identification Corpus. The shared task was conducted as a part of the ACL 2020 Workshop on Processing Figurative Language.
Testing the role of metadata in metaphor identification
Egon Stemle
|
Alexander Onysko
Proceedings of the Second Workshop on Figurative Language Processing
This paper describes the adaptation and application of a neural network system for the automatic detection of metaphors. The LSTM BiRNN system participated in the shared task of metaphor identification that was part of the Second Workshop on Figurative Language Processing (FigLang2020) held at the Annual Conference of the Association for Computational Linguistics (ACL2020). The particular focus of our approach is on the potential influence that the metadata given in the ETS Corpus of Non-Native Written English might have on the automatic detection of metaphors in this dataset. The article first discusses the annotated ETS learner data, highlighting some of its peculiarities and inherent biases of metaphor use. A series of evaluations follows in order to test whether specific metadata influence the system performance in the task of automatic metaphor identification. The system is available under the APLv2 open-source license.
2018
Using Language Learner Data for Metaphor Detection
Egon Stemle
|
Alexander Onysko
Proceedings of the Workshop on Figurative Language Processing
This article describes the system that participated in the shared task on metaphor detection on the Vrije Universiteit Amsterdam Metaphor Corpus (VUA). The shared task was part of the workshop on processing figurative language at the 16th annual conference of the North American Chapter of the Association for Computational Linguistics (NAACL2018). The system combines a small assortment of trending techniques, which implement matured methods from NLP and ML; in particular, the system uses word embeddings from standard corpora and from corpora representing different proficiency levels of language learners in an LSTM BiRNN architecture. The system is available under the APLv2 open-source license.
2016
Proceedings of the 10th Web as Corpus Workshop
Paul Cook
|
Stefan Evert
|
Roland Schäfer
|
Egon Stemle
Proceedings of the 10th Web as Corpus Workshop
bot.zen @ EmpiriST 2015 - A minimally-deep learning PoS-tagger (trained for German CMC and Web data)
Egon Stemle
Proceedings of the 10th Web as Corpus Workshop
2014
‘interHist’ - an interactive visual interface for corpus exploration
Verena Lyding
|
Lionel Nicolas
|
Egon Stemle
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this article, we present interHist, a compact visualization for the interactive exploration of results to complex corpus queries. Integrated with a search interface to the PAISA corpus of Italian web texts, interHist aims at facilitating the exploration of large result sets to linguistic corpus searches. This objective is approached by providing an interactive visual overview of the data, which supports user-steered navigation by means of interactive filtering. It allows users to switch dynamically between an overview of the data and a detailed view of results in their immediate textual context, thus helping to detect and inspect relevant hits more efficiently. We provide background information on corpus linguistics and related work on visualizations for language and linguistic data. We introduce the architecture of interHist by detailing the data structure it relies on, describing the visualization design and providing technical details of the implementation and its integration with the corpus querying environment. Finally, we illustrate its usage by presenting a use case for the analysis of the composition of Italian noun phrases.
KoKo: an L1 Learner Corpus for German
Andrea Abel
|
Aivars Glaznieks
|
Lionel Nicolas
|
Egon Stemle
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
We introduce the KoKo corpus, a collection of German L1 learner texts annotated with learner errors, along with the methods and tools used in its construction and evaluation. The corpus contains both texts and corresponding survey information from 1,319 pupils and amounts to around 716,000 tokens. The evaluation of the performed transcriptions and annotations shows an accuracy of orthographic error annotations of approximately 80% as well as high accuracies of transcriptions (>99%), automatic tokenisation (>99%), sentence splitting (>96%) and POS-tagging (>94%). The KoKo corpus will be published at the end of 2014. It will be the first accessible linguistically annotated German L1 learner corpus and a valuable source for research on L1 learner language as well as for teachers of German as L1, in particular with regard to writing skills.
The PAISÀ Corpus of Italian Web Texts
Verena Lyding
|
Egon Stemle
|
Claudia Borghetti
|
Marco Brunello
|
Sara Castagnoli
|
Felice Dell’Orletta
|
Henrik Dittmann
|
Alessandro Lenci
|
Vito Pirrelli
Proceedings of the 9th Web as Corpus Workshop (WaC-9)
2013
High-Accuracy Phrase Translation Acquisition Through Battle-Royale Selection
Lionel Nicolas
|
Egon W. Stemle
|
Klara Kranebitter
|
Verena Lyding
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013
2012
Annotating Archaeological Texts: An Example of Domain-Specific Annotation in the Humanities
Francesca Bonin
|
Fabio Cavulli
|
Aronne Noriller
|
Massimo Poesio
|
Egon W. Stemle
Proceedings of the Sixth Linguistic Annotation Workshop
2011
Structure-Preserving Pipelines for Digital Libraries
Massimo Poesio
|
Eduard Barbu
|
Egon Stemle
|
Christian Girardi
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
PaddyWaC: A Minimally-Supervised Web-Corpus of Hiberno-English
Brian Murphy
|
Egon W. Stemle
Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties
2010
Anaphoric Annotation of Wikipedia and Blogs in the Live Memories Corpus
Kepa Joseba Rodríguez
|
Francesca Delogu
|
Yannick Versley
|
Egon W. Stemle
|
Massimo Poesio
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The Live Memories corpus is an Italian corpus annotated for anaphoric relations. This annotation effort aims to contribute to two significant issues for CL research: the lack of annotated anaphoric resources for Italian and the increasing interest in the social Web. The Live Memories Corpus contains texts from the Italian Wikipedia about the region Trentino/Süd Tirol and from blog sites with users' comments. A set of articles from local newspapers is planned to be added. The corpus includes manually annotated information about morphosyntactic agreement, anaphoricity, and semantic class of the NPs. The anaphoric annotation includes discourse deixis and bridging relations, and marks cases of ambiguity with the annotation of alternative interpretations. For the annotation of the anaphoric links, the corpus takes into account specific phenomena of the Italian language such as incorporated clitics and phonetically non-realized pronouns. Reliability studies for the annotation of the mentioned phenomena and for the annotation of anaphoric links in general offer satisfactory results. The Wikipedia and blogs dataset will be distributed under a Creative Commons Attribution licence.