2024
Could Style Help Plagiarism Detection? - A Sample-based Quantitative Study of Correlation between Style Specifics and Plagiarism
Adile Uka, Maria Berger
Proceedings of the 16th International Conference on Computational Processing of Portuguese - Vol. 2
Applying Transfer Learning to German Metaphor Prediction
Maria Berger, Nieke Kiwitt, Sebastian Reimann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents results on transfer learning for metaphor recognition in German. Starting from an English-language corpus annotated for metaphor at the sentence level, and its machine translation into German, we annotate 1000 sentences of the German part to use them as a gold standard for two different metaphor prediction setups: i) a sequence labeling setup (at the token level), and ii) a classification setup (at the sentence level). We test two transfer learning approaches: i) a group of transformer models, and ii) a technique that uses bilingual embeddings together with an RNN classifier. We find that the transformer models perform moderately well in a zero-shot scenario (up to 61% F1 for classification), while the embedding approaches do not even beat the guessing baseline (36% F1 for classification). We use our gold data to fine-tune on target-language data for the classification task, achieving up to 90% F1 with both multilingual BERT and the bilingual embeddings. We also publish the annotated bilingual corpus.
2022
Transfer Learning Parallel Metaphor using Bilingual Embeddings
Maria Berger
Proceedings of the 3rd Workshop on Figurative Language Processing (FLP)
Automated metaphor detection in languages other than English is highly restricted because training corpora are comparatively rare. One way to overcome this problem is transfer learning. This paper gives an overview of transfer learning techniques applied to NLP. We first introduce types of transfer learning, then we present work focusing on: i) transfer learning with cross-lingual embeddings; ii) transfer learning in machine translation; and iii) transfer learning using pre-trained transformer models. The paper is complemented by first experiments that make use of bilingual embeddings generated from different sources of parallel data: we i) present the preparation of a parallel gold corpus; ii) examine the embedding spaces to search for metaphoric words cross-lingually; and iii) run first experiments in transfer learning of German metaphor from English labeled data only. Results show that the choice of data sources for training bilingual embeddings and the vocabulary covered by these embeddings are critical for learning metaphor cross-lingually.
2021
Increasing Sentence-Level Comprehension Through Text Classification of Epistemic Functions
Maria Berger, Elizabeth Goldstein
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop
Word embeddings capture semantic meaning of individual words. How to bridge word-level linguistic knowledge with sentence-level language representation is an open problem. This paper examines whether sentence-level representations can be achieved by building a custom sentence database focusing on one aspect of a sentence’s meaning. Our three separate semantic aspects are whether the sentence: (1) communicates a causal relationship, (2) indicates that two things are correlated with each other, and (3) expresses information or knowledge. The three classifiers provide epistemic information about a sentence’s content.
2020
European Language Grid: An Overview
Georg Rehm, Maria Berger, Ela Elsholz, Stefanie Hegele, Florian Kintzel, Katrin Marheinecke, Stelios Piperidis, Miltos Deligiannis, Dimitris Galanis, Katerina Gkirtzou, Penny Labropoulou, Kalina Bontcheva, David Jones, Ian Roberts, Jan Hajič, Jana Hamrlová, Lukáš Kačena, Khalid Choukri, Victoria Arranz, Andrejs Vasiļjevs, Orians Anvari, Andis Lagzdiņš, Jūlija Meļņika, Gerhard Backfried, Erinç Dikici, Miroslav Janosik, Katja Prinz, Christoph Prinz, Severin Stampler, Dorothea Thomas-Aniola, José Manuel Gómez-Pérez, Andres Garcia Silva, Christian Berrío, Ulrich Germann, Steve Renals, Ondrej Klejch
Proceedings of the Twelfth Language Resources and Evaluation Conference
With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented – by nation states, languages, verticals and sectors, significantly holding back its impact. The European Language Grid (ELG) project addresses this fragmentation by establishing the ELG as the primary platform for LT in Europe. The ELG is a scalable cloud platform, providing, in an easy-to-integrate way, access to hundreds of commercial and non-commercial LTs for all European languages, including running tools and services as well as data sets and resources. Once fully operational, it will enable the commercial and non-commercial European LT community to deposit and upload their technologies and data sets into the ELG, to deploy them through the grid, and to connect with other resources. The ELG will boost the Multilingual Digital Single Market towards a thriving European LT community, creating new jobs and opportunities. Furthermore, the ELG project organises two open calls for up to 20 pilot projects. It also sets up 32 national competence centres and the European LT Council for outreach and coordination purposes.
Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid
Penny Labropoulou, Katerina Gkirtzou, Maria Gavriilidou, Miltos Deligiannis, Dimitris Galanis, Stelios Piperidis, Georg Rehm, Maria Berger, Valérie Mapelli, Michael Rigault, Victoria Arranz, Khalid Choukri, Gerhard Backfried, José Manuel Gómez-Pérez, Andres Garcia-Silva
Proceedings of the Twelfth Language Resources and Evaluation Conference
The current scientific and technological landscape is characterised by the increasing availability of data resources and processing tools and services. In this setting, metadata have emerged as a key factor facilitating management, sharing and usage of such digital assets. In this paper we present ELG-SHARE, a rich metadata schema catering for the description of Language Resources and Technologies (processing and generation services and tools, models, corpora, term lists, etc.), as well as related entities (e.g., organizations, projects, supporting documents, etc.). The schema powers the European Language Grid platform that aims to be the primary hub and marketplace for industry-relevant Language Technology in Europe. ELG-SHARE has been based on various metadata schemas, vocabularies, and ontologies, as well as related recommendations and guidelines.
2018
Lexical and Semantic Features for Cross-lingual Text Reuse Classification: an Experiment in English and Latin Paraphrases
Maria Moritz, David Steding
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
A Method for Human-Interpretable Paraphrasticality Prediction
Maria Moritz, Johannes Hellrich, Sven Büchel
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
The detection of reused text is important in a wide range of disciplines. However, even as research in the field of plagiarism detection is constantly improving, heavily modified or paraphrased text is still challenging for current methodologies. For historical texts, these problems are even more severe, since text sources were often subject to stronger and more frequent modifications. Despite the need for tools to automate text criticism, e.g., tracing modifications in historical text, algorithmic support is still limited. While current techniques can tell if and how frequently a text has been modified, very little work has been done on determining the degree and kind of paraphrastic modification, despite such information being of substantial interest to scholars. We present a human-interpretable, feature-based method to measure paraphrastic modification. Evaluating our technique on three data sets, we find that our approach performs competitively with text similarity scores borrowed from machine translation evaluation, which are much harder to interpret.
2017
Ambiguity in Semantically Related Word Substitutions: an investigation in historical Bible translations
Maria Moritz, Marco Büchler
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language
2016
Non-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and its Application to Bible Reuse
Maria Moritz, Andreas Wiederhold, Barbara Pavlek, Yuri Bizzoni, Marco Büchler
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
2014
Open Philology at the University of Leipzig
Frederik Baumgardt, Giuseppe Celano, Gregory R. Crane, Stella Dee, Maryam Foradi, Emily Franzini, Greta Franzini, Monica Lent, Maria Moritz, Simona Stoyanova
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The Open Philology Project at the University of Leipzig aspires to re-assert the value of philology in its broadest sense. Philology signifies the widest possible use of the linguistic record to enable a deep understanding of the complete lived experience of humanity. Pragmatically, we focus on Greek and Latin because (1) substantial collections and services are already available for these languages, (2) substantial user communities exist (c. 35,000 unique users a month at the Perseus Digital Library), and (3) a European-based project is better positioned to process extensive cultural heritage materials in these languages than in Chinese or Sanskrit. The Open Philology Project has been designed with the hope that it can contribute to any historical language that survives within the human record. It includes three tasks: (1) the creation of an open, extensible, repurposable collection of machine-readable linguistic sources; (2) the development of dynamic textbooks that use annotated corpora to customize the vocabulary and grammar of texts that learners want to read, and at the same time engage students in collaboratively producing new annotated data; (3) the establishment of new workflows for, and forms of, publication, from individual annotations with argumentation to traditional publications with integrated machine-actionable data.