Alexander Mehler

2025

In pharmacovigilance, effective automation of medical data structuring, especially linking entities to standardized terminologies such as MedDRA, is critical. This challenge is rarely addressed for German data. With MedLinkDE we address German MedDRA entity linking for adverse drug reactions in a two-step approach: (1) retrieval of medical terms with fine-tuned embedding models, followed (2) by guided chain-of-thought re-ranking using LLMs. To this end, we introduce RENOde, a German real-world MedDRA dataset consisting of reportings from patients and healthcare professionals. To overcome the challenges posed by the linguistic diversity of these reports, we generate synthetic data mapping the two reporting styles of patients and healthcare professionals. Our embedding models, fine-tuned on these synthetic, quasi-personalized datasets, show competitive performance with real datasets in terms of accuracy at high top- recall, providing a robust basis for re-ranking. Our subsequent guided Chain of Thought (CoT) re-ranking, informed by MedDRA coding guidelines, improves entity linking accuracy by approximately 15% (Acc@1) compared to embedding-only strategies. In this way, our approach demonstrates the feasibility of entity linking in medical reports under the constraints of data scarcity by relying on synthetic data reflecting different informant roles of reporting persons.

pdf bib abs
Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs
Omar Momen | Manuel Schaaf | Alexander Mehler
Findings of the Association for Computational Linguistics: ACL 2025

Analysing texts spanning long periods of time is critical for researchers in historical linguistics and related disciplines. However, publicly available corpora suitable for such analyses are scarce. The Project Gutenberg (PG) corpus presents a significant yet underutilized opportunity in this context, due to the absence of accurate temporal metadata. We take advantage of language models and information retrieval to explore four sources of information – Open Web, Wikipedia, Open Library API, and PG books texts – to add missing temporal metadata to the PG corpus. Through 20 experiments employing state-of-the-art Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) methods, we estimate the production years of all PG books. We curate an enriched metadata repository for the PG corpus and propose a refined version for it, which includes 53,774 books with a total of 3.8 billion tokens in 11 languages, produced between 1600 and 2000. This work provides a new resource for computational linguistics and humanities studies focusing on diachronic analyses. The final dataset and all experiments data are publicly available (https://github.com/OmarMomen14/pg-dates).

pdf bib
Multimodal Docker Unified UIMA Interface: New Horizons for Distributed Microservice-Oriented Processing of Corpora using UIMA
Daniel Bundan | Giuseppe Abrami | Alexander Mehler
Proceedings of the 21st Conference on Natural Language Processing (KONVENS 2025): Long and Short Papers

pdf bib abs
Towards Unified, Dynamic and Annotation-based Visualisations and Exploration of Annotated Big Data Corpora with the Help of Unified Corpus Explorer
Kevin Bönisch | Giuseppe Abrami | Alexander Mehler
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

The annotation and exploration of large text corpora, both automatic and manual, presents significant challenges across multiple disciplines, including linguistics, digital humanities, biology, and legal science. These challenges are exacerbated by the heterogeneity of processing methods, which complicates corpus visualization, interaction, and integration. To address these issues, we introduce the Unified Corpus Explorer (UCE), a standardized, dockerized, open-source and dynamic Natural Language Processing (NLP) application designed for flexible and scalable corpus navigation. Herein, UCE utilizes the UIMA format for NLP annotations as a standardized input, constructing interfaces and features around those annotations while dynamically adapting to the corpora and their extracted annotations. We evaluate UCE based on a user study and demonstrate its versatility as a corpus explorer based on generative AI.

2024

pdf bib abs
Dependencies over Times and Tools (DoTT)
Andy Luecking | Giuseppe Abrami | Leon Hammerla | Marc Rahn | Daniel Baumartz | Steffen Eger | Alexander Mehler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Purpose: Based on the examples of English and German, we investigate to what extent parsers trained on modern variants of these languages can be transferred to older language levels without loss. Methods: We developed a treebank called DoTT (https://github.com/texttechnologylab/DoTT) which covers, roughly, the time period from 1800 until today, in conjunction with the further development of the annotation tool DependencyAnnotator. DoTT consists of a collection of diachronic corpora enriched with dependency annotations using 3 parsers, 6 pre-trained language models, 5 newly trained models for German, and two tag sets (TIGER and Universal Dependencies). To assess how the different parsers perform on texts from different time periods, we created a gold standard sample as a benchmark. Results: We found that the parsers/models perform quite well on modern texts (document-level LAS ranging from 82.89 to 88.54) and slightly worse on older texts, as expected (average document-level LAS 84.60 vs. 86.14), but not significantly. For German texts, the (German) TIGER scheme achieved slightly better results than UD. Conclusion: Overall, this result speaks for the transferability of parsers to past language levels, at least dating back until around 1800. This very transferability, it is however argued, means that studies of language change in the field of dependency syntax can draw on dependency distance but miss out on some grammatical phenomena.

pdf bib abs
German Parliamentary Corpus (GerParCor) Reloaded
Giuseppe Abrami | Mevlüt Bagci | Alexander Mehler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In 2022, the largest German-speaking corpus of parliamentary protocols from three different centuries, on a national and federal level from the countries of Germany, Austria, Switzerland and Liechtenstein, was collected and published - GerParCor. Through GerParCor, it became possible to provide for the first time various parliamentary protocols which were not available digitally and, moreover, could not be retrieved and processed in a uniform manner. Furthermore, GerParCor was additionally preprocessed using NLP methods and made available in XMI format. In this paper, GerParCor is significantly updated by including all new parliamentary protocols in the corpus, as well as adding and preprocessing further parliamentary protocols previously not covered, so that a period up to 1797 is now covered. Besides the integration of a new, state-of-the-art and appropriate NLP preprocessing for the handling of large text corpora, this update also provides an overview of the further reuse of GerParCor by presenting various provisioning capabilities such as API’s, among others.

pdf bib abs
German SRL: Corpus Construction and Model Training
Maxim Konca | Andy Luecking | Alexander Mehler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

A useful semantic role-annotated resource for training semantic role models for the German language is missing. We point out some problems of previous resources and provide a new one due to a combined translation and alignment process: The gold standard CoNLL-2012 semantic role annotations are translated into German. Semantic role labels are transferred due to alignment models. The resulting dataset is used to train a German semantic role model. With F1-scores around 0.7, the major roles achieve competitive evaluation scores, but avoid limitations of previous approaches. The described procedure can be applied to other languages as well.

2023

pdf bib abs
Unlocking the Heterogeneous Landscape of Big Data NLP with DUUI
Alexander Leonhardt | Giuseppe Abrami | Daniel Baumartz | Alexander Mehler
Findings of the Association for Computational Linguistics: EMNLP 2023

Automatic analysis of large corpora is a complex task, especially in terms of time efficiency. This complexity is increased by the fact that flexible, extensible text analysis requires the continuous integration of ever new tools. Since there are no adequate frameworks for these purposes in the field of NLP, and especially in the context of UIMA, that are not outdated or unusable for security reasons, we present a new approach to address the latter task: Docker Unified UIMA Interface (DUUI), a scalable, flexible, lightweight, and feature-rich framework for automatic distributed analysis of text corpora that leverages Big Data experience and virtualization with Docker. We evaluate DUUI’s communication approach against a state-of-the-art approach and demonstrate its outstanding behavior in terms of time efficiency, enabling the analysis of big text data.

2022

pdf bib abs
Tafsir Dataset: A Novel Multi-Task Benchmark for Named Entity Recognition and Topic Modeling in Classical Arabic Literature
Sajawel Ahmed | Rob van der Goot | Misbahur Rehman | Carl Kruse | Ömer Özsoy | Alexander Mehler | Gemma Roig
Proceedings of the 29th International Conference on Computational Linguistics

Various historical languages, which used to be lingua franca of science and arts, deserve the attention of current NLP research. In this work, we take the first data-driven steps towards this research line for Classical Arabic (CA) by addressing named entity recognition (NER) and topic modeling (TM) on the example of CA literature. We manually annotate the encyclopedic work of Tafsir Al-Tabari with span-based NEs, sentence-based topics, and span-based subtopics, thus creating the Tafsir Dataset with over 51,000 sentences, the first large-scale multi-task benchmark for CA. Next, we analyze our newly generated dataset, which we make open-source available, with current language models (lightweight BiLSTM, transformer-based MaChAmP) along a novel script compression method, thereby achieving state-of-the-art performance for our target task CA-NER. We also show that CA-TM from the perspective of historical topic models, which are central to Arabic studies, is very challenging. With this interdisciplinary work, we lay the foundations for future research on automatic analysis of CA literature.

pdf bib abs
German Parliamentary Corpus (GerParCor)
Giuseppe Abrami | Mevlüt Bagci | Leon Hammerla | Alexander Mehler
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliamentary Corpus (GerParCor). GerParCor is a genre-specific corpus of (predominantly historical) German-language parliamentary protocols from three centuries and four countries, including state and federal level data. In addition, GerParCor contains conversions of scanned protocols and, in particular, of protocols in Fraktur converted via an OCR process based on Tesseract. All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date. GerParCor is made available in the XMI format of the UIMA project. In this way, GerParCor can be used as a large corpus of historical texts in the field of political communication for various tasks in NLP.

pdf bib abs
I still have Time(s): Extending HeidelTime for German Texts
Andy Luecking | Manuel Stoeckel | Giuseppe Abrami | Alexander Mehler
Proceedings of the Thirteenth Language Resources and Evaluation Conference

HeidelTime is one of the most widespread and successful tools for detecting temporal expressions in texts. Since HeidelTime’s pattern matching system is based on regular expression, it can be extended in a convenient way. We present such an extension for the German resources of HeidelTime: HeidelTimeExt. The extension has been brought about by means of observing false negatives within real world texts and various time banks. The gain in coverage is 2.7 % or 8.5 %, depending on the admitted degree of potential overgeneralization. We describe the development of HeidelTimeExt, its evaluation on text samples from various genres, and share some linguistic observations. HeidelTimeExt can be obtained from https://github.com/texttechnologylab/heideltime.

pdf bib abs
What do Toothbrushes do in the Kitchen? How Transformers Think our World is Structured
Alexander Henlein | Alexander Mehler
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Transformer-based models are now predominant in NLP.They outperform approaches based on static models in many respects. This success has in turn prompted research that reveals a number of biases in the language models generated by transformers. In this paper we utilize this research on biases to investigate to what extent transformer-based language models allow for extracting knowledge about object relations (X occurs in Y; X consists of Z; action A involves using X).To this end, we compare contextualized models with their static counterparts. We make this comparison dependent on the application of a number of similarity measures and classifiers. Our results are threefold:Firstly, we show that the models combined with the different similarity measures differ greatly in terms of the amount of knowledge they allow for extracting. Secondly, our results suggest that similarity measures perform much worse than classifier-based approaches. Thirdly, we show that, surprisingly, static models perform almost as well as contextualized models – in some cases even better.

2021

pdf bib abs
Unleashing annotations with TextAnnotator: Multimedia, multi-perspective document views for ubiquitous annotation
Giuseppe Abrami | Alexander Henlein | Andy Lücking | Attila Kett | Pascal Adeberg | Alexander Mehler
Proceedings of the 17th Joint ACL - ISO Workshop on Interoperable Semantic Annotation

We argue that mainly due to technical innovation in the landscape of annotation tools, a conceptual change in annotation models and processes is also on the horizon. It is diagnosed that these changes are bound up with multi-media and multi-perspective facilities of annotation tools, in particular when considering virtual reality (VR) and augmented reality (AR) applications, their potential ubiquitous use, and the exploitation of externally trained natural language pre-processing methods. Such developments potentially lead to a dynamic and exploratory heuristic construction of the annotation process. With TextAnnotator an annotation suite is introduced which focuses on multi-mediality and multi-perspectivity with an interoperable set of task-specific annotation modules (e.g., for word classification, rhetorical structures, dependency trees, semantic roles, and more) and their linkage to VR and mobile implementations. The basic architecture and usage of TextAnnotator is described and related to the above mentioned shifts in the field.

2020

pdf bib abs
Transfer of ISOSpace into a 3D Environment for Annotations and Applications
Alexander Henlein | Giuseppe Abrami | Attila Kett | Alexander Mehler
Proceedings of the 16th Joint ACL-ISO Workshop on Interoperable Semantic Annotation

People’s visual perception is very pronounced and therefore it is usually no problem for them to describe the space around them in words. Conversely, people also have no problems imagining a concept of a described space. In recent years many efforts have been made to develop a linguistic concept for spatial and spatial-temporal relations. However, the systems have not really caught on so far, which in our opinion is due to the complex models on which they are based and the lack of available training data and automated taggers. In this paper we describe a project to support spatial annotation, which could facilitate annotation by its many functions, but also enrich it with many more information. This is to be achieved by an extension by means of a VR environment, with which spatial relations can be better visualized and connected with real objects. And we want to use the available data to develop a new state-of-the-art tagger and thus lay the foundation for future systems such as improved text understanding for Text2Scene.

pdf bib abs
On the Influence of Coreference Resolution on Word Embeddings in Lexical-semantic Evaluation Tasks
Alexander Henlein | Alexander Mehler
Proceedings of the Twelfth Language Resources and Evaluation Conference

Coreference resolution (CR) aims to find all spans of a text that refer to the same entity. The F1-Scores on these task have been greatly improved by new developed End2End-approaches and transformer networks. The inclusion of CR as a pre-processing step is expected to lead to improvements in downstream tasks. The paper examines this effect with respect to word embeddings. That is, we analyze the effects of CR on six different embedding methods and evaluate them in the context of seven lexical-semantic evaluation tasks and instantiation/hypernymy detection. Especially in the last tasks we hoped for a significant increase in performance. We show that all word embedding approaches do not benefit significantly from pronoun substitution. The measurable improvements are only marginal (around 0.5% in most test cases). We explain this result with the loss of contextual information, reduction of the relative occurrence of rare words and the lack of pronouns to be replaced.

pdf bib abs
TextAnnotator: A UIMA Based Tool for the Simultaneous and Collaborative Annotation of Texts
Giuseppe Abrami | Manuel Stoeckel | Alexander Mehler
Proceedings of the Twelfth Language Resources and Evaluation Conference

The annotation of texts and other material in the field of digital humanities and Natural Language Processing (NLP) is a common task of research projects. At the same time, the annotation of corpora is certainly the most time- and cost-intensive component in research projects and often requires a high level of expertise according to the research interest. However, for the annotation of texts, a wide range of tools is available, both for automatic and manual annotation. Since the automatic pre-processing methods are not error-free and there is an increasing demand for the generation of training data, also with regard to machine learning, suitable annotation tools are required. This paper defines criteria of flexibility and efficiency of complex annotations for the assessment of existing annotation tools. To extend this list of tools, the paper describes TextAnnotator, a browser-based, multi-annotation system, which has been developed to perform platform-independent multimodal annotations and annotate complex textual structures. The paper illustrates the current state of development of TextAnnotator and demonstrates its ability to evaluate annotation quality (inter-annotator agreement) at runtime. In addition, it will be shown how annotations of different users can be performed simultaneously and collaboratively on the same document from different platforms using UIMA as the basis for annotation.

pdf bib abs
Recognizing Sentence-level Logical Document Structures with the Help of Context-free Grammars
Jonathan Hildebrand | Wahed Hemati | Alexander Mehler
Proceedings of the Twelfth Language Resources and Evaluation Conference

Current sentence boundary detectors split documents into sequentially ordered sentences by detecting their beginnings and ends. Sentences, however, are more deeply structured even on this side of constituent and dependency structure: they can consist of a main sentence and several subordinate clauses as well as further segments (e.g. inserts in parentheses); they can even recursively embed whole sentences and then contain multiple sentence beginnings and ends. In this paper, we introduce a tool that segments sentences into tree structures to detect this type of recursive structure. To this end, we retrain different constituency parsers with the help of modified training data to transform them into sentence segmenters. With these segmenters, documents are mapped to sequences of sentence-related “logical document structures”. The resulting segmenters aim to improve downstream tasks by providing additional structural information. In this context, we experiment with German dependency parsing. We show that for certain sentence categories, which can be determined automatically, improvements in German dependency parsing can be achieved using our segmenter for preprocessing. The assumption suggests that improvements in other languages and tasks can be achieved.

pdf bib abs
Voting for POS tagging of Latin texts: Using the flair of FLAIR to better Ensemble Classifiers by Example of Latin
Manuel Stoeckel | Alexander Henlein | Wahed Hemati | Alexander Mehler
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

Despite the great importance of the Latin language in the past, there are relatively few resources available today to develop modern NLP tools for this language. Therefore, the EvaLatin Shared Task for Lemmatization and Part-of-Speech (POS) tagging was published in the LT4HALA workshop. In our work, we dealt with the second EvaLatin task, that is, POS tagging. Since most of the available Latin word embeddings were trained on either few or inaccurate data, we trained several embeddings on better data in the first step. Based on these embeddings, we trained several state-of-the-art taggers and used them as input for an ensemble classifier called LSTMVoter. We were able to achieve the best results for both the cross-genre and the cross-time task (90.64% and 87.00%) without using additional annotated data (closed modality). In the meantime, we further improved the system and achieved even better results (96.91% on classical, 90.87% on cross-genre and 87.35% on cross-time).

2019

pdf bib abs
When Specialization Helps: Using Pooled Contextualized Embeddings to Detect Chemical and Biomedical Entities in Spanish
Manuel Stoeckel | Wahed Hemati | Alexander Mehler
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish medical texts. We train a state-of-the-art BiLSTM-CRF sequence tagger with stacked Pooled Contextualized Embeddings, word and sub-word embeddings using the open-source framework FLAIR. We present a new corpus composed of articles and papers from Spanish health science journals, termed the Spanish Health Corpus, and use it to train domain-specific embeddings which we incorporate in our model training. We achieve a result of 89.76% F1-score using pre-trained embeddings and are able to improve these results to 90.52% F1-score using specialized embeddings.

pdf bib abs
BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature
Sajawel Ahmed | Manuel Stoeckel | Christine Driller | Adrian Pachzelt | Alexander Mehler
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

The Specialized Information Service Biodiversity Research (BIOfid) has been launched to mobilize valuable biological data from printed literature hidden in German libraries for over the past 250 years. In this project, we annotate German texts converted by OCR from historical scientific literature on the biodiversity of plants, birds, moths and butterflies. Our work enables the automatic extraction of biological information previously buried in the mass of papers and volumes. For this purpose, we generated training data for the tasks of Named Entity Recognition (NER) and Taxa Recognition (TR) in biological documents. We use this data to train a number of leading machine learning tools and create a gold standard for TR in biodiversity literature. More specifically, we perform a practical analysis of our newly generated BIOfid dataset through various downstream-task evaluations and establish a new state of the art for TR with 80.23% F-score. In this sense, our paper lays the foundations for future work in the field of information extraction in biology texts.

2018

pdf bib abs
LTV: Labeled Topic Vector
Daniel Baumartz | Tolga Uslu | Alexander Mehler
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

In this paper we present LTV, a website and API that generates labeled topic classifications based on the Dewey Decimal Classification (DDC), an international standard for topic classification in libraries. We introduce nnDDC, a largely language-independent natural network-based classifier for DDC, which we optimized using a wide range of linguistic features to achieve an F-score of 87.4%. To show that our approach is language-independent, we evaluate nnDDC using up to 40 different languages. We derive a topic model based on nnDDC, which generates probability distributions over semantic units for any input on sense-, word- and text-level. Unlike related approaches, however, these probabilities are estimated by means of nnDDC so that each dimension of the resulting vector representation is uniquely labeled by a DDC class. In this way, we introduce a neural network-based Classifier-Induced Semantic Space (nnCISS).

pdf bib
FastSense: An Efficient Word Sense Disambiguation Classifier
Tolga Uslu | Alexander Mehler | Daniel Baumartz | Wahed Hemati
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A UIMA Database Interface for Managing NLP-related Text Annotations
Giuseppe Abrami | Alexander Mehler
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
TreeAnnotator: Versatile Visual Annotation of Hierarchical Text Relations
Philipp Helfrich | Elias Rieb | Giuseppe Abrami | Andy Lücking | Alexander Mehler
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
WikiDragon: A Java Framework For Diachronic Content And Network Analysis Of MediaWikis
Rüdiger Gleim | Alexander Mehler | Sung Y. Song
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
TextImager as a Generic Interface to R
Tolga Uslu | Wahed Hemati | Alexander Mehler | Daniel Baumartz
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

R is a very powerful framework for statistical modeling. Thus, it is of high importance to integrate R with state-of-the-art tools in NLP. In this paper, we present the functionality and architecture of such an integration by means of TextImager. We use the OpenCPU API to integrate R based on our own R-Server. This allows for communicating with R-packages and combining them with TextImager’s NLP-components.

2016

pdf bib abs
Language classification from bilingual word embedding graphs
Steffen Eger | Armin Hoenen | Alexander Mehler
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We study the role of the second language in bilingual word embeddings in monolingual semantic evaluation tasks. We find strongly and weakly positive correlations between down-stream task performance and second language similarity to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic language classification and that joint semantic spaces vary in meaningful ways across second languages. Our results support the hypothesis that semantic language similarity is influenced by both structural similarity as well as geography/contact.

pdf bib abs
TextImager: a Distributed UIMA-based System for NLP
Wahed Hemati | Tolga Uslu | Alexander Mehler
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

More and more disciplines require NLP tools for performing automatic text analyses on various levels of linguistic resolution. However, the usage of established NLP frameworks is often hampered for several reasons: in most cases, they require basic to sophisticated programming skills, interfere with interoperability due to using non-standard I/O-formats and often lack tools for visualizing computational results. This makes it difficult especially for humanities scholars to use such frameworks. In order to cope with these challenges, we present TextImager, a UIMA-based framework that offers a range of NLP and visualization tools by means of a user-friendly GUI. Using TextImager requires no programming skills.

pdf bib abs
Finding Recurrent Features of Image Schema Gestures: the FIGURE corpus
Andy Luecking | Alexander Mehler | Désirée Walther | Marcel Mauri | Dennis Kurfürst
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The Frankfurt Image GestURE corpus (FIGURE) is introduced. The corpus data is collected in an experimental setting where 50 naive participants spontaneously produced gestures in response to five to six terms from a total of 27 stimulus terms. The stimulus terms have been compiled mainly from image schemata from psycholinguistics, since such schemata provide a panoply of abstract contents derived from natural language use. The gestures have been annotated for kinetic features. FIGURE aims at finding (sets of) stable kinetic feature configurations associated with the stimulus terms. Given such configurations, they can be used for designing HCI gestures that go beyond pre-defined gesture vocabularies or touchpad gestures. It is found, for instance, that movement trajectories are far more informative than handshapes, speaking against purely handshape-based HCI vocabularies. Furthermore, the mean temporal duration of hand and arm movements associated vary with the stimulus terms, indicating a dynamic dimension not covered by vocabulary-based approaches. Descriptive results are presented and related to findings from gesture studies and natural language dialogue.

pdf bib abs
Lemmatization and Morphological Tagging in German and Latin: A Comparison and a Survey of the State-of-the-art
Steffen Eger | Rüdiger Gleim | Alexander Mehler
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper relates to the challenge of morphological tagging and lemmatization in morphologically rich languages by example of German and Latin. We focus on the question what a practitioner can expect when using state-of-the-art solutions out of the box. Moreover, we contrast these with old(er) methods and implementations for POS tagging. We examine to what degree recent efforts in tagger development are reflected by improved accuracies ― and at what cost, in terms of training and processing time. We also conduct in-domain vs. out-domain evaluation. Out-domain evaluations are particularly insightful because the distribution of the data which is being tagged by a user will typically differ from the distribution on which the tagger has been trained. Furthermore, two lemmatization techniques are evaluated. Finally, we compare pipeline tagging vs. a tagging approach that acknowledges dependencies between inflectional categories.

pdf bib abs
TLT-CRF: A Lexicon-supported Morphological Tagger for Latin Based on Conditional Random Fields
Tim vor der Brück | Alexander Mehler
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a morphological tagger for Latin, called TTLab Latin Tagger based on Conditional Random Fields (TLT-CRF) which uses a large Latin lexicon. Beyond Part of Speech (PoS), TLT-CRF tags eight inflectional categories of verbs, adjectives or nouns. It utilizes a statistical model based on CRFs together with a rule interpreter that addresses scenarios of sparse training data. We present results of evaluating TLT-CRF to answer the question what can be learnt following the paradigm of 1st order CRFs in conjunction with a large lexical resource and a rule interpreter. Furthermore, we investigate the contigency of representational features and targeted parts of speech to learn about selective features.

pdf bib abs
TGermaCorp – A (Digital) Humanities Resource for (Computational) Linguistics
Andy Luecking | Armin Hoenen | Alexander Mehler
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

TGermaCorp is a German text corpus whose primary sources are collected from German literature texts which date from the sixteenth century to the present. The corpus is intended to represent its target language (German) in syntactic, lexical, stylistic and chronological diversity. For this purpose, it is hand-annotated on several linguistic layers, including POS, lemma, named entities, multiword expressions, clauses, sentences and paragraphs. In order to introduce TGermaCorp in comparison to more homogeneous corpora of contemporary everyday language, quantitative assessments of syntactic and lexical diversity are provided. In this respect, TGermaCorp contributes to establishing characterising features for resource descriptions, which is needed for keeping track of a meaningful comparison of the ever-growing number of natural language resources. The assessments confirm the special role of proper names, whose propagation in text may influence lexical and syntactic diversity measures in rather trivial ways. TGermaCorp will be made available via hucompute.org.

pdf bib
On the Linearity of Semantic Change: Investigating Meaning Variation via Dynamic Graph Models
Steffen Eger | Alexander Mehler
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Text2voronoi: An Image-driven Approach to Differential Diagnosis
Alexander Mehler | Tolga Uslu | Wahed Hemati
Proceedings of the 5th Workshop on Vision and Language

2015

pdf bib
Towards Semantic Language Classification: Inducing and Clustering Semantic Association Networks from Europarl
Steffen Eger | Niko Schenk | Alexander Mehler
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

pdf bib
Lexicon-assisted tagging and lemmatization in Latin: A comparison of six taggers and two lemmatization methods
Steffen Eger | Tim vor der Brück | Alexander Mehler
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

2014

pdf bib abs
ColLex.en: Automatically Generating and Evaluating a Full-form Lexicon for English
Tim vor der Brück | Alexander Mehler | Zahurul Islam
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The paper describes a procedure for the automatic generation of a large full-form lexicon of English. We put emphasis on two statistical methods to lexicon extension and adjustment: in terms of a letter-based HMM and in terms of a detector of spelling variants and misspellings. The resulting resource, \collexen, is evaluated with respect to two tasks: text categorization and lexical coverage by example of the SUSANNE corpus and the \openanc.

2012

pdf bib abs
Customization of the Europarl Corpus for Translation Studies
Zahurul Islam | Alexander Mehler
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Currently, the area of translation studies lacks corpora by which translation scholars can validate their theoretical claims, for example, regarding the scope of the characteristics of the translation relation. In this paper, we describe a customized resource in the area of translation studies that mainly addresses research on the properties of the translation relation. Our experimental results show that the Type-Token-Ratio (TTR) is not a universally valid indicator of the simplification of translation.

pdf bib
Text Readability Classification of Textbooks of a Low-Resource Language
Zahurul Islam | Alexander Mehler | Rashedur Rahman
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

2010

pdf bib abs
Computational Linguistics for Mere Mortals - Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities
Rüdiger Gleim | Alexander Mehler
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Delivering linguistic resources and easy-to-use methods to a broad public in the humanities is a challenging task. On the one hand users rightly demand easy to use interfaces but on the other hand want to have access to the full flexibility and power of the functions being offered. Even though a growing number of excellent systems exist which offer convenient means to use linguistic resources and methods, they usually focus on a specific domain, as for example corpus exploration or text categorization. Architectures which address a broad scope of applications are still rare. This article introduces the eHumanities Desktop, an online system for corpus management, processing and analysis which aims at bridging the gap between powerful command line tools and intuitive user interfaces.

pdf bib abs
The Ariadne System: A Flexible and Extensible Framework for the Modeling and Storage of Experimental Data in the Humanities.
Peter Menke | Alexander Mehler
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

During the last decades, interdisciplinarity has become a central keyword in research. As a consequence, many concepts, theories and scientific methods get in contact with each other, resulting in many different strategies and variants of acquiring, structuring, and sharing data sets. To handle these kind of data sets, his paper introduces the Ariadne Corpus Management System that allows researchers to manage and create multimodal corpora from multiple heteogeneous data sources. After an introductory demarcation from other annotation and corpus management tools, the underlying data model is presented which enables users to represent and process heterogeneous data sets within a single, consistent framework. Secondly, a set of automatized procedures is described that offers assistance to researchers in various data-related use cases. Thirdly, an approach to easy yet powerful data retrieval is introduced in form of a specialised querying language for multimodal data. Finally, the web-based graphical user interface and its advantages are illustrated.

2009

pdf bib
eHumanities Desktop - An Online System for Corpus Management and Analysis in Support of Computing in the Humanities
Rüdiger Gleim | Ulli Waltinger | Alexandra Ernst | Alexander Mehler | Tobias Feith | Dietmar Esch
Proceedings of the Demonstrations Session at EACL 2009

2008

We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres.

pdf bib abs
A Unified Database of Dependency Treebanks: Integrating, Quantifying & Evaluating Dependency Data
Olga Pustylnikov | Alexander Mehler | Rüdiger Gleim
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes a database of 11 dependency treebanks which were unified by means of a two-dimensional graph format. The format was evaluated with respect to storage-complexity on the one hand, and efficiency of data access on the other hand. An example of how the treebanks can be integrated within a unique interface is given by means of the DTDB interface.