Simon Clematide


2022

pdf bib
Evaluation of Transfer Learning and Domain Adaptation for Analyzing German-Speaking Job Advertisements
Ann-Sophie Gnehm | Eva Bühlmann | Simon Clematide
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper presents text mining approaches on German-speaking job advertisements to enable social science research on the development of the labour market over the last 30 years. In order to build text mining applications providing information about profession and main task of a job, as well as experience and ICT skills needed, we experiment with transfer learning and domain adaptation. Our main contribution consists in building language models which are adapted to the domain of job advertisements, and their assessment on a broad range of machine learning problems. Our findings show the large value of domain adaptation in several respects. First, it boosts the performance of fine-tuned task-specific models consistently over all evaluation experiments. Second, it helps to mitigate rapid data shift over time in our special domain, and enhances the ability to learn from small updates with new, labeled task data. Third, domain-adaptation of language models is efficient: With continued in-domain pre-training we are able to outperform general-domain language models pre-trained on ten times more data. We share our domain-adapted language models and data with the research community.

pdf bib
Evaluation of HTR models without Ground Truth Material
Phillip Benjamin Ströbel | Martin Volk | Simon Clematide | Raphael Schwitter | Tobias Hodel | David Schoch
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The evaluation of Handwritten Text Recognition (HTR) models during their development is straightforward: because HTR is a supervised problem, the usual data split into training, validation, and test data sets allows the evaluation of models in terms of accuracy or error rates. However, the evaluation process becomes tricky as soon as we switch from development to application. A compilation of a new (and forcibly smaller) ground truth (GT) from a sample of the data that we want to apply the model on and the subsequent evaluation of models thereon only provides hints about the quality of the recognised text, as do confidence scores (if available) the models return. Moreover, if we have several models at hand, we face a model selection problem since we want to obtain the best possible result during the application phase. This calls for GT-free metrics to select the best model, which is why we (re-)introduce and compare different metrics, from simple, lexicon-based to more elaborate ones using standard language models and masked language models (MLM). We show that MLM-based evaluation can compete with lexicon-based methods, with the advantage that large and multilingual transformers are readily available, thus making compiling lexical resources for other metrics superfluous.

pdf bib
On Isotropy Calibration of Transformer Models
Yue Ding | Karolis Martinkus | Damian Pascual | Simon Clematide | Roger Wattenhofer
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Different studies of the embedding space of transformer models suggest that the distribution of contextual representations is highly anisotropic - the embeddings are distributed in a narrow cone. Meanwhile, static word representations (e.g., Word2Vec or GloVe) have been shown to benefit from isotropic spaces. Therefore, previous work has developed methods to calibrate the embedding space of transformers in order to ensure isotropy. However, a recent study (Cai et al. 2021) shows that the embedding space of transformers is locally isotropic, which suggests that these models are already capable of exploiting the expressive capacity of their embedding space. In this work, we conduct an empirical evaluation of state-of-the-art methods for isotropy calibration on transformers and find that they do not provide consistent improvements across models and tasks. These results support the thesis that, given the local isotropy, transformers do not benefit from additional isotropy calibration.

pdf bib
CLUZH at SIGMORPHON 2022 Shared Tasks on Morpheme Segmentation and Inflection Generation
Silvan Wehrli | Simon Clematide | Peter Makarov
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes the submissions of the team of the Department of Computational Linguistics, University of Zurich, to the SIGMORPHON 2022 Shared Tasks on Morpheme Segmentation and Inflection Generation. Our submissions use a character-level neural transducer that operates over traditional edit actions. While this model has been found particularly wellsuited for low-resource settings, using it with large data quantities has been difficult. Existing implementations could not fully profit from GPU acceleration and did not efficiently implement mini-batch training, which could be tricky for a transition-based system. For this year’s submission, we have ported the neural transducer to PyTorch and implemented true mini-batch training. This has allowed us to successfully scale the approach to large data quantities and conduct extensive experimentation. We report competitive results for morpheme segmentation (including sharing first place in part 2 of the challenge). We also demonstrate that reducing sentence-level morpheme segmentation to a word-level problem is a simple yet effective strategy. Additionally, we report strong results in inflection generation (the overall best result for large training sets in part 1, the best results in low-resource learning trajectories in part 2). Our code is publicly available.

pdf bib
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)
Piotr Banski | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)

2021

pdf bib
Results of the Second SIGMORPHON Shared Task on Multilingual Grapheme-to-Phoneme Conversion
Lucas F.E. Ashby | Travis M. Bartley | Simon Clematide | Luca Del Signore | Cameron Gibson | Kyle Gorman | Yeonju Lee-Sikka | Peter Makarov | Aidan Malanoski | Sean Miller | Omar Ortiz | Reuben Raff | Arundhati Sengupta | Bora Seo | Yulia Spektor | Winnie Yan
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Grapheme-to-phoneme conversion is an important component in many speech technologies, but until recently there were no multilingual benchmarks for this task. The second iteration of the SIGMORPHON shared task on multilingual grapheme-to-phoneme conversion features many improvements from the previous year’s task (Gorman et al. 2020), including additional languages, a stronger baseline, three subtasks varying the amount of available resources, extensive quality assurance procedures, and automated error analyses. Four teams submitted a total of thirteen systems, at best achieving relative reductions of word error rate of 11% in the high-resource subtask and 4% in the low-resource subtask.

pdf bib
CLUZH at SIGMORPHON 2021 Shared Task on Multilingual Grapheme-to-Phoneme Conversion: Variations on a Baseline
Simon Clematide | Peter Makarov
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes the submission by the team from the Department of Computational Linguistics, Zurich University, to the Multilingual Grapheme-to-Phoneme Conversion (G2P) Task 1 of the SIGMORPHON 2021 challenge in the low and medium settings. The submission is a variation of our 2020 G2P system, which serves as the baseline for this year’s challenge. The system is a neural transducer that operates over explicit edit actions and is trained with imitation learning. For this challenge, we experimented with the following changes: a) emitting phoneme segments instead of single character phonemes, b) input character dropout, c) a mogrifier LSTM decoder (Melis et al., 2019), d) enriching the decoder input with the currently attended input character, e) parallel BiLSTM encoders, and f) an adaptive batch size scheduler. In the low setting, our best ensemble improved over the baseline, however, in the medium setting, the baseline was stronger on average, although for certain languages improvements could be observed.

pdf bib
Searching for Legal Documents at Paragraph Level: Automating Label Generation and Use of an Extended Attention Mask for Boosting Neural Models of Semantic Similarity
Li Tang | Simon Clematide
Proceedings of the Natural Legal Language Processing Workshop 2021

Searching for legal documents is a specialized Information Retrieval task that is relevant for expert users (lawyers and their assistants) and for non-expert users. By searching previous court decisions (cases), a user can better prepare the legal reasoning of a new case. Being able to search using a natural language text snippet instead of a more artificial query could help to prevent query formulation issues. Also, if semantic similarity could be modeled beyond exact lexical matches, more relevant results can be found even if the query terms don’t match exactly. For this domain, we formulated a task to compare different ways of modeling semantic similarity at paragraph level, using neural and non-neural systems. We compared systems that encode the query and the search collection paragraphs as vectors, enabling the use of cosine similarity for results ranking. After building a German dataset for cases and statutes from Switzerland, and extracting citations from cases to statutes, we developed an algorithm for estimating semantic similarity at paragraph level, using a link-based similarity method. When evaluating different systems in this way, we find that semantic similarity modeling by neural systems can be boosted with an extended attention mask that quenches noise in the inputs.

2020

pdf bib
Semi-supervised Contextual Historical Text Normalization
Peter Makarov | Simon Clematide
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Historical text normalization, the task of mapping historical word forms to their modern counterparts, has recently attracted a lot of interest (Bollmann, 2019; Tang et al., 2018; Lusetti et al., 2018; Bollmann et al., 2018;Robertson and Goldwater, 2018; Bollmannet al., 2017; Korchagina, 2017). Yet, virtually all approaches suffer from the two limitations: 1) They consider a fully supervised setup, often with impractically large manually normalized datasets; 2) Normalization happens on words in isolation. By utilizing a simple generative normalization model and obtaining powerful contextualization from the target-side language model, we train accurate models with unlabeled historical data. In realistic training scenarios, our approach often leads to reduction in manually normalized data at the same accuracy levels.

pdf bib
Text Zoning and Classification for Job Advertisements in German, French and English
Ann-Sophie Gnehm | Simon Clematide
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

We present experiments to structure job ads into text zones and classify them into pro- fessions, industries and management functions, thereby facilitating social science analyses on labor marked demand. Our main contribution are empirical findings on the benefits of contextualized embeddings and the potential of multi-task models for this purpose. With contextualized in-domain embeddings in BiLSTM-CRF models, we reach an accuracy of 91% for token-level text zoning and outperform previous approaches. A multi-tasking BERT model performs well for our classification tasks. We further compare transfer approaches for our multilingual data.

pdf bib
Language Resources for Historical Newspapers: the Impresso Collection
Maud Ehrmann | Matteo Romanello | Simon Clematide | Phillip Benjamin Ströbel | Raphaël Barman
Proceedings of the Twelfth Language Resources and Evaluation Conference

Following decades of massive digitization, an unprecedented amount of historical document facsimiles can now be retrieved and accessed via cultural heritage online portals. If this represents a huge step forward in terms of preservation and accessibility, the next fundamental challenge– and real promise of digitization– is to exploit the contents of these digital assets, and therefore to adapt and develop appropriate language technologies to search and retrieve information from this ‘Big Data of the Past’. Yet, the application of text processing tools on historical documents in general, and historical newspapers in particular, poses new challenges, and crucially requires appropriate language resources. In this context, this paper presents a collection of historical newspaper data sets composed of text and image resources, curated and published within the context of the ‘impresso - Media Monitoring of the Past’ project. With corpora, benchmarks, semantic annotations and language models in French, German and Luxembourgish covering ca. 200 years, the objective of the impresso resource collection is to contribute to historical language resources, and thereby strengthen the robustness of approaches to non-standard inputs and foster efficient processing of historical documents.

pdf bib
How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR
Phillip Benjamin Ströbel | Simon Clematide | Martin Volk
Proceedings of the Twelfth Language Resources and Evaluation Conference

Recent advances in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) have led to more accurate textrecognition of historical documents. The Digital Humanities heavily profit from these developments, but they still struggle whenchoosing from the plethora of OCR systems available on the one hand and when defining workflows for their projects on the other hand.In this work, we present our approach to build a ground truth for a historical German-language newspaper published in black letter. Wealso report how we used it to systematically evaluate the performance of different OCR engines. Additionally, we used this ground truthto make an informed estimate as to how much data is necessary to achieve high-quality OCR results. The outcomes of our experimentsshow that HTR architectures can successfully recognise black letter text and that a ground truth size of 50 newspaper pages suffices toachieve good OCR accuracy. Moreover, our models perform equally well on data they have not seen during training, which means thatadditional manual correction for diverging data is superfluous.

pdf bib
CLUZH at SIGMORPHON 2020 Shared Task on Multilingual Grapheme-to-Phoneme Conversion
Peter Makarov | Simon Clematide
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper describes the submission by the team from the Institute of Computational Linguistics, Zurich University, to the Multilingual Grapheme-to-Phoneme Conversion (G2P) Task of the SIGMORPHON 2020 challenge. The submission adapts our system from the 2018 edition of the SIGMORPHON shared task. Our system is a neural transducer that operates over explicit edit actions and is trained with imitation learning. It is well-suited for morphological string transduction partly because it exploits the fact that the input and output character alphabets overlap. The challenge posed by G2P has been to adapt the model and the training procedure to work with disjoint alphabets. We adapt the model to use substitution edits and train it with a weighted finite-state transducer acting as the expert policy. An ensemble of such models produces competitive results on G2P. Our submission ranks second out of 23 submissions by a total of nine teams.

pdf bib
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
Piotr Bański | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen | Ines Pisetta
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

2019

pdf bib
Geotagging a Diachronic Corpus of Alpine Texts: Comparing Distinct Approaches to Toponym Recognition
Tannon Kew | Anastassia Shaitarova | Isabel Meraner | Janis Goldzycher | Simon Clematide | Martin Volk
Proceedings of the Workshop on Language Technology for Digital Historical Archives

Geotagging historic and cultural texts provides valuable access to heritage data, enabling location-based searching and new geographically related discoveries. In this paper, we describe two distinct approaches to geotagging a variety of fine-grained toponyms in a diachronic corpus of alpine texts. By applying a traditional gazetteer-based approach, aided by a few simple heuristics, we attain strong high-precision annotations. Using the output of this earlier system, we adopt a state-of-the-art neural approach in order to facilitate the detection of new toponyms on the basis of context. Additionally, we present the results of preliminary experiments on integrating a small amount of crowdsourced annotations to improve overall performance of toponym recognition in our heritage corpus.

2018

pdf bib
Imitation Learning for Neural Morphological String Transduction
Peter Makarov | Simon Clematide
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We employ imitation learning to train a neural transition-based string transducer for morphological tasks such as inflection generation and lemmatization. Previous approaches to training this type of model either rely on an external character aligner for the production of gold action sequences, which results in a suboptimal model due to the unwarranted dependence on a single gold action sequence despite spurious ambiguity, or require warm starting with an MLE model. Our approach only requires a simple expert policy, eliminating the need for a character aligner or warm start. It also addresses familiar MLE training biases and leads to strong and state-of-the-art performance on several benchmarks.

pdf bib
UZH at CoNLLSIGMORPHON 2018 Shared Task on Universal Morphological Reinflection
Peter Makarov | Simon Clematide
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

pdf bib
Strategies and Challenges for Crowdsourcing Regional Dialect Perception Data for Swiss German and Swiss French
Jean-Philippe Goldman | Simon Clematide | Mathieu Avanzi | Raphael Tandler
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Neural Transition-based String Transduction for Limited-Resource Setting in Morphology
Peter Makarov | Simon Clematide
Proceedings of the 27th International Conference on Computational Linguistics

We present a neural transition-based model that uses a simple set of edit actions (copy, delete, insert) for morphological transduction tasks such as inflection generation, lemmatization, and reinflection. In a large-scale evaluation on four datasets and dozens of languages, our approach consistently outperforms state-of-the-art systems on low and medium training-set sizes and is competitive in the high-resource setting. Learning to apply a generic copy action enables our approach to generalize quickly from a few data points. We successfully leverage minimum risk training to compensate for the weaknesses of MLE parameter learning and neutralize the negative effects of training a pipeline with a separate character aligner.

2017

pdf bib
Stance Detection in Facebook Posts of a German Right-wing Party
Manfred Klenner | Don Tuggener | Simon Clematide
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

We argue that in order to detect stance, not only the explicit attitudes of the stance holder towards the targets are crucial. It is the whole narrative the writer drafts that counts, including the way he hypostasizes the discourse referents: as benefactors or villains, as victims or beneficiaries. We exemplify the ability of our system to identify targets and detect the writer’s stance towards them on the basis of about 100 000 Facebook posts of a German right-wing party. A reader and writer model on top of our verb-based attitude extraction directly reveal stance conflicts.

pdf bib
CLUZH at VarDial GDI 2017: Testing a Variety of Machine Learning Tools for the Classification of Swiss German Dialects
Simon Clematide | Peter Makarov
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

Our submissions for the GDI 2017 Shared Task are the results from three different types of classifiers: Naïve Bayes, Conditional Random Fields (CRF), and Support Vector Machine (SVM). Our CRF-based run achieves a weighted F1 score of 65% (third rank) being beaten by the best system by 0.9%. Measured by classification accuracy, our ensemble run (Naïve Bayes, CRF, SVM) reaches 67% (second rank) being 1% lower than the best system. We also describe our experiments with Recurrent Neural Network (RNN) architectures. Since they performed worse than our non-neural approaches we did not include them in the submission.

pdf bib
Align and Copy: UZH at SIGMORPHON 2017 Shared Task for Morphological Reinflection
Peter Makarov | Tatiana Ruzsics | Simon Clematide
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

2016

pdf bib
Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus
Simon Clematide | Lenz Furrer | Martin Volk
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Crowdsourcing approaches for post-correction of OCR output (Optical Character Recognition) have been successfully applied to several historic text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized yearbooks of the Swiss Alpine Club (SAC) from the 19th century. This multilingual heritage corpus consists of Alpine texts mainly written in German and French, all typeset in Antiqua font. Finding and engaging volunteers for correcting large amounts of pages into high quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous efforts for keeping the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 month, achieving an OCR gold standard with a systematically evaluated accuracy of 99.7% on the word level. The crowdsourced OCR gold standard and the corresponding original OCR recognition results from Abby FineReader 7 for each page are available as a resource. Additionally, the scanned images (300dpi) of all pages are included in order to facilitate tests with other OCR software.

pdf bib
How Factuality Determines Sentiment Inferences
Manfred Klenner | Simon Clematide
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2014

pdf bib
Collaboratively Annotating Multilingual Parallel Corpora in the Biomedical Domain—some MANTRAs
Johannes Hellrich | Simon Clematide | Udo Hahn | Dietrich Rebholz-Schuhmann
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The coverage of multilingual biomedical resources is high for the English language, yet sparse for non-English languages―an observation which holds for seemingly well-resourced, yet still dramatically low-resourced ones such as Spanish, French or German but even more so for really under-resourced ones such as Dutch. We here present experimental results for automatically annotating parallel corpora and simultaneously acquiring new biomedical terminology for these under-resourced non-English languages on the basis of two types of language resources, namely parallel corpora (i.e. full translation equivalents at the document unit level) and (admittedly deficient) multilingual biomedical terminologies, with English as their anchor language. We automatically annotate these parallel corpora with biomedical named entities by an ensemble of named entity taggers and harmonize non-identical annotations the outcome of which is a so-called silver standard corpus. We conclude with an empirical assessment of this approach to automatically identify both known and new terms in multilingual corpora.

pdf bib
Using Large Biomedical Databases as Gold Annotations for Automatic Relation Extraction
Tilia Ellendorff | Fabio Rinaldi | Simon Clematide
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We show how to use large biomedical databases in order to obtain a gold standard for training a machine learning system over a corpus of biomedical text. As an example we use the Comparative Toxicogenomics Database (CTD) and describe by means of a short case study how the obtained data can be applied. We explain how we exploit the structure of the database for compiling training material and a testset. Using a Naive Bayes document classification approach based on words, stem bigrams and MeSH descriptors we achieve a macro-average F-score of 61% on a subset of 8 action terms. This outperforms a baseline system based on a lookup of stemmed keywords by more than 20%. Furthermore, we present directions of future work, taking the described system as a vantage point. Future work will be aiming towards a weakly supervised system capable of discovering complete biomedical interactions and events.

pdf bib
Detecting Code-Switching in a Multilingual Alpine Heritage Corpus
Martin Volk | Simon Clematide
Proceedings of the First Workshop on Computational Approaches to Code Switching

2013

pdf bib
UZH in BioNLP 2013
Gerold Schneider | Simon Clematide | Tilia Ellendorff | Don Tuggener | Fabio Rinaldi | Gintarė Grigonytė
Proceedings of the BioNLP Shared Task 2013 Workshop

pdf bib
A Pilot Study on the Semantic Classification of Two German Prepositions: Combining Monolingual and Multilingual Evidence
Simon Clematide | Manfred Klenner
Proceedings of the International Conference Recent Advances in Natural Language Processing RANLP 2013

2012

pdf bib
MLSA — A Multi-layered Reference Corpus for German Sentiment Analysis
Simon Clematide | Stefan Gindl | Manfred Klenner | Stefanos Petrakis | Robert Remus | Josef Ruppenhofer | Ulli Waltinger | Michael Wiegand
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we describe MLSA, a publicly available multi-layered reference corpus for German-language sentiment analysis. The construction of the corpus is based on the manual annotation of 270 German-language sentences considering three different layers of granularity. The sentence-layer annotation, as the most coarse-grained annotation, focuses on aspects of objectivity, subjectivity and the overall polarity of the respective sentences. Layer 2 is concerned with polarity on the word- and phrase-level, annotating both subjective and factual language. The annotations on Layer 3 focus on the expression-level, denoting frames of private states such as objective and direct speech events. These three layers and their respective annotations are intended to be fully independent of each other. At the same time, exploring for and discovering interactions that may exist between different layers should also be possible. The reliability of the respective annotations was assessed using the average pairwise agreement and Fleiss' multi-rater measures. We believe that MLSA is a beneficial resource for sentiment analysis research, algorithms and applications that focus on the German language.

pdf bib
Dependency parsing for interaction detection in pharmacogenomics
Gerold Schneider | Fabio Rinaldi | Simon Clematide
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We give an overview of our approach to the extraction of interactions between pharmacogenomic entities like drugs, genes and diseases and suggest classes of interaction types driven by data from PharmGKB and partly following the top level ontology WordNet and biomedical types from BioNLP. Our text mining approach to the extraction of interactions is based on syntactic analysis. We use syntactic analyses to explore domain events and to suggest a set of interaction labels for the pharmacogenomics domain.

2011

pdf bib
An Incremental Model for the Coreference Resolution Task of BioNLP 2011
Don Tuggener | Manfred Klenner | Gerold Schneider | Simon Clematide | Fabio Rinaldi
Proceedings of BioNLP Shared Task 2011 Workshop