Martin Riedl
2022
Data Augmentation for Intent Classification of German Conversational Agents in the Finance Domain
Sophie Rentschler | Martin Riedl | Christian Stab | Martin Rückert
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)
2019
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)
Dmitry Ustalov | Swapna Somasundaran | Peter Jansen | Goran Glavaš | Martin Riedl | Mihai Surdeanu | Michalis Vazirgiannis
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)
Clustering-Based Article Identification in Historical Newspapers
Martin Riedl | Daniela Betz | Sebastian Padó
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
This article focuses on the problem of identifying articles and recovering their text from within and across newspaper pages when OCR delivers just one text file per page. We frame the task as a segmentation step followed by a clustering step. Our results on a sample of the 1912 New York Tribune magazine show that performing the clustering based on similarities computed with word embeddings outperforms a similarity measure based on character n-grams and words. Furthermore, automatic segmentation based on the text alone yields low scores, due to the low quality of some OCRed documents.
2018
Using Semantics for Granularities of Tokenization
Martin Riedl | Chris Biemann
Computational Linguistics, Volume 44, Issue 3 - September 2018
Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to the identification of meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE), as well as splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is that the methods are primarily based on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural-based distributional semantic model. First, we introduce DRUID, a method for detecting MWEs. The evaluation on MWE-annotated data sets in two languages and newly extracted evaluation data sets for 32 languages shows that DRUID compares favorably to previous methods that do not utilize distributional information. Second, we present SECOS, an algorithm for decompounding closed compounds. In an evaluation on four dedicated decompounding data sets across four languages and on data sets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods. In a final experiment, we show how both decompounding and MWE information can be used in information retrieval. Here, we obtain the best results when combining word information with MWEs and compound parts in a bag-of-words retrieval set-up. Overall, our methodology paves the way to automatic detection of lexical units beyond standard tokenization techniques without language-specific preprocessing steps such as POS tagging.
Document-based Recommender System for Job Postings using Dense Representations
Ahmed Elsafty | Martin Riedl | Chris Biemann
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
Job boards and professional social networks heavily use recommender systems in order to better support users in exploring job advertisements. Detecting the similarity between job advertisements is important for job recommendation systems, as it allows, for example, the application of item-to-item based recommendations. In this work, we investigate the use of dense vector representations to enhance a large-scale job recommendation system and to rank German job advertisements by their similarity. We follow a two-fold evaluation scheme: (1) we exploit historic user interactions to automatically create a dataset of similar jobs that enables an offline evaluation. (2) In addition, we conduct an online A/B test and evaluate the best-performing method on our platform, which reaches more than 1 million users. We achieve the best results by combining job titles with full-text job descriptions. In particular, this method builds dense document representations using words of the titles to weigh the importance of words in the full-text description. In the online evaluation, this approach allows us to increase the click-through rate on job recommendations for active users by 8.0%.
A Named Entity Recognition Shootout for German
Martin Riedl | Sebastian Padó
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We ask how to practically build a model for German named entity recognition (NER) that performs at the state of the art for both contemporary and historical texts, i.e., a big-data and a small-data scenario. The two best-performing model families, linear-chain CRFs and BiLSTMs, are pitted against each other to observe the trade-off between expressiveness and data requirements. The BiLSTM outperforms the CRF when large datasets are available but performs worse on the smallest dataset. BiLSTMs profit substantially from transfer learning, which enables them to be trained on multiple corpora, resulting in a new state-of-the-art model for German NER on two contemporary German corpora (CoNLL 2003 and GermEval 2014) and two historic corpora.
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)
Goran Glavaš | Swapna Somasundaran | Martin Riedl | Eduard Hovy
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)
2017
CWIG3G2 - Complex Word Identification Task across Three Text Genres and Two User Groups
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Complex word identification (CWI) is an important task in text accessibility. However, due to the scarcity of CWI datasets, previous studies have only addressed this problem on Wikipedia sentences and have solely taken into account the needs of non-native English speakers. We collect a new CWI dataset (CWIG3G2) covering three text genres (News, WikiNews, and Wikipedia) annotated by both native and non-native English speakers. Unlike previous datasets, we cover single words as well as complex phrases, and present them for judgment in a paragraph context. We present the first study on cross-genre and cross-group CWI, showing measurable influences of native language and genre type.
Multilingual and Cross-Lingual Complex Word Identification
Seid Muhie Yimam | Sanja Štajner | Martin Riedl | Chris Biemann
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility. Due to the lack of CWI datasets, previous works largely depend on Simple English Wikipedia and edit histories for obtaining ‘gold standard’ annotations, which are of doubtful quality and limited to English. We collect complex words/phrases (CP) for English, German, and Spanish, annotated by both native and non-native speakers, and propose language-independent features that can be used to train multilingual and cross-lingual CWI models. We show that the performance of cross-lingual CWI systems (using a model trained on one language and applying it to the other languages) is comparable to the performance of monolingual CWI systems.
Replacing OOV Words For Dependency Parsing With Distributional Semantics
Prasanth Kolachina | Martin Riedl | Chris Biemann
Proceedings of the 21st Nordic Conference on Computational Linguistics
Using Pseudowords for Algorithm Comparison: An Evaluation Framework for Graph-based Word Sense Induction
Flavio Massimiliano Cecchini | Chris Biemann | Martin Riedl
Proceedings of the 21st Nordic Conference on Computational Linguistics
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing
Martin Riedl | Swapna Somasundaran | Goran Glavaš | Eduard Hovy
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing
There’s no ‘Count or Predict’ but task-based selection for distributional models
Martin Riedl | Chris Biemann
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers
2016
Unsupervised Compound Splitting With Distributional Semantics Rivals Supervised Methods
Martin Riedl | Chris Biemann
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Language Processing
Tanmoy Chakraborty | Martin Riedl | V. G. Vinod Vydiswaran
Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Language Processing
Learning Paraphrasing for Multiword Expressions
Seid Muhie Yimam | Héctor Martínez Alonso | Martin Riedl | Chris Biemann
Proceedings of the 12th Workshop on Multiword Expressions
Impact of MWE Resources on Multiword Recognition
Martin Riedl | Chris Biemann
Proceedings of the 12th Workshop on Multiword Expressions
2015
A Single Word is not Enough: Ranking Multiword Expressions Using Distributional Semantics
Martin Riedl | Chris Biemann
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
JoBimViz: A Web-based Visualization for Graph-based Distributional Semantic Models
Eugen Ruppert | Manuel Kaufmann | Martin Riedl | Chris Biemann
Proceedings of ACL-IJCNLP 2015 System Demonstrations
Distributional Semantics for Resolving Bridging Mentions
Tim Feuerbach | Martin Riedl | Chris Biemann
Proceedings of the International Conference Recent Advances in Natural Language Processing
2014
Combining Supervised and Unsupervised Parsing for Distributional Similarity
Martin Riedl | Irina Alles | Chris Biemann
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers
Lexical Substitution for the Medical Domain
Martin Riedl | Michael Glass | Alfio Gliozzo
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Distributed Distributional Similarities of Google Books Over the Centuries
Martin Riedl | Richard Steuer | Chris Biemann
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis leads to much better results than using smaller corpora like Wikipedia. We also provide distributional thesauri for equal-sized time slices of the corpus. While distributional thesauri can be used as lexical resources in NLP tasks, comparing word similarities over time can unveil sense changes of terms across different decades or centuries, and can serve as a resource for diachronic lexicography. Thesauri and clusters are available for download.
That’s sick dude!: Automatic identification of word sense change across different timescales
Sunny Mitra | Ritwik Mitra | Martin Riedl | Chris Biemann | Animesh Mukherjee | Pawan Goyal
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2013
Scaling to Large³ Data: An Efficient and Effective Method to Compute Distributional Thesauri
Martin Riedl | Chris Biemann
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
Exploring Cities in Crime: Significant Concordance and Co-occurrence in Quantitative Literary Analysis
Janneke Rauscher | Leonard Swiezinski | Martin Riedl | Chris Biemann
Proceedings of the Workshop on Computational Linguistics for Literature
JoBimText Visualizer: A Graph-based Approach to Contextualizing Distributional Similarity
Chris Biemann | Bonaventura Coppola | Michael R. Glass | Alfio Gliozzo | Matthew Hatem | Martin Riedl
Proceedings of TextGraphs-8: Graph-based Methods for Natural Language Processing
From Global to Local Similarities: A Graph-Based Contextualization Method using Distributional Thesauri
Martin Riedl | Chris Biemann
Proceedings of TextGraphs-8: Graph-based Methods for Natural Language Processing
2012
How Text Segmentation Algorithms Gain from Topic Models
Martin Riedl | Chris Biemann
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Co-authors
- Chris Biemann 23
- Goran Glavaš 3
- Swapna Somasundaran 3
- Seid Muhie Yimam 3
- Michael Glass 2
- Alfio Gliozzo 2
- Eduard Hovy 2
- Sebastian Padó 2
- Sanja Štajner 2
- Irina Alles 1
- Daniela Betz 1
- Tanmoy Chakraborty 1
- Bonaventura Coppola 1
- Ahmed Elsafty 1
- Tim Feuerbach 1
- Pawan Goyal 1
- Matthew Hatem 1
- Peter Jansen 1
- Manuel Kaufmann 1
- Prasanth Kolachina 1
- Héctor Martínez Alonso 1
- Flavio Massimiliano Cecchini 1
- Sunny Mitra 1
- Ritwik Mitra 1
- Animesh Mukherjee 1
- Janneke Rauscher 1
- Sophie Rentschler 1
- Eugen Ruppert 1
- Martin Rückert 1
- Christian Stab 1
- Richard Steuer 1
- Mihai Surdeanu 1
- Leonard Swiezinski 1
- Dmitry Ustalov 1
- Michalis Vazirgiannis 1
- V. G. Vinod Vydiswaran 1