Uwe Quasthoff


2020

Typical Sentences as a Resource for Valence
Uwe Quasthoff | Lars Hellan | Erik Körner | Thomas Eckart | Dirk Goldhahn | Dorothee Beermann
Proceedings of the Twelfth Language Resources and Evaluation Conference

Verb valence information can be derived from corpora by using subcorpora of typical sentences that are constructed in a language-independent manner based on frequent POS structures. The inspection of typical sentences with a fixed verb in a certain position can show the valence information directly. Using verb fingerprints, consisting of the most typical sentence patterns the verb appears in, we are able to identify standard valence patterns and compare them against a language’s valence profile. With a very limited amount of training data per language, valence information for other verbs can be derived as well. Based on the Norwegian valence patterns, we are able to find comparative patterns in German where typical sentences express the same situation in an equivalent way, and can thus construct verb valence pairs for a bilingual PolyVal dictionary. This contribution discusses this application with a focus on the Norwegian valence dictionary NorVal.

Usability and Accessibility of Bantu Language Dictionaries in the Digital Age: Mobile Access in an Open Environment
Thomas Eckart | Sonja Bosch | Uwe Quasthoff | Erik Körner | Dirk Goldhahn | Simon Kaleschke
Proceedings of the first workshop on Resources for African Indigenous Languages

This contribution describes a free and open mobile dictionary app based on open dictionary data. A specific focus is on usability and user-adequate presentation of data. This includes, in addition to the alphabetical lemma ordering, other vocabulary selection, grouping, and access criteria. Beyond search functionality for stems or roots – required due to the morphological complexity of Bantu languages – grouping of lemmas by subject area of varying difficulty allows customization. A dictionary profile defines available presentation options of the dictionary data in the app and can be specified according to the needs of the respective user group. Word embeddings and similar approaches are used to link to semantically similar or related words. The underlying data structure is open for monolingual, bilingual or multilingual dictionaries and also supports the connection to complex external resources like Wordnets. The application in its current state focuses on Xhosa and Zulu dictionary data but more resources will be integrated soon.

2018

Corpora of Typical Sentences
Lydia Müller | Uwe Quasthoff | Maciej Sumalvico
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Preparation and Usage of Xhosa Lexicographical Data for a Multilingual, Federated Environment
Sonja Bosch | Thomas Eckart | Bettina Klimek | Dirk Goldhahn | Uwe Quasthoff
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Construction and Analysis of a Large Vietnamese Text Corpus
Dieu-Thu Le | Uwe Quasthoff
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper presents a new Vietnamese text corpus which contains around 4.05 billion words. It is a collection of Wikipedia texts, newspaper articles and random web texts. The paper describes the process of collecting, cleaning and creating the corpus. Processing Vietnamese texts poses several challenges. For example, unlike many Latin languages, Vietnamese does not use blanks to separate words, so common tokenization approaches that treat blanks as word boundaries do not work. A short review of different approaches to Vietnamese tokenization is presented together with how the corpus has been processed and created. After that, some statistical analysis of this data is reported, including the number of syllables, average word length, sentence length and topic analysis. The corpus is integrated into a framework which allows searching and browsing. Using this web interface, users can find out how many times a particular word appears in the corpus, sample sentences where this word occurs, and its left and right neighbors.

Features for Generic Corpus Querying
Thomas Eckart | Christoph Kuras | Uwe Quasthoff
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The availability of large corpora for more and more languages enforces generic querying and standard interfaces. This development is especially relevant in the context of integrated research environments like CLARIN or DARIAH. The paper focuses on several applications and implementation details on the basis of a unified corpus format, a unique POS tag set, and prepared data for word similarities. All described data or applications are already accessible, or will be in the near future, via well-documented RESTful Web services. The target group includes all kinds of interested persons with varying levels of experience in programming or corpus query languages.

2014

Vocabulary-Based Language Similarity using Web Corpora
Dirk Goldhahn | Uwe Quasthoff
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper will focus on the evaluation of automatic methods for quantifying language similarity. This is achieved by ascribing language similarity to the similarity of text corpora. This corpus similarity will first be determined by the resemblance of the vocabulary of languages. To this end, words or parts of them, such as letter n-grams, are examined. Extensions like transliteration of the text data will ensure the independence of the methods from text characteristics such as the writing system used. Further analyses will show to what extent knowledge about the distribution of words in parallel text can be used in the context of language similarity.
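
As a rough illustration of comparing corpora by letter n-grams, the following minimal sketch counts letter trigrams per word and compares two texts by cosine similarity. The abstract does not state which similarity measure the paper uses, so the cosine measure, the `_` word-boundary padding, and the function names `letter_ngrams` and `cosine_similarity` are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def letter_ngrams(text, n=3):
    """Count letter n-grams inside each whitespace-separated word.

    Words are lowercased and padded with '_' on both sides so that
    word-initial and word-final letter sequences are kept distinct
    (the padding scheme is an assumption of this sketch).
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity of two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling `cosine_similarity(letter_ngrams(text_a), letter_ngrams(text_b))` on samples from two corpora yields a score between 0 and 1; real corpora would need far larger samples than a toy example, and possibly transliteration beforehand, as the abstract notes.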

High Quality Word Lists as a Resource for Multiple Purposes
Uwe Quasthoff | Dirk Goldhahn | Thomas Eckart | Erla Hallsteinsdóttir | Sabine Fiedler
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Since 2011 the comprehensive, electronically available sources of the Leipzig Corpora Collection have been used consistently for the compilation of high quality word lists. The underlying corpora include newspaper texts, Wikipedia articles and other randomly collected Web texts. For many of the languages featured in this collection, it is the first comprehensive compilation to use a large-scale empirical base. The word lists have been used to compile dictionaries with comparable frequency data in the Frequency Dictionaries series. This includes frequency data of up to 1,000,000 word forms presented in alphabetical order. This article provides an introductory description of the data and the methodological approach used. In addition, language-specific statistical information is provided with regard to letters, word structure and structural changes. Such high quality word lists also provide the opportunity to explore comparative linguistic topics and such monolingual issues as studies of word formation and frequency-based examinations of lexical areas for use in dictionaries or language teaching. The results presented here can provide initial suggestions for subsequent work in several areas of research.

A 500 Million Word POS-Tagged Icelandic Corpus
Thomas Eckart | Erla Hallsteinsdóttir | Sigrún Helgadóttir | Uwe Quasthoff | Dirk Goldhahn
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The new POS-tagged Icelandic corpus of the Leipzig Corpora Collection is an extensive resource for the analysis of the Icelandic language. As it contains a large share of all Web documents hosted under the .is top-level domain, it is especially valuable for investigations on modern Icelandic and non-standard language varieties. The corpus is accessible via a dedicated web portal and large shares are available for download. The focus of this paper will be the description of the tagging process and the evaluation of statistical properties like word form frequencies and part of speech tag distributions. In particular, the latter will be compared with values from the Icelandic Frequency Dictionary (IFD) Corpus.

Using Significant Word Co-occurences for the Lexical Access Problem
Rico Feist | Daniel Gerighausen | Manuel Konrad | Georg Richter | Thomas Eckart | Dirk Goldhahn | Uwe Quasthoff
Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex)

2012

Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages
Dirk Goldhahn | Thomas Eckart | Uwe Quasthoff
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The Leipzig Corpora Collection offers free online access to 136 monolingual dictionaries enriched with statistical information. In this paper we describe current advances of the project in collecting and processing text data automatically for a large number of languages. Our main interest lies in languages of “low density”, where only little text data exists online. The aim of this approach is to create monolingual dictionaries and statistical information for a high number of new languages and to expand the existing dictionaries, opening up new possibilities for linguistic typology and other research. The focus of this paper is on the infrastructure for the automatic acquisition of large amounts of monolingual text in many languages from various sources. Preliminary results of the collection of text data will be presented. The mainly language-independent framework for preprocessing, cleaning and creating the corpora and computing the necessary statistics will also be described.

The Influence of Corpus Quality on Statistical Measurements on Language Resources
Thomas Eckart | Uwe Quasthoff | Dirk Goldhahn
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements performed by different researchers on possibly different objects. Hence, the comparison of different values requires an exact description of the measuring process. To illustrate this correlation, the influence of different definitions for the concepts "word" and "sentence" is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality as well. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results especially holds for Web corpora with a large set of pre-processing steps. Here, a well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing and therefore such measurements can help to improve corpus quality.

2010

SentiWS - A Publicly Available German-language Resource for Sentiment Analysis
Robert Remus | Uwe Quasthoff | Gerhard Heyer
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

SentimentWortschatz, or SentiWS for short, is a publicly available German-language resource for sentiment analysis, opinion mining etc. It lists positive and negative sentiment bearing words weighted within the interval of [-1; 1] plus their part of speech tag, and if applicable, their inflections. The current version of SentiWS (v1.8b) contains 1,818 positive and 1,650 negative words, which sum up to 16,406 positive and 16,328 negative word forms, respectively. It not only contains adjectives and adverbs explicitly expressing a sentiment, but also nouns and verbs implicitly containing one. The present work describes the resource’s structure, the three sources utilised to assemble it and the semi-supervised method incorporated to weight the strength of its entries. Furthermore, the resource’s contents are extensively evaluated using a German-language evaluation set we constructed. The evaluation set is verified to be reliable, and it is shown that SentiWS provides a beneficial lexical resource for German-language sentiment analysis related tasks to build on.

Automatic Annotation of Co-Occurrence Relations
Dirk Goldhahn | Uwe Quasthoff
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We introduce a method for automatically labelling edges of word co-occurrence graphs with semantic relations. For this, we only make use of training data already contained within the graph. The starting point of this work is a graph based on word co-occurrence of the German language, which is created by applying iterated co-occurrence analysis. The edges of the graph have been partially annotated by hand with semantic relationships. In our approach we make use of the commonly appearing network motif of three words forming a triangular pattern. We assume that the fully annotated occurrences of these structures contain information useful for our purpose. Based on these patterns, rules for reasoning are learned. The obtained rules are then combined using Dempster-Shafer theory to infer new semantic relations between words. Iteration of the annotation process is possible to increase the number of obtained relations. By applying the described process, the graph can be enriched with semantic information at a high precision.
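
To make the combination step more concrete, here is a minimal sketch of Dempster's rule of combination for two mass functions. It shows only the standard textbook rule; how the paper learns its rules from triangle patterns and turns them into mass functions is not stated in the abstract, so the relation labels and the function name `combine` are hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps frozensets of hypotheses (focal elements)
    to masses that sum to 1. Mass falling on empty intersections is
    treated as conflict and removed by renormalization.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical relation labels, used for illustration only.
frame = frozenset({"synonym", "antonym", "hypernym"})
rule_a = {frozenset({"synonym"}): 0.6, frame: 0.4}
rule_b = {frozenset({"synonym"}): 0.5, frozenset({"antonym"}): 0.2, frame: 0.3}
belief = combine(rule_a, rule_b)
```

In this toy case both sources lean towards the same label, so the combined mass on that label is higher than either source assigned on its own, while the mass left on the whole frame (ignorance) shrinks.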

2008

ASV Toolbox: a Modular Collection of Language Exploration Tools
Chris Biemann | Uwe Quasthoff | Gerhard Heyer | Florian Holz
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

ASV Toolbox is a modular collection of tools for the exploration of written language data both for scientific and educational purposes. It includes modules that operate on word lists or texts and allow users to perform various linguistic annotation, classification and clustering tasks, including language detection, POS-tagging, base form reduction, named entity recognition, and terminology extraction. On a more abstract level, the algorithms deal with various kinds of word similarity, using pattern-based and statistical approaches. The collection can be used to work on large real-world data sets as well as for studying the underlying algorithms. Each module of the ASV Toolbox is designed to work either on plain text files or with a connection to a MySQL database. While it is especially designed to work with corpora of the Leipzig Corpora Collection, it can easily be adapted to other sources.

UnsuParse: unsupervised Parsing with unsupervised Part of Speech Tagging
Christian Hänig | Stefan Bordag | Uwe Quasthoff
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Based on simple methods such as observing word and part of speech tag co-occurrence and clustering, we generate syntactic parses of sentences in an entirely unsupervised and self-inducing manner. The parser learns the structure of the language in question based on measuring “breaking points” within sentences. The learning process is divided into two phases: learning, and application of learned knowledge. The basic learning works in an iterative manner which results in a hierarchical constituent representation of the sentence. Part-of-speech tags are used to circumvent the data sparseness problem for rare words. The algorithm is applied to untagged data, to manually assigned tags and to tags produced by an unsupervised part of speech tagger. The results are unsurpassed by any self-induced parser and challenge the quality of trained parsers with respect to finding certain structures such as noun phrases.

2007

Íslenskur Orðasjóður – Building a Large Icelandic Corpus
Erla Hallsteinsdóttir | Thomas Eckart | Chris Biemann | Uwe Quasthoff | Matthias Richter
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

Corpus Portal for Search in Monolingual Corpora
Uwe Quasthoff | Matthias Richter | Christian Biemann
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

A simple and flexible schema for storing and presenting monolingual language resources is proposed. In this format, data for 18 different languages is already available in various sizes. The data is provided free of charge for online use and download. The main target is to ease the application of algorithms for monolingual and interlingual studies.

Dictionary acquisition using parallel text and co-occurrence statistics
Chris Biemann | Uwe Quasthoff
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)

2004

Linguistic Corpus Search
Christian Biemann | Uwe Quasthoff | Christian Wolff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Automatic Acquisition of Paradigmatic Relations Using Iterated Co-occurrences
Chris Biemann | Stefan Bordag | Uwe Quasthoff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

Web Services for Language Resources and Language Technology Applications
Christian Biemann | Stefan Bordag | Uwe Quasthoff | Christian Wolff
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

Named Entity Learning and Verification: Expectation Maximization in Large Corpora
Uwe Quasthoff | Christian Biemann | Christian Wolff
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

Information Extraction from Text Corpora: Using Filters on Collocation Sets
Gerhard Heyer | Uwe Quasthoff | Christian Wolff
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

A Flexible Infrastructure for Large Monolingual Corpora
Uwe Quasthoff | Christian Wolff
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)