2021
Compound or Term Features? Analyzing Salience in Predicting the Difficulty of German Noun Compounds across Domains
Anna Hätty | Julia Bettinger | Michael Dorna | Jonas Kuhn | Sabine Schulte im Walde
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics
Predicting the difficulty of domain-specific vocabulary is an important step towards a better understanding of a domain and towards better communication between lay people and experts. We investigate German closed noun compounds and focus on the interaction of compound-based lexical features (such as frequency and productivity) and terminology-based features (contrasting domain-specific and general language) across word representations and classifiers. Our prediction experiments complement insights from classification using (a) manually designed features to characterise termhood and compound formation and (b) compound and constituent word embeddings. We find that for a broad binary distinction into ‘easy’ vs. ‘difficult’, general-language compound frequency is sufficient, but for a more fine-grained four-class distinction it is crucial to include contrastive termhood features as well as compound and constituent features.
2020
Predicting Degrees of Technicality in Automatic Terminology Extraction
Anna Hätty | Dominik Schlechtweg | Michael Dorna | Sabine Schulte im Walde
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
While automatic term extraction is a well-researched area, computational approaches to distinguish between degrees of technicality are still understudied. We semi-automatically create a German gold standard of technicality across four domains, and illustrate the impact of a web-crawled general-language corpus on technicality prediction. When defining a classification approach that combines general-language and domain-specific word embeddings, we go beyond previous work and align vector spaces to gain comparative embeddings. We suggest two novel models to exploit general- vs. domain-specific comparisons: a simple neural network model with pre-computed comparative-embedding information as input, and a multi-channel model computing the comparison internally. Both models outperform previous approaches, with the multi-channel model performing best.
A Domain-Specific Dataset of Difficulty Ratings for German Noun Compounds in the Domains DIY, Cooking and Automotive
Julia Bettinger | Anna Hätty | Michael Dorna | Sabine Schulte im Walde
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present a dataset with difficulty ratings for 1,030 German closed noun compounds extracted from domain-specific texts for do-it-yourself (DIY), cooking and automotive. The dataset includes two-part compounds for cooking and DIY, and two- to four-part compounds for automotive. The compounds were identified in text using the Simple Compound Splitter (Weller-Di Marco, 2017); a subset was filtered and balanced for frequency and productivity criteria as a basis for manual annotation and fine-grained interpretation. This study presents the creation, the final dataset with ratings from 20 annotators and statistics over the dataset, to provide insight into the perception of domain-specific term difficulty. It is particularly striking that annotators agree on a coarse, binary distinction between easy and difficult domain-specific compounds but that a more fine-grained distinction of difficulty is not meaningful. We finally discuss the challenges of an annotation for difficulty, which include both the task description and the selection of the data basis.
2017
Evaluating the Reliability and Interaction of Recursively Used Feature Classes for Terminology Extraction
Anna Hätty | Michael Dorna | Sabine Schulte im Walde
Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics
Feature design and selection is a crucial aspect when treating terminology extraction as a machine learning classification problem. We design feature classes which characterize different properties of terms based on distributions, and propose a new feature class for components of term candidates. By using random forests, we infer optimal features which are later used to build decision tree classifiers. We evaluate our method using the ACL RD-TEC dataset. We demonstrate the importance of the novel feature class for downgrading termhood, which exploits properties of term components. Furthermore, our classification suggests that the identification of reliable term candidates should be performed successively, rather than just once.
2016
Acquisition of semantic relations between terms: how far can we get with standard NLP tools?
Ina Roesiger | Julia Bettinger | Johannes Schäfer | Michael Dorna | Ulrich Heid
Proceedings of the 5th International Workshop on Computational Terminology (Computerm2016)
The extraction of data exemplifying relations between terms can make use, at least to a large extent, of techniques that are similar to those used in standard hybrid term candidate extraction, namely basic corpus analysis tools (e.g. tagging, lemmatization, parsing), as well as morphological analysis of complex words (compounds and derived items). In this article, we discuss the use of such techniques for the extraction of raw material for a description of relations between terms, and we provide internal evaluation data for the devices developed. We claim that user-generated content is a rich source of term variation through paraphrasing and reformulation, and that these provide relational data at the same time as term variants. Germanic languages with their rich word formation morphology may be particularly good candidates for the approach advocated here.
1998
Quality and robustness in MT—A balancing act
Bianka Buschbeck-Wolf | Michael Dorna
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers
The speech-to-speech translation system Verbmobil integrates deep and shallow analysis modules that produce linguistic representations in parallel. Thus, the input representations for the transfer module differ with respect to their depth and quality. This gives rise to two problems: (i) the transfer database has to be adjusted according to input quality, and (ii) translations produced have to be ranked with respect to their quality in order to select the most appropriate result. This paper presents an operationalized solution to both problems.
Managing information at linguistic interfaces
Johan Bos | C.J. Rupp | Bianka Buschbeck-Wolf | Michael Dorna
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics
Syntactic and Semantic Transfer with F-Structures
Michael Dorna | Anette Frank | Josef van Genabith | Martin C. Emele
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics
Ambiguity Preserving Machine Translation using Packed Representations
Martin C. Emele | Michael Dorna
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics
Managing Information at Linguistic Interfaces
Johan Bos | C.J. Rupp | Bianka Buschbeck-Wolf | Michael Dorna
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1
Syntactic and Semantic Transfer with F-Structures
Michael Dorna | Anette Frank | Josef van Genabith | Martin C. Emele
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1
Ambiguity Preserving Machine Translation using Packed Representations
Martin C. Emele | Michael Dorna
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1
1996
Semantic-based Transfer
Michael Dorna | Martin C. Emele
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics