Thomas Krämer


2025

Creating and enriching a repository of 177k interlinearized examples in 1611 mostly lesser-resourced languages
Sebastian Nordhoff | Thomas Krämer
Proceedings of the 5th Conference on Language, Data and Knowledge

Much of NLP is concerned with languages for which dictionaries, thesauri, wordnets or treebanks are available. This contribution focuses on languages for which all we have might be some isolated examples with word-to-word translation. We detail the collection, aggregation, storage and querying of this database of 177k examples from 1611 languages, with particular attention to enrichment via Named Entity Recognition and links to the Wikidata ontology. We also discuss pitfalls of the approach and the legal status of interlinear examples.

2022

IMTVault: Extracting and Enriching Low-resource Language Interlinear Glossed Text from Grammatical Descriptions and Typological Survey Articles
Sebastian Nordhoff | Thomas Krämer
Proceedings of the 8th Workshop on Linked Data in Linguistics within the 13th Language Resources and Evaluation Conference

Many NLP resources and programs focus on a handful of major languages, but there are thousands of languages with few or no resources available as structured data. This paper describes the extraction of 40k examples with interlinear morpheme translation in 280 different languages from LaTeX-based publications of the open access publisher Language Science Press. These examples are transformed into Linked Data. We use LIGT for modelling and enrich the data with Wikidata and Glottolog. The data is made available as HTML, JSON, JSON-LD and N-quads, and query facilities for humans (Elasticsearch) and machines (API) are provided.
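
To illustrate the kind of extraction the abstract describes, here is a minimal sketch of how one interlinear example might be pulled out of LaTeX source and turned into a structured record. It assumes the example uses the common `\gll ... \\ ... \\ \glt ...` layout; the function name `parse_gll`, the output field names and the German sample sentence are illustrative assumptions, not taken from IMTVault, and the output is plain JSON rather than the LIGT model used in the paper.

```python
import json
import re

# Assumed layout: \gll <source line>\\ <gloss line>\\ \glt <free translation>
GLL_PATTERN = re.compile(
    r"\\gll\s+(?P<source>.*?)\\\\\s*"
    r"(?P<gloss>.*?)\\\\\s*"
    r"\\glt\s+(?P<translation>.*)",
    re.DOTALL,
)


def parse_gll(latex: str) -> dict:
    """Extract one interlinear example and align source words with glosses."""
    match = GLL_PATTERN.search(latex)
    if match is None:
        raise ValueError("no \\gll ... \\glt example found")
    source_words = match.group("source").split()
    gloss_words = match.group("gloss").split()
    # Interlinear text is word-aligned, so both lines need the same length.
    if len(source_words) != len(gloss_words):
        raise ValueError("source and gloss lines have different word counts")
    # Drop surrounding whitespace and the LaTeX-style quote marks.
    translation = match.group("translation").strip().strip("`'")
    return {
        "words": [
            {"form": form, "gloss": gloss}
            for form, gloss in zip(source_words, gloss_words)
        ],
        "translation": translation,
    }


# Hypothetical sample input, written for this sketch only.
SAMPLE = r"""
\gll Der Hund schläf-t\\
the dog sleep-3SG\\
\glt `The dog sleeps.'
"""

record = parse_gll(SAMPLE)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Real grammatical descriptions contain nested macros, footnotes and formatting commands inside examples, so a production pipeline would need considerably more robust handling than this single regular expression.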