Oliver Hellwig

2025

pdf bib
Computational Sanskrit and Digital Humanities - World Sanskrit Conference 2025
Amba Kulkarni | Oliver Hellwig
Computational Sanskrit and Digital Humanities - World Sanskrit Conference 2025

2024

pdf bib abs
One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks
Sebastian Nehrdich | Oliver Hellwig | Kurt Keutzer
Findings of the Association for Computational Linguistics: EMNLP 2024

Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.

pdf bib abs
The Vedic Compound Dataset
Sven Sellmer | Oliver Hellwig
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024

This paper introduces the Vedic Compound Dataset (VCD), the first resource providing annotated compounds from Vedic Sanskrit, a South Asian Indo-European language used from ca. 1500 to 500 BCE. The VCD aims at facilitating the study of language change in early Indo-Iranian and offers comparative material for quantitative cross-linguistic research on compounds. The process of annotating Vedic compounds is complex as they contain five of the six basic types of compounds defined by Scalise & Bisetto (2005), which are, however, not consistently marked in morphosyntax, making their automatic classification a significant challenge. The paper details the process of collecting and preprocessing the relevant data, with a particular focus on the question of how to distinguish exocentric from endocentric usage. It further discusses experiments with a simple ML classifier that uses compound internal syntactic relations, outlines the composition of the dataset, and sketches directions for future research.

2023

pdf bib abs
Hedging in diachrony: the case of Vedic Sanskrit iva
Erica Biagetti | Oliver Hellwig | Sven Sellmer
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

The rhetoric strategy of hedging serves to attenuate speech acts and their semantic content, as in English ‘kind of’ or ‘somehow’. While hedging has recently met with increasing interest in linguistic research, most studies deal with modern languages, preferably English, and take a synchronic approach. This paper complements this research by tracing the diachronic syntactic flexibilization of the Vedic Sanskrit particle iva from a marker of comparison (‘like’) to a full-fledged adaptor. We discuss the outcomes of a diachronic Bayesian framework applied to iva constructions in a Universal Dependencies treebank, and supplement these results with a qualitative discussion of relevant text passages.

pdf bib
Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference
Amba Kulkarni | Oliver Hellwig
Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference

pdf bib
The Vedic corpus as a graph. An updated version of Bloomfields Vedic Concordance
Oliver Hellwig | Sven Sellmer | Kyoko Amano
Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference

2022

pdf bib abs
Detecting Diachronic Syntactic Developments in Presence of Bias Terms
Oliver Hellwig | Sven Sellmer
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

Corpus-based studies of diachronic syntactic changes are typically guided by the results of previous qualitative research. When such results are missing or, as is the case for Vedic Sanskrit, are restricted to small parts of a transmitted corpus, an exploratory framework that detects such changes in a data-driven fashion can substantially support the research process. In this paper, we introduce a customized version of the infinite relational model that groups syntactic constituents based on their structural similarities and their diachronic distributions. We propose a simple way to control for register and intellectual affiliation, and discuss our findings for four syntactic structures in Vedic texts.

pdf bib abs
Accurate Dependency Parsing and Tagging of Latin
Sebastian Nehrdich | Oliver Hellwig
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

Having access to high-quality grammatical annotations is important for downstream tasks in NLP as well as for corpus-based research. In this paper, we describe experiments with the Latin BERT word embeddings that were recently be made available by Bamman and Burns (2020). We show that these embeddings produce competitive results in the low-level task morpho-syntactic tagging. In addition, we describe a graph-based dependency parser that is trained with these embeddings and that clearly outperforms various baselines.

2020

pdf bib abs
The Treebank of Vedic Sanskrit
Oliver Hellwig | Salvatore Scarlata | Elia Ackermann | Paul Widmer
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper introduces the first treebank of Vedic Sanskrit, a morphologically rich ancient Indian language that is of central importance for linguistic and historical research. The selection of the more than 3,700 sentences contained in this treebank reflects the development of metrical and prose texts over a period of 600 years. We discuss how these sentences are annotated in the Universal Dependencies scheme and which syntactic constructions required special attention. In addition, we describe a syntactic labeler based on neural networks that supports the initial annotation of the treebank, and whose evaluation can be helpful for setting up a full syntactic parser of Vedic Sanskrit.

pdf bib abs
Dating and Stratifying a Historical Corpus with a Bayesian Mixture Model
Oliver Hellwig
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

This paper introduces and evaluates a Bayesian mixture model that is designed for dating texts based on the distributions of linguistic features. The model is applied to the corpus of Vedic Sanskrit the historical structure of which is still unclear in many details. The evaluation concentrates on the interaction between time, genre and linguistic features, detecting those whose distributions are clearly coupled with the historical time. The evaluation also highlights the problems that arise when quantitative results need to be reconciled with philological insights.

pdf bib abs
Evaluating Neural Morphological Taggers for Sanskrit
Ashim Gupta | Amrith Krishna | Pawan Goyal | Oliver Hellwig
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Neural sequence labelling approaches have achieved state of the art results in morphological tagging. We evaluate the efficacy of four standard sequence labelling models on Sanskrit, a morphologically rich, fusional Indian language. As its label space can theoretically contain more than 40,000 labels, systems that explicitly model the internal structure of a label are more suited for the task, because of their ability to generalise to labels not seen during training. We find that although some neural models perform better than others, one of the common causes for error for all of these models is mispredictions due to syncretism.

2018

pdf bib abs
Sanskrit Word Segmentation Using Character-level Recurrent and Convolutional Neural Networks
Oliver Hellwig | Sebastian Nehrdich
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

The paper introduces end-to-end neural network models that tokenize Sanskrit by jointly splitting compounds and resolving phonetic merges (Sandhi). Tokenization of Sanskrit depends on local phonetic and distant semantic features that are incorporated using convolutional and recurrent elements. Contrary to most previous systems, our models do not require feature engineering or extern linguistic resources, but operate solely on parallel versions of raw and segmented text. The models discussed in this paper clearly improve over previous approaches to Sanskrit word segmentation. As they are language agnostic, we will demonstrate that they also outperform the state of the art for the related task of German compound splitting.

pdf bib
Multi-layer Annotation of the Rigveda
Oliver Hellwig | Heinrich Hettrich | Ashutosh Modi | Manfred Pinkal
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
AET: Web-based Adjective Exploration Tool for German
Tatiana Bladier | Esther Seyffarth | Oliver Hellwig | Wiebke Petersen
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Coarse Semantic Classification of Rare Nouns Using Cross-Lingual Data and Recurrent Neural Networks
Oliver Hellwig
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Long papers

pdf bib
Unsupervised Induction of Compositional Types for English Adjective-Noun Pairs
Wiebke Petersen | Oliver Hellwig
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers

2016

pdf bib abs
Detecting Sentence Boundaries in Sanskrit Texts
Oliver Hellwig
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The paper applies a deep recurrent neural network to the task of sentence boundary detection in Sanskrit, an important, yet underresourced ancient Indian language. The deep learning approach improves the F scores set by a metrical baseline and by a Conditional Random Field classifier by more than 10%.

pdf bib abs
Exploring the value space of attributes: Unsupervised bidirectional clustering of adjectives in German
Wiebke Petersen | Oliver Hellwig
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The paper presents an iterative bidirectional clustering of adjectives and nouns based on a co-occurrence matrix. The clustering method combines a Vector Space Models (VSM) and the results of a Latent Dirichlet Allocation (LDA), whose results are merged in each iterative step. The aim is to derive a clustering of German adjectives that reflects latent semantic classes of adjectives, and that can be used to induce frame-based representations of nouns in a later step. We are able to show that the method induces meaningful groups of adjectives, and that it outperforms a baseline k-means algorithm.

pdf bib abs
Improving the Morphological Analysis of Classical Sanskrit
Oliver Hellwig
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

The paper describes a new tagset for the morphological disambiguation of Sanskrit, and compares the accuracy of two machine learning methods (Conditional Random Fields, deep recurrent neural networks) for this task, with a special focus on how to model the lexicographic information. It reports a significant improvement over previously published results.

2010

pdf bib abs
Using NLP Methods for the Analysis of Rituals
Nils Reiter | Oliver Hellwig | Anand Mishra | Anette Frank | Jens Burkhardt
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper gives an overview of an interdisciplinary research project that is concerned with the application of computational linguistics methods to the analysis of the structure and variance of rituals, as investigated in ritual science. We present motivation and prospects of a computational approach to ritual research, and explain the choice of specific analysis techniques. We discuss design decisions for data collection and processing and present the general NLP architecture. For the analysis of ritual descriptions, we apply the frame semantics paradigm with newly invented frames where appropriate. Using scientific ritual research literature, we experimented with several techniques of automatic extraction of domain terms for the domain of rituals. As ritual research is a highly interdisciplinary endeavour, a vocabulary common to all sub-areas of ritual research can is hard to specify and highly controversial. The domain terms extracted from ritual research literature are used as a basis for a common vocabulary and thus help the creation of ritual specific frames. We applied the tf*idf, 2 and PageRank algorithm to our ritual research literature corpus and two non-domain corpora: The British National Corpus and the British Academic Written English corpus. All corpora have been part of speech tagged and lemmatized. The domain terms have been evaluated by two ritual experts independently. Interestingly, the results of the algorithms were different for different parts of speech. This finding is in line with the fact that the inter-annotator agreement also differs between parts of speech.