Tanja Gaustad


2024

The First Universal Dependency Treebank for Tswana: Tswana-Popapolelo
Tanja Gaustad | Ansu Berg | Rigardt Pretorius | Roald Eiselen
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

This paper presents the first publicly available UD treebank for Tswana, Tswana-Popapolelo. The data consists of the 20 Cairo CICLing sentences translated into Tswana. After pre-processing these sentences with detailed POS tags (XPOS) and converting them to universal POS tags (UPOS), we annotated the data with dependency relations, documenting decisions for language-specific constructions. Linguistic issues encountered are described in detail, as this is the first application of the UD framework to produce a dependency treebank for the Bantu language family in general and for Tswana specifically.

2023

Deep learning and low-resource languages: How much data is enough? A case study of three linguistically distinct South African languages
Roald Eiselen | Tanja Gaustad
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

In this paper we present a case study of three under-resourced, linguistically distinct South African languages (Afrikaans, isiZulu, and Sesotho sa Leboa) to investigate the influence of data size and the linguistic nature of a language on the performance of different embedding types. Our experimental setup consists of training embeddings on increasing amounts of data and then evaluating the impact of data size on the downstream task of part-of-speech tagging. We find that relatively little data can produce useful representations for this specific task for all three languages. Our analysis also shows that the influence of linguistic and orthographic differences between languages should not be underestimated: morphologically complex, conjunctively written languages (isiZulu in our case) need substantially more data to achieve good results, while disjunctively written languages require substantially less data. This holds not only for the data used to train the embedding model, but also for the annotated training material for the task at hand. It is therefore imperative to know the characteristics of the language you are working on in order to make linguistically informed choices about the amount of data and the type of embeddings to use.

2007

TAT: An Author Profiling Tool with Application to Arabic Emails
Dominique Estival | Tanja Gaustad | Son Bao Pham | Will Radford | Ben Hutchinson
Proceedings of the Australasian Language Technology Workshop 2007

2004

A Lemma-Based Approach to a Maximum Entropy Word Sense Disambiguation System for Dutch
Tanja Gaustad
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2003

The importance of high-quality input for WSD: an application-oriented comparison of part-of-speech taggers
Tanja Gaustad
Proceedings of the Australasian Language Technology Workshop 2003