Fritz Hohl
2026
An improved Code-Switching Detection System for some Indic Languages
Karan Bhanushali | Fritz Hohl
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
Code-switching is a common feature of multilingual communication, and reliably identifying where the language switches is essential for downstream tasks such as generating code-switched machine translations. This paper introduces CSDI, a Code-Switching Detection (CSD) system for Indic text, which jointly learns CSD, Named Entity Recognition, and Part-of-Speech tagging through a shared encoder. Leveraging multitask learning, CSDI captures linguistic cues that signal switching boundaries and achieves a new state-of-the-art macro-F1 score with near-zero ΔCMI across six Indic languages. The model also demonstrates strong cross-lingual transfer, effectively leveraging high-resource languages to improve low-resource performance. Despite challenges such as intra-word code-mixing and limited token-level context, CSDI establishes a new baseline for scalable, low-resource NLP research in code-mixed environments.
2023
VarDial in the Wild: Industrial Applications of LID Systems for Closely-Related Language Varieties
Fritz Hohl | Soh-eun Shim
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
This report first describes an industrial use case for identifying closely related languages, e.g. dialects, namely the detection of the languages of movie subtitle documents. We then present a 2-stage architecture that detects macrolanguages in the first stage and language variants in the second. Using our architecture, we participated in the DSL-TL Shared Task of the VarDial 2023 workshop, and we describe the results of our experiments. In the first experiment we report an accuracy of 97.8% on a set of 460 subtitle files. In our second experiment we used DSL-TL data and achieved a macro-average F1 of 76% for the binary task, and 54% for the three-way task on the dev set. In the open track, we augmented the data with named entities retrieved from Wikidata and achieved minor increases of about 1% for both tracks.