Scaling Neural ITN for Numbers and Temporal Expressions in Tamil: Findings for an Agglutinative Low-resource Language

Bhavuk Singhal, Sindhuja Gopalan, Amrith Krishna, Malolan Chetlur


Abstract
ITN involves rewriting the verbalised form of text from spoken transcripts to its corresponding written form. The task inherently expects challenges in identifying ITN entries due to spelling variations in words arising out of dialects, transcription errors etc. Additionally, in Tamil, word boundaries between adjacent words in a sentence often get obscured due to Punarchi, i.e. phonetic transformation of these boundaries. Being morphologically rich, the words in Tamil show a high degree of agglutination due to inflection and clitics. The combination of such factors leads to a high degree of surface-form variations, making scalability with pure rule-based approaches difficult. Instead, we experiment with fine-tuning three pre-trained neural LMs, consisting of a seq2seq model (s2s), a non-autoregressive text editor (NAR) and a sequence tagger + rules combination (tagger). While the tagger approach works best in a fully-supervised setting, s2s performs the best (98.05 F-Score) when augmented with additional data, via bootstrapping and data augmentation (DA&B). S2S reports a cumulative percentage improvement of 20.1 %, and statistically significant gains for all our models with DA&B. Compared to a fully supervised setup, bootstrapping alone reports a percentage improvement as high as 14.12 %, even with a small seed set of 324 ITN entries.
Anthology ID:
2023.emnlp-industry.42
Volume:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
December
Year:
2023
Address:
Singapore
Editors:
Mingxuan Wang, Imed Zitouni
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
440–450
Language:
URL:
https://aclanthology.org/2023.emnlp-industry.42
DOI:
10.18653/v1/2023.emnlp-industry.42
Bibkey:
Cite (ACL):
Bhavuk Singhal, Sindhuja Gopalan, Amrith Krishna, and Malolan Chetlur. 2023. Scaling Neural ITN for Numbers and Temporal Expressions in Tamil: Findings for an Agglutinative Low-resource Language. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 440–450, Singapore. Association for Computational Linguistics.
Cite (Informal):
Scaling Neural ITN for Numbers and Temporal Expressions in Tamil: Findings for an Agglutinative Low-resource Language (Singhal et al., EMNLP 2023)
Copy Citation:
PDF:
https://aclanthology.org/2023.emnlp-industry.42.pdf
Video:
 https://aclanthology.org/2023.emnlp-industry.42.mp4