Punctuation and case restoration in code mixed Indian languages

Subhashree Tripathy, Ashis Samal


Abstract
Automatic Speech Recognition (ASR) systems are increasingly used across industries, from producing video subtitles to powering interactive digital assistants. ASR output can be used for automatic indexing, categorization, and search, in addition to being read by humans. Raw ASR transcripts are difficult to interpret because they usually lack punctuation and case information (all lower, all upper, camel case, etc.), which limits the performance of downstream NLP tasks. We propose an approach to restore punctuation and case for both English and Hinglish (i.e., Hindi vocabulary in Latin script). We frame the problem as a classification task and use an encoder-based transformer, a mini BERT with 4 encoder layers, instead of a traditional Seq2Seq model, given the latency constraints of real-world use cases. The model predicts 15 distinct classes, which cover 5 punctuation marks, i.e., period (.), comma (,), single quote ('), double quote ("), and question mark (?), in different combinations with casing. The model is benchmarked on an internal dataset based on user conversations with a voice assistant, and it achieves a macro F1 score of 91.52% on the test set.
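As a rough illustration of how joint punctuation and case labels can be turned back into readable text, the sketch below applies one predicted label per token. The `CASE+PUNCT` label format, the label names, and the `restore` helper are assumptions made for this example; the abstract does not list the 15 classes, and quote marks are left out here because their placement is not described.

```python
# Hypothetical label format: "<CASE>+<PUNCT>", one label per input token.
# The names below are illustrative assumptions, not the paper's label set.
CASE_FUNCS = {
    "LOWER": str.lower,       # keep the token in lower case
    "UPPER": str.upper,       # upper-case the whole token
    "CAP": str.capitalize,    # upper-case only the first character
}
PUNCT_MARKS = {
    "NONE": "",
    "PERIOD": ".",
    "COMMA": ",",
    "QUESTION": "?",
}


def restore(tokens, labels):
    """Apply one CASE+PUNCT label to each token and join the results."""
    if len(tokens) != len(labels):
        raise ValueError("need exactly one label per token")
    restored = []
    for token, label in zip(tokens, labels):
        case, punct = label.split("+")
        # Punctuation is appended after the re-cased token.
        restored.append(CASE_FUNCS[case](token) + PUNCT_MARKS[punct])
    return " ".join(restored)


if __name__ == "__main__":
    # Labels would normally come from a token classifier; here they are hand-written.
    tokens = "kya aap mujhe weather bata sakte ho".split()
    labels = ["CAP+NONE"] + ["LOWER+NONE"] * 5 + ["LOWER+QUESTION"]
    print(restore(tokens, labels))
```

Treating each combination as a single class keeps decoding to one lookup per token, which fits the latency motivation stated in the abstract better than generating the restored text token by token.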
Anthology ID:
2022.umios-1.9
Volume:
Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS)
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates (Hybrid)
Editors:
Wenjuan Han, Zilong Zheng, Zhouhan Lin, Lifeng Jin, Yikang Shen, Yoon Kim, Kewei Tu
Venue:
UM-IoS
Publisher:
Association for Computational Linguistics
Pages:
82–86
URL:
https://aclanthology.org/2022.umios-1.9
DOI:
10.18653/v1/2022.umios-1.9
Cite (ACL):
Subhashree Tripathy and Ashis Samal. 2022. Punctuation and case restoration in code mixed Indian languages. In Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures (UM-IoS), pages 82–86, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal):
Punctuation and case restoration in code mixed Indian languages (Tripathy & Samal, UM-IoS 2022)
PDF:
https://aclanthology.org/2022.umios-1.9.pdf