Disfluency Detection for Vietnamese

Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen


Abstract
In this paper, we present the first empirical study of disfluency detection for Vietnamese. To conduct this study, we first create a Vietnamese disfluency detection dataset with manual annotations over two disfluency types. We then conduct experiments with strong baseline models and find that: (i) automatic Vietnamese word segmentation improves the disfluency detection performance of the baselines, and (ii) the highest results are obtained by fine-tuning pre-trained language models, with the monolingual model PhoBERT for Vietnamese outperforming the multilingual model XLM-R.
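Below is a minimal sketch (not the authors' exact pipeline) of how disfluency detection can be framed as token classification by fine-tuning PhoBERT with Hugging Face Transformers; swapping in XLM-R amounts to changing the model name. The tag set and the word-segmented example sentence are hypothetical, included only for illustration.

```python
# Sketch: disfluency detection as token classification with PhoBERT.
# Assumptions: a hypothetical O/B-DISF/I-DISF tag set; input already
# word-segmented by an automatic Vietnamese word segmenter.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DISF", "I-DISF"]  # hypothetical label set for illustration
model_name = "vinai/phobert-base"   # use "xlm-roberta-base" for the XLM-R baseline

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)

# Hypothetical word-segmented utterance containing a repetition-style disfluency
# ("tôi muốn à muốn đặt vé" ~ "I want uh want to book a ticket").
words = ["tôi", "muốn", "à", "muốn", "đặt", "vé"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Logits of shape (1, sequence_length, num_labels); fine-tuning would
# minimize a token-level cross-entropy loss against the gold disfluency tags.
logits = model(**enc).logits
```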
Anthology ID:
2022.wnut-1.21
Volume:
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
WNUT
Publisher:
Association for Computational Linguistics
Pages:
194–200
URL:
https://aclanthology.org/2022.wnut-1.21
Cite (ACL):
Mai Hoang Dao, Thinh Hung Truong, and Dat Quoc Nguyen. 2022. Disfluency Detection for Vietnamese. In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022), pages 194–200, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Cite (Informal):
Disfluency Detection for Vietnamese (Dao et al., WNUT 2022)
PDF:
https://aclanthology.org/2022.wnut-1.21.pdf
Code:
 vinairesearch/phodisfluency