Construction and Analysis of a Large Vietnamese Text Corpus

Dieu-Thu Le, Uwe Quasthoff


Abstract
This paper presents a new Vietnamese text corpus containing around 4.05 billion words. It is a collection of Wikipedia texts, newspaper articles and random web texts. The paper describes the process of collecting, cleaning and creating the corpus. Processing Vietnamese text poses several challenges. For example, unlike many other languages written in the Latin alphabet, Vietnamese does not use blanks to separate words, so common tokenizers that treat blanks as word boundaries do not work. A short review of different approaches to Vietnamese tokenization is presented, together with a description of how the corpus was processed and created. Statistical analyses of the data are then reported, including the number of syllables, average word length, sentence length and topic analysis. The corpus is integrated into a framework that allows searching and browsing. Using this web interface, users can find out how many times a particular word appears in the corpus, see sample sentences in which the word occurs, and view its left and right neighbors.
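
To illustrate why splitting on blanks is not enough, here is a minimal Python sketch. In written Vietnamese, blanks separate syllables, and a single word may span several syllables. The sketch uses greedy longest-match against a tiny lexicon, which is a simple stand-in technique and not necessarily the approach used for this corpus. The lexicon, the example sentence, and the function names `segment` and `neighbours` are all hypothetical and chosen only for illustration.

```python
import unicodedata
from collections import Counter


def norm(text: str) -> str:
    """Lowercase and NFC-normalize so diacritics compare consistently."""
    return unicodedata.normalize("NFC", text).lower()


# Hypothetical toy lexicon of multi-syllable words (assumption for this sketch);
# a real system would use a large dictionary or a trained segmenter.
LEXICON = {norm(w) for w in ("Việt Nam", "thủ đô", "Hà Nội")}
MAX_SYLLABLES = 2  # longest lexicon entry, measured in syllables


def segment(sentence: str) -> list:
    """Greedy longest-match segmentation over blank-separated syllables."""
    syllables = norm(sentence).split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first; fall back to a single syllable.
        for n in range(min(MAX_SYLLABLES, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            if n == 1 or candidate in LEXICON:
                words.append(candidate)
                i += n
                break
    return words


def neighbours(words: list, target: str):
    """Count the words directly to the left and right of `target`."""
    left, right = Counter(), Counter()
    for idx, word in enumerate(words):
        if word == target:
            if idx > 0:
                left[words[idx - 1]] += 1
            if idx + 1 < len(words):
                right[words[idx + 1]] += 1
    return left, right


sentence = "Hà Nội là thủ đô của Việt Nam"
syllables = norm(sentence).split()   # what naive blank splitting yields
words = segment(sentence)            # multi-syllable words kept together
avg_syllables_per_word = len(syllables) / len(words)
left, right = neighbours(words, norm("thủ đô"))
```

With this toy lexicon, blank splitting yields eight syllable tokens, while the segmenter groups them into five words, which is the kind of difference that affects word counts, average word length, and neighbor statistics.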
Anthology ID:
L16-1065
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
412–416
URL:
https://aclanthology.org/L16-1065
Cite (ACL):
Dieu-Thu Le and Uwe Quasthoff. 2016. Construction and Analysis of a Large Vietnamese Text Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 412–416, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Construction and Analysis of a Large Vietnamese Text Corpus (Le & Quasthoff, LREC 2016)
PDF:
https://aclanthology.org/L16-1065.pdf