The VLP team participated in the DSL-ML shared task of the VarDial 2024 workshop, which aims to distinguish between texts in similar languages. This paper presents our approach to the problem and discusses our experimental and official results. We propose integrating semantics-aware word embeddings, learned from ConceptNet, into a bidirectional long short-term memory network. This approach achieves good performance: our system is ranked among the top two or three teams for the task.
This paper presents NNVLP, a neural network-based toolkit for essential Vietnamese language processing tasks, including part-of-speech (POS) tagging, chunking, and named entity recognition (NER). Our toolkit combines a bidirectional Long Short-Term Memory (Bi-LSTM) network, a Convolutional Neural Network (CNN), and a Conditional Random Field (CRF), and takes pre-trained word embeddings as input. It outperforms previously published toolkits on these three tasks. We provide both an API and a web demo for this toolkit.
This paper presents an empirical study on the application of the maximum entropy approach to part-of-speech tagging of Vietnamese, a language with special characteristics that largely distinguish it from occidental languages. Our best tagger explores and includes useful knowledge sources for tagging Vietnamese text and achieves an overall accuracy of 93.40% and an unknown-word accuracy of 80.69% on a test set of the Vietnamese treebank. Our tagger significantly outperforms the tagger currently used for building the Vietnamese treebank, and, as far as we are aware, this is the best tagging result published to date for the Vietnamese language.
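To make the maximum entropy formulation concrete, the following is a minimal sketch of how such a tagger scores candidate tags for one word: each active context feature contributes a weight per tag, and the weights are exponentiated and normalized into a conditional distribution. The feature templates, tag names, and weight values below are hypothetical and chosen only for illustration; they are not the knowledge sources or parameters of the tagger described above, which would be learned from the treebank.

```python
import math

# Hypothetical feature weights (lambda values); a real tagger would learn
# these from a treebank. Keys pair a context feature with a candidate tag.
WEIGHTS = {
    ("word=học", "V"): 1.2,
    ("word=học", "N"): 0.3,
    ("prev_tag=N", "V"): 0.8,
    ("prev_tag=N", "N"): 0.1,
    ("is_capitalized", "Np"): 2.0,
}
TAGS = ["N", "V", "Np"]  # illustrative tag set, not the full treebank tag set


def extract_features(words, i, prev_tag):
    """Return the binary context features active at position i."""
    feats = ["word=" + words[i], "prev_tag=" + prev_tag]
    if words[i][:1].isupper():
        feats.append("is_capitalized")
    return feats


def tag_distribution(features):
    """p(tag | context) = exp(sum of active weights) / normalizer."""
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in features) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}


words = ["sinh_viên", "học", "bài"]
dist = tag_distribution(extract_features(words, 1, prev_tag="N"))
best_tag = max(dist, key=dist.get)
print(best_tag, dist)
```

With these toy weights, the second word receives the highest probability for the verb tag, because both the word-identity feature and the previous-tag feature favor it.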
In this paper, we present a comparison between three segmentation systems for the Vietnamese language. The majority of Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. Identifying word boundaries in a text is therefore not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese and introduce a reference corpus developed for the purpose of evaluating this task. The results confirm that the task can be handled relatively well by automatic means, although a solution is still needed to account for out-of-vocabulary words.
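The segmentation ambiguity described above can be illustrated with a small sketch that enumerates every way of grouping a syllable sequence into words from a lexicon. The lexicon and the example phrase are assumptions made for illustration, and the function `segmentations` is a hypothetical helper; it does not reproduce any of the three systems compared in the paper.

```python
def segmentations(syllables, lexicon, max_len=3):
    """Enumerate every way to split `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]
    results = []
    # Try words made of 1..max_len leading syllables, then recurse on the rest.
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Toy lexicon (an assumption for illustration): each syllable is a word
# on its own, and two compounds reuse the same syllables.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}

syllables = "học sinh học sinh học".split()
candidates = segmentations(syllables, LEXICON)
for words in candidates:
    print(" | ".join(words))
```

Even with this four-entry lexicon, the five-syllable input has eight candidate segmentations, which is why a segmenter needs some way to rank candidates rather than only look words up.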
In this paper, we present vnLTAG, the first sizable grammar built for Vietnamese using Lexicalized Tree-Adjoining Grammar (LTAG), developed over the past two years. This grammar aims at modelling written language and is general enough to be both application- and domain-independent. It can be used for morpho-syntactic tagging and syntactic parsing of Vietnamese texts, as well as for text generation. We then present a robust parsing scheme using vnLTAG and a parser for the grammar. We conclude with an evaluation using a test suite.