Nuttanart Muansuwqan


Thai Word Segmentation a Lexical Semantic Approach
Krisda Khankasikam | Nuttanart Muansuwqan
Proceedings of Machine Translation Summit X: Posters

Written Thai does not mark word boundaries explicitly, so word segmentation is needed to determine where words begin and end in Thai sentences. Many Thai language processing applications depend on it. Several existing approaches to Thai word segmentation, such as maximal matching, longest matching and n-gram models, do not take semantics into consideration. This paper presents a Thai word segmentation system that uses a semantic corpus and consists of four steps: generating all possible candidates, proper noun consideration, semantic tagging and semantic checking. The first three steps are carried out using a dictionary, while semantic checking follows a corpus-based approach. Finally, we assign semantic scores to the segmented words and select the segmentation with the maximum semantic score. To assign the semantic scores, we use a Thai proper noun database and a semantic corpus derived from the ORCHID corpus. This approach is more reliable than approaches that do not take meaning into consideration, and it achieves an accuracy of 96-99%, depending on the characteristics of the input and the dictionary used in the segmentation.
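
As a rough illustration of the first and last steps described in the abstract (generating all possible candidates from a dictionary, then keeping the candidate with the maximum semantic score), here is a minimal Python sketch. It is not the authors' implementation. The names `candidates`, `semantic_score` and `segment` are hypothetical, as are the toy dictionary and the pair-score table; the numeric scores are invented for the example and are not derived from ORCHID. Proper noun consideration and semantic tagging are omitted.

```python
from functools import lru_cache


def candidates(text, dictionary):
    """Return every way to split `text` into words found in `dictionary`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # One empty segmentation exists once the whole string is consumed.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in dictionary:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


def semantic_score(words, pair_scores):
    """Assumed scoring rule: sum a compatibility score over adjacent word pairs."""
    return sum(pair_scores.get(pair, 0.0) for pair in zip(words, words[1:]))


def segment(text, dictionary, pair_scores):
    """Pick the candidate segmentation with the highest score, or None if there is none."""
    options = candidates(text, dictionary)
    if not options:
        return None
    return max(options, key=lambda words: semantic_score(words, pair_scores))


# Toy data (hypothetical): the string can be read as "ตา|กลม" or "ตาก|ลม".
DICTIONARY = frozenset({"ตา", "กลม", "ตาก", "ลม"})
PAIR_SCORES = {("ตา", "กลม"): 0.9, ("ตาก", "ลม"): 0.4}  # invented values

if __name__ == "__main__":
    print(candidates("ตากลม", DICTIONARY))
    print(segment("ตากลม", DICTIONARY, PAIR_SCORES))
```

Which reading wins here depends entirely on the invented scores; in the system described above, that decision would come from the semantic corpus and the proper noun database.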