Sandareka Fernando
2016
Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus
Riyafa Abdul Hameed
|
Nadeeshani Pathirennehelage
|
Anusha Ihalapathirana
|
Maryam Ziyad Mohamed
|
Surangika Ranathunga
|
Sanath Jayasena
|
Gihan Dias
|
Sandareka Fernando
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
A sentence aligned parallel corpus is an important prerequisite in statistical machine translation. However, manual creation of such a parallel corpus is time consuming, and requires experts fluent in both languages. Automatic creation of a sentence aligned parallel corpus using parallel text is the solution to this problem. In this paper, we present the first ever empirical evaluation carried out to identify the best method to automatically create a sentence aligned Sinhala-Tamil parallel corpus. Annual reports from Sri Lankan government institutions were used as the parallel text for aligning. Despite both Sinhala and Tamil being under-resourced languages, we were able to achieve an F-score value of 0.791 using a hybrid approach that makes use of a bilingual dictionary.
Comprehensive Part-Of-Speech Tag Set and SVM based POS Tagger for Sinhala
Sandareka Fernando
|
Surangika Ranathunga
|
Sanath Jayasena
|
Gihan Dias
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
This paper presents a new comprehensive multi-level Part-Of-Speech tag set and a Support Vector Machine based Part-Of-Speech tagger for the Sinhala language. The currently available tag set for Sinhala has two limitations: the unavailability of tags to represent some word classes and the lack of tags to capture inflection based grammatical variations of words. The new tag set, presented in this paper overcomes both of these limitations. The accuracy of available Sinhala Part-Of-Speech taggers, which are based on Hidden Markov Models, still falls far behind state of the art. Our Support Vector Machine based tagger achieved an overall accuracy of 84.68% with 59.86% accuracy for unknown words and 87.12% for known words, when the test set contains 10% of unknown words.