Septina Dian Larasati

Also published as: Septina Larasati

The Votter Corpus is a new annotated corpus of social polling questions and answers. It is novel both in its use of the mobile application format and in its coverage of specific demographics. With over 26,000 polls and close to 1 million votes, the Votter Corpus covers everyday question-and-answer language, primarily from users who are female and between the ages of 13 and 24. The corpus is annotated by topic and by the popularity of particular answers. It has many distinctive characteristics, such as emoticons, common mobile misspellings, and images associated with many of the questions. The corpus is a collection of questions and answers from the Votter app on the Android operating system. Data is created solely on this mobile platform, which distinguishes it from most social media corpora. The Votter Corpus is being made available online in XML format for research and non-commercial use. The Votter Android app can be downloaded for free from most Android app stores.

2012

Rule-based Machine Translation between Indonesian and Malaysian
Raymond Hendy Susanto | Septina Dian Larasati | Francis M. Tyers
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing

Indonesian Dependency Treebank: Annotation and Parsing
Nathan Green | Septina Dian Larasati | Zdenek Zabokrtsky
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus
Septina Dian Larasati
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to ensure its quality. We also apply language-specific text processing, such as tokenization on both sides and clitic normalization on the Indonesian side. The corpus is available in two formats: 'plain', stored in text format, and 'morphologically enriched', stored in CoNLL format. Some parts of the corpus are publicly available at the IDENTIC homepage.

Improving Word Alignment by Exploiting Adapted Word Similarity
Septina Dian Larasati
Workshop on Monolingual Machine Translation

This paper presents a method to improve a word alignment model in a phrase-based Statistical Machine Translation system for a low-resourced language using a string similarity approach. Our method captures similar words that can be seen as semi-monolingual across languages, such as numbers, named entities, and adapted/loan words. We use several string similarity metrics to measure the monolinguality of the words: Longest Common Subsequence Ratio (LCSR), Minimum Edit Distance Ratio (MEDR), and a modified BLEU score (modBLEU). Our approach is to add intersecting alignment points for word pairs that are orthographically similar, before applying a word alignment heuristic, to generate a better word alignment. We demonstrate this approach on an Indonesian-to-English translation task, where the languages share many similar words that are poorly aligned given limited training data. This approach gives a statistically significant improvement of up to 0.66 in terms of BLEU score.
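
To make the string similarity metrics named in the abstract concrete, here is a minimal sketch of LCSR and an edit-distance ratio. It is not the paper's implementation. LCSR follows the commonly used definition (longest common subsequence length divided by the length of the longer word). The abstract does not define MEDR, so the normalization used here (one minus Levenshtein distance over the longer word's length) is an assumption, as are all function names and the example word pair.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: LCS length over the longer word's length."""
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return lcs_length(a, b) / max(len(a), len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def medr(a: str, b: str) -> float:
    """Edit-distance ratio; this normalization is an assumption, not the paper's."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical loan-word pair chosen for illustration only.
    print(lcsr("universitas", "university"))
    print(medr("universitas", "university"))
```

Word pairs that score above some threshold on such a metric would be the candidates for extra alignment points; the abstract does not give the threshold, so none is assumed here.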

Handling Indonesian Clitics: A Dataset Comparison for an Indonesian-English Statistical Machine Translation System
Septina Dian Larasati
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation