<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W16">
  <paper id="3700">
    <title>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</title>
    <editor>Dekai Wu</editor>
    <editor>Pushpak Bhattacharyya</editor>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <url>http://aclweb.org/anthology/W16-37</url>
    <bibtype>book</bibtype>
    <bibkey>WSSANLP2016:2016</bibkey>
  </paper>

  <paper id="3701">
    <title>Compound Type Identification in Sanskrit: What Roles do the Corpus and Grammar Play?</title>
    <author><first>Amrith</first><last>Krishna</last></author>
    <author><first>Pavankumar</first><last>Satuluri</last></author>
    <author><first>Shubham</first><last>Sharma</last></author>
    <author><first>Apurv</first><last>Kumar</last></author>
    <author><first>Pawan</first><last>Goyal</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>1&#8211;10</pages>
    <url>http://aclweb.org/anthology/W16-3701</url>
    <abstract>classes namely, Avyayibhava, Tatpurusa,
	Bahuvrihi and Dvandva. Our classification is based on the
	traditional classification system followed by the ancient grammar treatise
	Astadhyayi, proposed by Panini 25 centuries back.
	We construct an elaborate feature space for our system by combining
	conditional rules from the grammar Ast, semantic relations between the
	compound components from a lexical database Amarakosa and linguistic
	structures from the data using Adaptor Grammars. Our in-depth analysis of the
	feature space highlights the inadequacy of Ast, a generative grammar, in
	classifying the data samples. Our experimental results validate the
	effectiveness of using lexical databases as suggested by Amba Kulkarni and Anil
	Kumar, and put forward a new research direction by introducing linguistic
	patterns obtained from Adaptor Grammars for effective identification of
	compound type. We utilise an ensemble-based approach, specifically designed for
	handling skewed datasets, and we achieve an overall accuracy of 0.77 using
	random forest classifiers.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>krishna-EtAl:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3702">
    <title>Comparison of Grapheme-to-Phoneme Conversion Methods on a Myanmar Pronunciation Dictionary</title>
    <author><first>Ye</first><last>Kyaw Thu</last></author>
    <author><first>Win</first><last>Pa Pa</last></author>
    <author><first>Yoshinori</first><last>Sagisaka</last></author>
    <author><first>Naoto</first><last>Iwahashi</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>11&#8211;22</pages>
    <url>http://aclweb.org/anthology/W16-3702</url>
    <abstract>Grapheme-to-Phoneme (G2P) conversion is the task of predicting the
	pronunciation of a word given its graphemic or written form. It is a highly
	important part of both automatic speech recognition (ASR) and text-to-speech
	(TTS) systems. In this paper, we evaluate seven G2P conversion approaches:
	Adaptive Regularization of Weight Vectors (AROW) based structured learning
	(S-AROW), Conditional Random Field (CRF), Joint-sequence models (JSM),
	phrase-based statistical machine translation (PBSMT), Recurrent Neural Network
	(RNN), Support Vector Machine (SVM) based point-wise classification, and
	Weighted Finite-state Transducers (WFST) on a manually tagged Myanmar phoneme
	dictionary. The G2P bootstrapping experimental results were measured with both
	automatic phoneme error rate (PER) calculation and manual checking in
	terms of voiced/unvoiced, tone, consonant and vowel errors. The results show
	that the CRF, PBSMT and WFST approaches are the best performing methods for G2P
	conversion on the Myanmar language.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kyawthu-EtAl:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3703">
    <title>Character-Aware Neural Networks for Arabic Named Entity Recognition for Social Media</title>
    <author><first>Mourad</first><last>Gridach</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>23&#8211;32</pages>
    <url>http://aclweb.org/anthology/W16-3703</url>
    <abstract>Named Entity Recognition (NER) is the task of classifying or labelling atomic
	elements in the text into categories such as Person, Location or Organisation.
	For the Arabic language, recognizing named entities is a challenging task because
	of the complexity and unique characteristics of the language. In addition,
	most previous work focuses on Modern Standard Arabic (MSA), whereas
	recognizing named entities in social media is attracting growing interest.
	Dialectal Arabic (DA) and MSA are both used in social media, which
	poses a further challenge. Most state-of-the-art Arabic NER systems
	rely heavily on handcrafted features and lexicons, which are
	time-consuming to build. In this paper, we introduce a novel neural network
	architecture that benefits from both character- and word-level representations
	automatically, using a combination of a bidirectional LSTM and a Conditional
	Random Field (CRF), eliminating the need for most feature engineering.
	Moreover, our model relies on unsupervised word representations learned from
	unannotated corpora. Experimental results demonstrate that our model achieves
	state-of-the-art performance on a publicly available benchmark for Arabic NER for
	social media, surpassing the previous system by a large margin.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gridach:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3704">
    <title>Development of a Bengali parser by cross-lingual transfer from Hindi</title>
    <author><first>Ayan</first><last>Das</last></author>
    <author><first>Agnivo</first><last>Saha</last></author>
    <author><first>Sudeshna</first><last>Sarkar</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>33&#8211;43</pages>
    <url>http://aclweb.org/anthology/W16-3704</url>
    <abstract>In recent years there has been a lot of interest in cross-lingual parsing for
	developing treebanks for languages with small or no annotated treebanks. In
	this paper, we explore the development of a cross-lingual transfer parser from
	Hindi to Bengali using a Hindi parser and a Hindi-Bengali parallel corpus. A
	parser is trained and applied to the Hindi sentences of the parallel corpus
	and the parse trees are projected to construct probable parse trees of the
	corresponding Bengali sentences. Only about 14% of these trees are complete
	(transferred trees contain all the target sentence words) and they are used to
	construct a Bengali parser. We relax the criteria of completeness to consider
	well-formed trees (43% of the trees) leading to an improvement. We note
	that the words often do not have a one-to-one mapping in the two languages but
	considering sentences at the chunk-level results in better correspondence
	between the two languages. Based on this we present a method to use chunking as
	a preprocessing step and do the transfer on the chunk trees. We find that about
	72% of the projected parse trees of Bengali are now well-formed. The resultant
	parser achieves significant improvement in both Unlabeled Attachment Score
	(UAS) and Labeled Attachment Score (LAS) over the baseline word-level transferred
	parser.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>das-saha-sarkar:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3705">
    <title>Sinhala Short Sentence Similarity Calculation using Corpus-Based and Knowledge-Based Similarity Measures</title>
    <author><first>Jcs</first><last>Kadupitiya</last></author>
    <author><first>Surangika</first><last>Ranathunga</last></author>
    <author><first>Gihan</first><last>Dias</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>44&#8211;53</pages>
    <url>http://aclweb.org/anthology/W16-3705</url>
    <abstract>Currently, corpus-based similarity, string-based similarity, and
	knowledge-based similarity techniques are used to compare short phrases.
	However, no work has been conducted on the similarity of phrases in the Sinhala
	language. In this paper, we present a hybrid methodology to compute the
	similarity between two Sinhala sentences using a Semantic Similarity
	Measurement technique (corpus-based similarity measurement plus knowledge-based
	similarity measurement) that makes use of word order information. Since Sinhala
	WordNet is still under construction, we used lexical resources in performing
	this semantic similarity calculation. Evaluation using 4000 sentence pairs
	yielded an average MSE of 0.145 and a Pearson correlation factor of 0.832.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kadupitiya-ranathunga-dias:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3706">
    <title>Enriching Source for English-to-Urdu Machine Translation</title>
    <author><first>Bushra</first><last>Jawaid</last></author>
    <author><first>Amir</first><last>Kamran</last></author>
    <author><first>Ond&#x159;ej</first><last>Bojar</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>54&#8211;63</pages>
    <url>http://aclweb.org/anthology/W16-3706</url>
    <abstract>This paper focuses on the generation of case markers for free word order
	languages that use case markers as phrasal clitics for marking the
	relationship between the dependent-noun and its head. The generation of such
	clitics becomes an essential task, especially when translating from fixed word
	order languages where syntactic relations are identified by the positions of the
	dependent-nouns. To address the problem of missing markers on the source side,
	artificial markers are added to the source to improve alignments with its target
	counterparts. An increase of up to 1 BLEU point is observed over the baseline on
	different test sets for English-to-Urdu.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jawaid-kamran-bojar:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3707">
    <title>The IMAGACT4ALL Ontology of Animated Images: Implications for Theoretical and Machine Translation of Action Verbs from English-Indian Languages</title>
    <author><first>Pitambar</first><last>Behera</last></author>
    <author><first>Sharmin</first><last>Muzaffar</last></author>
    <author><first>Atul kr.</first><last>Ojha</last></author>
    <author><first>Girish</first><last>Jha</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>64&#8211;73</pages>
    <url>http://aclweb.org/anthology/W16-3707</url>
    <abstract>Action verbs are one of the frequently occurring linguistic elements in any
	given natural language as the speakers use them during every linguistic
	intercourse. However, each language expresses action verbs in its own
	inherently unique manner by categorization. One verb can refer to several
	interpretations of actions and one action can be expressed by more than one
	verb. The inter-language and intra-language variations create ambiguity for the
	translation of languages from the source language to target language with
	respect to action verbs. IMAGACT is a corpus-based ontological platform of
	action verbs translated from prototypic animated images explained in English
	and Italian as meta-languages. In this paper, we present the issues and
	challenges in translating action verbs, with Indian languages as target and
	English as source language, by observing the animated images. The ten
	Indian languages annotated so far on the platform are Sanskrit,
	Hindi, Urdu, Odia (Oriya), Bengali, Manipuri, Tamil, Assamese, Magahi and
	Marathi. Of these, Manipuri belongs to the Sino-Tibetan family, Tamil to the
	Dravidian family, and the rest to the Indo-Aryan family. One
	of the issues is that English verbs that are morphologically a single word are
	translated into most of the Indian languages as verbs with more than one word
	form, for instance conjunct, compound and serial verbs. We
	further present a cross-lingual comparison of action verbs among Indian
	languages. In addition, we deal with the issues in disambiguating
	animated images by L1 native speakers using competence-based judgements and
	the theoretical and machine translation implications they bear.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>behera-EtAl:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3708">
    <title>Crowdsourcing-based Annotation of Emotions in Filipino and English Tweets</title>
    <author><first>Fermin Roberto</first><last>Lapitan</last></author>
    <author><first>Riza Theresa</first><last>Batista-Navarro</last></author>
    <author><first>Eliezer</first><last>Albacea</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>74&#8211;82</pages>
    <url>http://aclweb.org/anthology/W16-3708</url>
    <abstract>The automatic analysis of emotions conveyed in social media content, e.g.,
	tweets, has many beneficial applications. In the Philippines, one of the most
	disaster-prone countries in the world, such methods could potentially enable
	first responders to make timely decisions despite the risk of data deluge.
	However, recognising emotions expressed in Philippine-generated tweets, which
	are mostly written in Filipino, English or a mix of both, is a non-trivial
	task. In order to facilitate the development of natural language processing
	(NLP) methods that will automate such type of analysis, we have built a corpus
	of tweets whose predominant emotions have been manually annotated by means of
	crowdsourcing. Defining measures ensuring that only high-quality annotations
	were retained, we have produced a gold standard corpus of 1,146
	emotion-labelled Filipino and English tweets. We validate the value of this
	manually produced resource by demonstrating that an automatic
	emotion-prediction method based on the use of a publicly available word-emotion
	association lexicon was unable to reproduce the labels assigned via
	crowdsourcing.
	While we are planning to make a few extensions to the corpus in the near
	future, its current version has been made publicly available in order to foster
	the development of emotion analysis methods based on advanced Filipino and
	English NLP.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>lapitan-batistanavarro-albacea:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3710">
    <title>Sentiment Analysis of Tweets in Three Indian Languages</title>
    <author><first>Shanta</first><last>Phani</last></author>
    <author><first>Shibamouli</first><last>Lahiri</last></author>
    <author><first>Arindam</first><last>Biswas</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>93&#8211;102</pages>
    <url>http://aclweb.org/anthology/W16-3710</url>
    <abstract>In this paper, we describe the results of sentiment analysis on tweets in three
	Indian languages -- Bengali, Hindi, and Tamil. We used the recently released
	SAIL dataset (Patra et al., 2015), and obtained state-of-the-art results in all
	three languages. Our features are simple, robust, scalable, and
	language-independent. Further, we show that these simple features provide
	better results than more complex and language-specific features, in two
	separate classification tasks. Detailed feature analysis and error analysis
	have been reported, along with learning curves for Hindi and Bengali.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>phani-lahiri-biswas:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3711">
    <title>Dealing with Linguistic Divergences in English-Bhojpuri Machine Translation</title>
    <author><first>Pitambar</first><last>Behera</last></author>
    <author><first>Neha</first><last>Mourya</last></author>
    <author><first>Vandana</first><last>Pandey</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>103&#8211;113</pages>
    <url>http://aclweb.org/anthology/W16-3711</url>
    <abstract>In Machine Translation, divergence is one of the major barriers and plays a
	deciding role in determining the efficiency of the system at hand. Translation
	divergences originate when there are structural discrepancies between the input
	and the output languages. They can be of various types depending on the issues
	being addressed, such as linguistic, cultural, communicative and so on. Because
	the two languages owe their origin to different language families,
	linguistic divergences emerge. The present study attempts to categorize
	different types of linguistic divergences: the lexical-semantic and syntactic.
	In addition, it helps identify and resolve the divergent linguistic
	features between English as source language and Bhojpuri as target
	language. Dorr’s theoretical framework (1994, 1994a) has been followed in the
	classification and resolution procedure. Furthermore, so far as the methodology
	is concerned, we have adhered to Dorr’s Lexical Conceptual Structure for
	the resolution of divergences. This research will prove to be beneficial for
	developing efficient MT systems if the mentioned factors are incorporated
	considering the inherent structural constraints between source and target
	languages.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>behera-mourya-pandey:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3712">
    <title>The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese</title>
    <author><first>Miki</first><last>Nishioka</last></author>
    <author><first>Shiro</first><last>Akasegawa</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>114&#8211;123</pages>
    <url>http://aclweb.org/anthology/W16-3712</url>
    <abstract>In this paper, we discuss our creation of a web corpus of spoken Hindi (COSH),
	one of the Indo-Aryan languages spoken mainly in the Indian subcontinent. We
	also point out notable problems we’ve encountered in the web corpus and the
	special concordancer. After observing the kind of technical problems we
	encountered, especially regarding annotation tagged by Shiva Reddy’s tagger,
	we argue how they can be solved when using COSH for linguistic studies.
	Finally, we mention the kinds of linguistic research that we non-native
	speakers of Hindi can do using the corpus, especially in pragmatics and
	semantics, and from a comparative viewpoint to Japanese.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nishioka-akasegawa:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3713">
    <title>Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus</title>
    <author><first>Riyafa</first><last>Abdul Hameed</last></author>
    <author><first>Nadeeshani</first><last>Pathirennehelage</last></author>
    <author><first>Anusha</first><last>Ihalapathirana</last></author>
    <author><first>Maryam</first><last>Ziyad Mohamed</last></author>
    <author><first>Surangika</first><last>Ranathunga</last></author>
    <author><first>Sanath</first><last>Jayasena</last></author>
    <author><first>Gihan</first><last>Dias</last></author>
    <author><first>Sandareka</first><last>Fernando</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>124&#8211;132</pages>
    <url>http://aclweb.org/anthology/W16-3713</url>
    <abstract>A sentence aligned parallel corpus is an important prerequisite in statistical
	machine translation. However, manual creation of such a parallel corpus is time
	consuming, and requires experts fluent in both languages. Automatic creation of
	a sentence aligned parallel corpus using parallel text is the solution to this
	problem. In this paper, we present the first ever empirical evaluation carried
	out to identify the best method to automatically create a sentence aligned
	Sinhala-Tamil parallel corpus. Annual reports from Sri Lankan government
	institutions were used as the parallel text for aligning. Despite both Sinhala
	and Tamil being under-resourced languages, we were able to achieve an F-score
	value of 0.791 using a hybrid approach that makes use of a bilingual
	dictionary.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>abdulhameed-EtAl:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3714">
    <title>Clustering-based Phonetic Projection in Mismatched Crowdsourcing Channels for Low-resourced ASR</title>
    <author><first>Wenda</first><last>Chen</last></author>
    <author><first>Mark</first><last>Hasegawa-Johnson</last></author>
    <author><first>Nancy</first><last>Chen</last></author>
    <author><first>Preethi</first><last>Jyothi</last></author>
    <author><first>Lav</first><last>Varshney</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>133&#8211;141</pages>
    <url>http://aclweb.org/anthology/W16-3714</url>
    <abstract>Acquiring labeled speech for low-resource languages is a difficult task in the
	absence of native speakers of the language. One solution to this problem
	involves collecting speech transcriptions from crowd workers who are foreign or
	non-native speakers of a given target language. From these mismatched
	transcriptions, one can derive probabilistic phone transcriptions that are
	defined over the set of all target language phones using a noisy channel model.
	This paper extends prior work on deriving probabilistic transcriptions (PTs)
	from mismatched transcriptions by 1) modelling multilingual channels and 2)
	introducing a clustering-based phonetic mapping technique to improve the
	quality of PTs. Mismatched crowdsourcing for multilingual channels has certain
	properties of projection mapping, e.g., it can be interpreted as a clustering
	based on singular value decomposition of the segment alignments. To this end,
	we explore the use of distinctive feature weights, lexical tone confusions, and
	a two-step clustering algorithm to learn projections of phoneme segments from
	mismatched multilingual transcriber languages to the target language. We
	evaluate our techniques using mismatched transcriptions for Cantonese speech
	acquired from native English and Mandarin speakers. We observe a 5-9% relative
	reduction in phone error rate for the predicted Cantonese phone transcriptions
	using our proposed techniques compared with the previous PT method.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chen-EtAl:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3715">
    <title>Improving the Morphological Analysis of Classical Sanskrit</title>
    <author><first>Oliver</first><last>Hellwig</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>142&#8211;151</pages>
    <url>http://aclweb.org/anthology/W16-3715</url>
    <abstract>The paper describes a new tagset for the morphological disambiguation of
	Sanskrit, and compares the accuracy of two machine learning methods
	(Conditional Random Fields, deep recurrent neural networks) for this task, with
	a special focus on how to model the lexicographic information. It reports a
	significant improvement over previously published results.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hellwig:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3716">
    <title>Query Translation for Cross-Language Information Retrieval using Multilingual Word Clusters</title>
    <author><first>Paheli</first><last>Bhattacharya</last></author>
    <author><first>Pawan</first><last>Goyal</last></author>
    <author><first>Sudeshna</first><last>Sarkar</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>152&#8211;162</pages>
    <url>http://aclweb.org/anthology/W16-3716</url>
    <abstract>In Cross-Language Information Retrieval, finding the appropriate translation of
	the source language query has always been a difficult problem to solve. We
	propose a technique towards solving this problem with the help of multilingual
	word clusters obtained from multilingual word embeddings. We use word
	embeddings of the languages projected to a common vector space on which a
	community-detection algorithm is applied to find clusters such that words that
	represent the same concept from different languages fall in the same group. We
	utilize these multilingual word clusters to perform query translation for
	Cross-Language Information Retrieval for three languages: English, Hindi and
	Bengali. We have experimented with the FIRE 2012 and Wikipedia datasets and
	have shown improvements over several standard methods such as a dictionary-based
	method, a transliteration-based model and Google Translate.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bhattacharya-goyal-sarkar:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3717">
    <title>A study of attention-based neural machine translation model on Indian languages</title>
    <author><first>Ayan</first><last>Das</last></author>
    <author><first>Pranay</first><last>Yerra</last></author>
    <author><first>Ken</first><last>Kumar</last></author>
    <author><first>Sudeshna</first><last>Sarkar</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>163&#8211;172</pages>
    <url>http://aclweb.org/anthology/W16-3717</url>
    <abstract>Neural machine translation (NMT) models have recently been shown to be very
	successful in machine translation (MT). The use of LSTMs in machine translation
	has significantly improved the translation performance for longer sentences by
	being able to capture the context and long range correlations of the sentences
	in their hidden layers. The attention model based NMT system (Bahdanau et al.,
	2014) has become the state of the art, performing as well as or better than other
	statistical MT approaches. In this paper, we study the performance of the
	attention-model based NMT system (Bahdanau et al., 2014) on the Indian language
	pair, Hindi and Bengali, and analyse the types of errors that occur
	when the languages are morphologically rich and there is a scarcity of
	large parallel training corpora. We then carry out certain post-processing
	heuristic steps to improve the quality of the translated statements and suggest
	further measures that can be carried out.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>das-EtAl:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3718">
    <title>Comprehensive Part-Of-Speech Tag Set and SVM based POS Tagger for Sinhala</title>
    <author><first>Sandareka</first><last>Fernando</last></author>
    <author><first>Surangika</first><last>Ranathunga</last></author>
    <author><first>Sanath</first><last>Jayasena</last></author>
    <author><first>Gihan</first><last>Dias</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>173&#8211;182</pages>
    <url>http://aclweb.org/anthology/W16-3718</url>
    <abstract>This paper presents a new comprehensive multi-level Part-Of-Speech tag set and
	a Support Vector Machine based Part-Of-Speech tagger for the Sinhala language.
	The currently available tag set for Sinhala has two limitations: the
	unavailability of tags to represent some word classes and the lack of tags to
	capture inflection-based grammatical variations of words. The new tag set
	presented in this paper overcomes both of these limitations. The accuracy of
	available Sinhala Part-Of-Speech taggers, which are based on Hidden Markov
	Models, still falls far behind state of the art. Our Support Vector Machine
	based tagger achieved an overall accuracy of 84.68% with 59.86% accuracy for
	unknown words and 87.12% for known words, when the test set contains 10% of
	unknown words.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>fernando-EtAl:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3719">
    <title>Align Me: A framework to generate Parallel Corpus Using OCRs and Bilingual Dictionaries</title>
    <author><first>Priyam</first><last>Bakliwal</last></author>
    <author><first>Devadath</first><last>V V</last></author>
    <author><first>C V</first><last>Jawahar</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>183&#8211;187</pages>
    <url>http://aclweb.org/anthology/W16-3719</url>
    <abstract>Multilingual language processing tasks like statistical machine translation and
	cross language information retrieval rely mainly on the availability of accurate
	parallel corpora. Manual construction of such corpora can be extremely expensive
	and time consuming. In this paper we present a simple yet efficient method to
	generate a large amount of reasonably accurate parallel text with minimal user
	effort. We utilize the availability of a large number of English books and their
	corresponding translations in other languages to build parallel corpora. Optical
	Character Recognition systems are used to digitize such books. We propose a
	robust dictionary based parallel corpus generation system for alignment of
	multilingual text at different levels of granularity (sentence, paragraphs,
	etc). We show the performance of our proposed method on a manually aligned
	dataset of 300 Hindi-English sentences and 100 English-Malayalam sentences.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bakliwal-vv-jawahar:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3720">
    <title>Learning Indonesian-Chinese Lexicon with Bilingual Word Embedding Models and Monolingual Signals</title>
    <author><first>Xinying</first><last>Qiu</last></author>
    <author><first>Gangqin</first><last>Zhu</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>188&#8211;193</pages>
    <url>http://aclweb.org/anthology/W16-3720</url>
    <abstract>We present research on learning an Indonesian-Chinese bilingual lexicon using
	monolingual word embeddings and bilingual seed lexicons to build a shared
	bilingual word embedding space. We make a first attempt to examine the impact
	of different monolingual signals for the choice of seed lexicons on the model
	performance. We found that although monolingual signals alone do not seem to
	outperform signals covering all words, the significant improvement for
	learning word translation of the same signal types may suggest that linguistic
	features possess value for further study in distinguishing the semantic margins
	of the shared word embedding space.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>qiu-zhu:2016:WSSANLP2016</bibkey>
  </paper>

  <paper id="3721">
    <title>Creating rich online dictionaries for the Lao&#8211;French language pair, reusable for Machine Translation</title>
    <author><first>Vincent</first><last>Berment</last></author>
    <booktitle>Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>194&#8211;197</pages>
    <url>http://aclweb.org/anthology/W16-3721</url>
    <abstract>In this paper, we present how we generated two rich online bilingual
	dictionaries — Lao-French and French-Lao — from unstructured dictionaries
	in Microsoft Word files. Then we briefly discuss the possible reuse of the
	lexical data for Machine Translation projects.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>berment:2016:WSSANLP2016</bibkey>
  </paper>

</volume>

