<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W16">
  <paper id="5400">
    <title>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</title>
    <editor>Koiti Hasida</editor>
    <editor>Kam-Fai Wong</editor>
    <editor>Nicoletta Calzolari</editor>
    <editor>Key-Sun Choi</editor>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <url>http://aclweb.org/anthology/W16-54</url>
    <bibtype>book</bibtype>
    <bibkey>ALR12:2016</bibkey>
  </paper>

  <paper id="5401">
    <title>An extension of ISO-Space for annotating object direction</title>
    <author><first>Daiki</first><last>Gotou</last></author>
    <author><first>Hitoshi</first><last>Nishikawa</last></author>
    <author><first>Takenobu</first><last>Tokunaga</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>1&#8211;9</pages>
    <url>http://aclweb.org/anthology/W16-5401</url>
    <abstract>In this paper, we extend the existing annotation scheme ISO-Space to annotate
	the spatial information necessary for the task of placing a specified object at a
	specified location with a specified direction according to a natural language
	instruction. We call this task the spatial placement problem. Our extension
	particularly focuses on describing the object's direction when the object is
	placed on a 2D plane. We conducted an annotation experiment in which a corpus
	of 20 situated dialogues was annotated. The annotation result showed that the
	number of tags newly introduced by our proposal is not negligible. We also
	implemented an analyser that automatically assigns the proposed tags to the
	corpus and evaluated its performance. The results showed that performance for
	entity tags was quite high, ranging from 0.68 to 0.99 in F-measure, but this was
	not the case for relation tags, which scored less than 0.4 in F-measure.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gotou-nishikawa-tokunaga:2016:ALR12</bibkey>
  </paper>

  <paper id="5402">
    <title>Annotation and Analysis of Discourse Relations, Temporal Relations and Multi-Layered Situational Relations in Japanese Texts</title>
    <author><first>Kimi</first><last>Kaneko</last></author>
    <author><first>Saku</first><last>Sugawara</last></author>
    <author><first>Koji</first><last>Mineshima</last></author>
    <author><first>Daisuke</first><last>Bekki</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>10&#8211;19</pages>
    <url>http://aclweb.org/anthology/W16-5402</url>
    <abstract>This paper proposes a methodology for building a specialized Japanese data set
	for recognizing temporal relations and discourse relations.
	In addition to temporal and discourse relations, multi-layered situational
	relations that distinguish generic and specific states belonging to different
	layers in a discourse are annotated.
	Our methodology has been applied to 170 text fragments taken from Wikinews
	articles in Japanese.
	The validity of our methodology is evaluated and analyzed
	in terms of the degree of annotator agreement and the frequency of errors.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kaneko-EtAl:2016:ALR12</bibkey>
  </paper>

  <paper id="5403">
    <title>Developing Universal Dependencies for Mandarin Chinese</title>
    <author><first>Herman</first><last>Leung</last></author>
    <author><first>Rafa&#235;l</first><last>Poiret</last></author>
    <author><first>Tak-sum</first><last>Wong</last></author>
    <author><first>Xinying</first><last>Chen</last></author>
    <author><first>Kim</first><last>Gerdes</last></author>
    <author><first>John</first><last>Lee</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>20&#8211;29</pages>
    <url>http://aclweb.org/anthology/W16-5403</url>
    <abstract>This article proposes a Universal Dependencies annotation scheme for Mandarin
	Chinese, including POS tags and dependency analysis. We identify idiosyncrasies
	of Mandarin Chinese that are difficult to fit into the current scheme, which has
	mainly been based on descriptions of various Indo-European languages. We
	discuss differences between our scheme and those of the Stanford Chinese
	Dependencies and the Chinese Dependency Treebank.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>leung-EtAl:2016:ALR12</bibkey>
  </paper>

  <paper id="5404">
    <title>Developing Corpus of Lecture Utterances Aligned to Slide Components</title>
    <author><first>Ryo</first><last>Minamiguchi</last></author>
    <author><first>Masatoshi</first><last>Tsuchiya</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>30&#8211;37</pages>
    <url>http://aclweb.org/anthology/W16-5404</url>
    <abstract>The approach that formulates automatic text summarization as a maximum
	coverage problem with a knapsack constraint over a set of textual units and a
	set of weighted conceptual units is promising. However, determining the
	appropriate granularity of conceptual units for this formulation is both
	important and difficult. To resolve this problem, we examine using the
	components of presentation slides as conceptual units for generating summaries
	of lecture utterances, instead of other possible conceptual units such as base
	noun phrases or important nouns. This paper describes the corpus we are
	developing to evaluate our proposed approach, which consists of presentation
	slides and lecture utterances aligned to presentation slide components.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>minamiguchi-tsuchiya:2016:ALR12</bibkey>
  </paper>

  <paper id="5405">
    <title>VSoLSCSum: Building a Vietnamese Sentence-Comment Dataset for Social Context Summarization</title>
    <author><first>Minh-Tien</first><last>Nguyen</last></author>
    <author><first>Dac Viet</first><last>Lai</last></author>
    <author><first>Phong-Khac</first><last>Do</last></author>
    <author><first>Duc-Vu</first><last>Tran</last></author>
    <author><first>Minh-Le</first><last>Nguyen</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>38&#8211;48</pages>
    <url>http://aclweb.org/anthology/W16-5405</url>
    <abstract>This paper presents VSoLSCSum, a Vietnamese linked sentence-comment dataset,
	which was manually created to address the lack of standard corpora for social
	context summarization in Vietnamese. The dataset was collected using the
	keywords of 141 Web documents covering 12 special events mentioned on
	Vietnamese Web pages. Social users were asked to participate in creating
	reference summaries and labeling each sentence or comment. Inter-annotator
	agreement among raters after validation, measured by Cohen's Kappa, is 0.685.
	To illustrate the potential use of our dataset, a learning-to-rank method was
	trained using a set of local and social features. Experimental results indicate
	that the summary model trained on our dataset outperforms state-of-the-art
	baselines in both ROUGE-1 and ROUGE-2 for social context summarization.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nguyen-EtAl:2016:ALR12</bibkey>
  </paper>

  <paper id="5406">
    <title>BCCWJ-DepPara: A Syntactic Annotation Treebank on the ‘Balanced Corpus of Contemporary Written Japanese’</title>
    <author><first>Masayuki</first><last>Asahara</last></author>
    <author><first>Yuji</first><last>Matsumoto</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>49&#8211;58</pages>
    <url>http://aclweb.org/anthology/W16-5406</url>
    <abstract>Paratactic syntactic structures are difficult to represent in syntactic
	dependency tree structures. We therefore propose an annotation schema for
	syntactic dependency annotation of Japanese, in which coordinate structures are
	split from and overlaid on bunsetsu-based (base phrase unit) dependency. The
	schema represents nested coordinate structures, non-constituent conjuncts, and
	forward sharing as sets of regions. The annotation was performed on the core
	data of the ‘Balanced Corpus of Contemporary Written Japanese’, which
	comprises about one million words and 1,980 samples from six registers, such
	as newspapers, books, magazines, and web texts.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>asahara-matsumoto:2016:ALR12</bibkey>
  </paper>

  <paper id="5407">
    <title>SCTB: A Chinese Treebank in Scientific Domain</title>
    <author><first>Chenhui</first><last>Chu</last></author>
    <author><first>Toshiaki</first><last>Nakazawa</last></author>
    <author><first>Daisuke</first><last>Kawahara</last></author>
    <author><first>Sadao</first><last>Kurohashi</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>59&#8211;67</pages>
    <url>http://aclweb.org/anthology/W16-5407</url>
    <abstract>Treebanks are crucial for natural language processing (NLP). In this paper, we
	present our work on annotating a Chinese treebank in the scientific domain
	(SCTB) to address the lack of Chinese treebanks in this domain. Chinese
	analysis and machine translation experiments conducted using this treebank
	indicate that the annotated treebank can significantly improve performance on
	both tasks. The treebank is released to promote Chinese NLP research in the
	scientific domain.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chu-EtAl:2016:ALR12</bibkey>
  </paper>

  <paper id="5408">
    <title>Big Community Data before World Wide Web Era</title>
    <author><first>Tomoya</first><last>Iwakura</last></author>
    <author><first>Tetsuro</first><last>Takahashi</last></author>
    <author><first>Akihiro</first><last>Ohtani</last></author>
    <author><first>Kunio</first><last>Matsui</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>68&#8211;72</pages>
    <url>http://aclweb.org/anthology/W16-5408</url>
    <abstract>This paper introduces the NIFTY-Serve corpus, a large data archive collected
	from Japanese discussion forums that operated via a Bulletin Board System (BBS)
	between 1987 and 2006. The corpus can be used in Artificial Intelligence
	research areas such as Natural Language Processing and Community Analysis.
	The NIFTY-Serve corpus differs from data on the WWW in three ways: (1) it is
	essentially spam- and duplication-free because of strict data collection
	procedures, (2) it is historic user-generated data predating the WWW, and (3)
	it is a complete data set because the service has now shut down. We also
	introduce some examples of uses of the corpus.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>iwakura-EtAl:2016:ALR12</bibkey>
  </paper>

  <paper id="5409">
    <title>An Overview of BPPT's Indonesian Language Resources</title>
    <author><first>Gunarso</first><last>Gunarso</last></author>
    <author><first>Hammam</first><last>Riza</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>73&#8211;77</pages>
    <url>http://aclweb.org/anthology/W16-5409</url>
    <abstract>This paper describes the various Indonesian language resources that the Agency
	for the Assessment and Application of Technology (BPPT) has developed and
	collected since the mid-1980s, when we joined MMTS (Multilingual Machine
	Translation System), an international project coordinated by CICC-Japan to
	develop a machine translation system for five Asian languages (Bahasa
	Indonesia, Malay, Thai, Japanese, and Chinese). Since then, we have been
	actively conducting research in statistical machine translation, speech
	recognition, and speech synthesis, which requires many text and speech corpora.
	Our most recent cooperation, within ASEAN-IVO, is the development of the
	Indonesian ALT (Asian Language Treebank), which has added new NLP tools.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gunarso-riza:2016:ALR12</bibkey>
  </paper>

  <paper id="5410">
    <title>Creating Japanese Political Corpus from Local Assembly Minutes of 47 prefectures</title>
    <author><first>Yasutomo</first><last>Kimura</last></author>
    <author><first>Keiichi</first><last>Takamaru</last></author>
    <author><first>Takuma</first><last>Tanaka</last></author>
    <author><first>Akio</first><last>Kobayashi</last></author>
    <author><first>Hiroki</first><last>Sakaji</last></author>
    <author><first>Yuzu</first><last>Uchida</last></author>
    <author><first>Hokuto</first><last>Ototake</last></author>
    <author><first>Shigeru</first><last>Masuyama</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>78&#8211;85</pages>
    <url>http://aclweb.org/anthology/W16-5410</url>
    <abstract>This paper describes a Japanese political corpus created for interdisciplinary
	political research.
	The corpus contains the local assembly minutes of 47 prefectures from April
	2011 to March 2015.
	This four-year period coincides with the term of office of assembly members in
	most local governments.
	We analyze statistical data, such as the number of speakers, characters, and
	words, to clarify the characteristics of local assembly minutes.
	In addition, we identify problems associated with the different web services
	used by the local governments to make the minutes available to the public.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>kimura-EtAl:2016:ALR12</bibkey>
  </paper>

  <paper id="5411">
    <title>Selective Annotation of Sentence Parts: Identification of Relevant Sub-sentential Units</title>
    <author><first>Ge</first><last>Xu</last></author>
    <author><first>Xiaoyan</first><last>Yang</last></author>
    <author><first>Chu-Ren</first><last>Huang</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>86&#8211;94</pages>
    <url>http://aclweb.org/anthology/W16-5411</url>
    <abstract>Many NLP tasks involve sentence-level annotation, yet the relevant information
	is encoded not at the sentence level but in certain relevant parts of the
	sentence. Such tasks include, but are not limited to, sentiment expression
	annotation, product feature annotation, and template annotation for Q&#38;A
	systems. However, annotating the full corpus sentence by sentence is
	resource-intensive. In this paper, we propose an approach that iteratively
	extracts frequent parts of sentences for annotation and compresses the set of
	sentences after each round of annotation. Our approach can also be used to
	prepare training sentences for binary classification (domain-related vs.
	noise, subjectivity vs. objectivity, etc.), assuming that sentence-type
	annotation can be predicted from annotation of the most relevant
	sub-sentences. Two experiments are performed to test our proposal, evaluated
	in terms of time saved and annotation agreement.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>xu-yang-huang:2016:ALR12</bibkey>
  </paper>

  <paper id="5412">
    <title>The Kyutech corpus and topic segmentation using a combined method</title>
    <author><first>Takashi</first><last>Yamamura</last></author>
    <author><first>Kazutaka</first><last>Shimada</last></author>
    <author><first>Shintaro</first><last>Kawahara</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>95&#8211;104</pages>
    <url>http://aclweb.org/anthology/W16-5412</url>
    <abstract>Summarization of multi-party conversation is an important task in natural
	language processing.
	In this paper, we describe a Japanese corpus and a topic segmentation task.
	To the best of our knowledge, the corpus is the first Japanese corpus annotated
	for summarization tasks that is freely available to anyone.
	We call it "the Kyutech corpus."
	The corpus records a decision-making task with four participants and contains
	utterances with time information, topic segmentation, and reference summaries.
	As a case study on the corpus, we describe a method combining LCSeg and
	TopicTiling for the topic segmentation task.
	We discuss the effectiveness and the problems of the combined method through
	experiments on the Kyutech corpus.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>yamamura-shimada-kawahara:2016:ALR12</bibkey>
  </paper>

  <paper id="5413">
    <title>Automatic Evaluation of Commonsense Knowledge for Refining Japanese ConceptNet</title>
    <author><first>Seiya</first><last>Shudo</last></author>
    <author><first>Rafal</first><last>Rzepka</last></author>
    <author><first>Kenji</first><last>Araki</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>105&#8211;112</pages>
    <url>http://aclweb.org/anthology/W16-5413</url>
    <abstract>In this paper we present two methods for automatically evaluating common sense
	knowledge for Japanese entries in the ConceptNet ontology. Our proposed
	methods utilize a text-mining approach: one with relation clue words and
	WordNet synonyms, and one without. Both methods were tested on a blog corpus.
	The system based on our proposed methods reached relatively high precision
	scores for three relations (MadeOf, UsedFor, AtLocation), comparable with
	previous research using commercial search engines and simpler input. We
	analyze errors and discuss problems of common sense evaluation, both manual
	and automatic, and propose ideas for further improvement.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>shudo-rzepka-araki:2016:ALR12</bibkey>
  </paper>

  <paper id="5414">
    <title>SAMER: A Semi-Automatically Created Lexical Resource for Arabic Verbal Multiword Expressions Tokens Paradigm and their Morphosyntactic Features</title>
    <author><first>Mohamed</first><last>Al-Badrashiny</last></author>
    <author><first>Abdelati</first><last>Hawwari</last></author>
    <author><first>Mahmoud</first><last>Ghoneim</last></author>
    <author><first>Mona</first><last>Diab</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>113&#8211;122</pages>
    <url>http://aclweb.org/anthology/W16-5414</url>
    <abstract>Although MWEs are relatively fixed expressions morphologically and
	syntactically, several types of flexibility can be observed in MWEs, verbal
	MWEs in particular. Identifying the degree of morphological and syntactic
	flexibility of MWEs is very important for many lexicographic and NLP tasks.
	Adding MWE variants/tokens to a dictionary resource requires characterizing
	this flexibility among other morphosyntactic features. Carrying out the task
	manually faces several challenges: it is very laborious in terms of time and
	effort, and it suffers from limited coverage. The problem is exacerbated in
	morphologically rich languages, where the average word in Arabic can have 12
	possible inflected forms. Accordingly, in this paper we introduce a
	semi-automatically created Arabic multiword expressions resource (SAMER). We
	propose an automated method that identifies the morphological and syntactic
	flexibility of Arabic Verbal Multiword Expressions (AVMWE). All observed
	morphological variants and syntactic pattern alternations of an AVMWE are
	automatically acquired from large-scale corpora. We investigate three
	morphosyntactic aspects of AVMWE types, covering derivational and inflectional
	variation and syntactic templates, namely: 1) inflectional variation
	(inflectional paradigm) and degree of flexibility; 2) derivational
	productivity; and 3) identification and classification of the different
	syntactic types. We build a comprehensive list of AVMWE. Every token in the
	AVMWE list is lemmatized and tagged with POS information. We then search the
	Arabic Gigaword and all ATBs for all possible flexible matches. For each AVMWE
	type we generate: a) a statistically ranked list of MWE-lexeme inflections and
	syntactic pattern alternations; b) an abstract syntactic template; and c) the
	most frequent form. Our technique is validated using a gold-standard annotated
	MWE list. The results show that the quality of the generated resource is
	80.04%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>albadrashiny-EtAl:2016:ALR12</bibkey>
  </paper>

  <paper id="5415">
    <title>Sentiment Analysis for Low Resource Languages: A Study on Informal Indonesian Tweets</title>
    <author><first>Tuan Anh</first><last>Le</last></author>
    <author><first>David</first><last>Moeljadi</last></author>
    <author><first>Yasuhide</first><last>Miura</last></author>
    <author><first>Tomoko</first><last>Ohkuma</last></author>
    <booktitle>Proceedings of the 12th Workshop on Asian Language Resources (ALR12)</booktitle>
    <month>December</month>
    <year>2016</year>
    <address>Osaka, Japan</address>
    <publisher>The COLING 2016 Organizing Committee</publisher>
    <pages>123&#8211;131</pages>
    <url>http://aclweb.org/anthology/W16-5415</url>
    <abstract>This paper describes our attempt to build a sentiment analysis system for
	Indonesian tweets. With this system, we can computationally study and identify
	sentiments and opinions in a text or document. We used four thousand manually
	labeled tweets collected in February and March 2016 to build the model. Because
	of the variety of content in tweets, we classify tweets into eight groups in
	total, including pos(itive), neg(ative), and neu(tral). Finally, we obtained
	73.2% accuracy with a Long Short-Term Memory (LSTM) model without a normalizer.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>le-EtAl:2016:ALR12</bibkey>
  </paper>

</volume>

