Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology
Sandra Kuebler, Garrett Nicolai
October 2018

Brussels, Belgium

Association for Computational Linguistics http://www.aclweb.org/anthology/W18-58
book SIGMORPHON:2018

Efficient Computation of Implicational Universals in Constraint-Based Phonology Through the Hyperplane Separation Theorem
Giorgio Magri
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 1–10 http://www.aclweb.org/anthology/W18-5801
This paper focuses on the most basic implicational universals in phonological theory, called T-orders after Anttila and Andrus (2006). It develops necessary and sufficient constraint characterizations of T-orders within Harmonic Grammar and Optimality Theory. These conditions rest on the rich convex geometry underlying these frameworks. They are phonologically intuitive and have significant algorithmic implications.
inproceedings magri:2018:SIGMORPHON

Lexical Networks in !Xung
Syed-Amad Hussain, Micha Elsner, Amanda Miller
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 11–20 http://www.aclweb.org/anthology/W18-5802
We investigate the lexical network properties of Mangetti Dune !Xung, a Southern African language with a large phoneme inventory, and compare them to those of English and other commonly studied languages. Lexical networks are graphs in which nodes (words) are linked to their minimal pairs; global properties of these networks are believed to mediate lexical access in the minds of speakers. We show that the network properties of !Xung are within the range found in previously studied languages. By simulating data ("pseudolexicons") with varying levels of phonotactic structure, we find that the lexical network properties of !Xung diverge from previously studied languages when fewer phonotactic constraints are retained. We conclude that lexical network properties are representative of an underlying cognitive structure which is necessary for efficient word retrieval and that the phonotactics of !Xung may be shaped by a selective pressure which preserves network properties within this cognitively useful range.
inproceedings hussain-elsner-miller:2018:SIGMORPHON

Acoustic Word Disambiguation with Phonological Features in Danish ASR
Andreas Søeborg Kirkedal
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 21–31 http://www.aclweb.org/anthology/W18-5803
Phonological features can indicate word class, and we can use word class information to disambiguate both homophones and homographs in automatic speech recognition (ASR).
inproceedings kirkedal:2018:SIGMORPHON

Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages
Pierre Godard, Laurent Besacier, François Yvon, Martine Adda-Decker, Gilles Adda, Hélène Maynard, Annie Rialland
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 32–42 http://www.aclweb.org/anthology/W18-5804
Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages. The second is to explore the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment with 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language-specific knowledge leads to great improvements on the word discovery task, ultimately achieving a gain of about 30% token F-score over the results of a strong baseline.
inproceedings godard-EtAl:2018:SIGMORPHON

String Transduction with Target Language Models and Insertion Handling
Garrett Nicolai, Saeed Najafi, Grzegorz Kondrak
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 43–53 http://www.aclweb.org/anthology/W18-5805
Many character-level tasks can be framed as
inproceedings nicolai-najafi-kondrak:2018:SIGMORPHON

Complementary Strategies for Low Resourced Morphological Modeling
Alexander Erdmann, Nizar Habash
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 54–65 http://www.aclweb.org/anthology/W18-5806
Morphologically rich languages are challenging for natural language processing tasks due to data sparsity. This can be addressed either by introducing out-of-context morphological knowledge, or by developing machine learning architectures that specifically target data sparsity and/or morphological information. We find these approaches to complement each other in a morphological paradigm modeling task in Modern Standard Arabic, which, in addition to being morphologically complex, features ubiquitous ambiguity, exacerbating sparsity with noise. Given a small number of out-of-context rules describing closed class morphology, we combine them with word embeddings leveraging subword strings and noise reduction techniques. The combination outperforms both approaches individually by about 20% absolute. While morphological resources already exist for Modern Standard Arabic, our results inform how comparable resources might be constructed for non-standard dialects or any morphologically rich, low resourced language, given scarcity of time and
inproceedings erdmann-habash:2018:SIGMORPHON

Modeling Reduplication with 2-way Finite-State Transducers
Hossep Dolatian, Jeffrey Heinz
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 66–77 http://www.aclweb.org/anthology/W18-5807
This article describes a novel approach to the computational modeling of reduplication. Reduplication is a well-studied linguistic phenomenon. However, it is often treated as a stumbling block within finite-state treatments of morphology. Most finite-state implementations of computational morphology cannot adequately capture the productivity of unbounded copying in reduplication, nor can they adequately capture bounded copying. We show that an understudied type of finite-state machine, the two-way finite-state transducer (2-way FST), captures virtually all reduplicative processes, including total reduplication. 2-way FSTs can model reduplicative typology in a way which is convenient, easy to design and debug in practice, and linguistically motivated. By virtue of being finite-state, 2-way FSTs can likewise be incorporated into existing finite-state systems and programs. A small but representative typology of reduplicative processes is described in this article, alongside their corresponding 2-way FST models.
inproceedings dolatian-heinz:2018:SIGMORPHON

Automatically Tailoring Unsupervised Morphological Segmentation to the Language
Ramy Eskander, Owen Rambow, Smaranda Muresan
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 78–83 http://www.aclweb.org/anthology/W18-5808
Morphological segmentation is beneficial for several natural language processing tasks dealing with large vocabularies. Unsupervised methods for morphological segmentation are essential for handling a diverse set of languages, including low-resource languages. Eskander et al. (2016) introduced a Language Independent Morphological Segmenter (LIMS) using Adaptor Grammars (AG) based on the best-on-average performing AG configuration. However, while LIMS worked best on average and outperformed other state-of-the-art unsupervised morphological segmentation approaches, it did not provide the optimal AG configuration for five out of the six languages. We propose two language-independent classifiers that enable the selection of the optimal or nearly optimal configuration for the morphological segmentation of unseen languages.
inproceedings eskander-rambow-muresan:2018:SIGMORPHON

A Comparison of Entity Matching Methods between English and Japanese Katakana
Michiharu Yamashita, Hideki Awashima, Hidekazu Oiwa
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 84–92 http://www.aclweb.org/anthology/W18-5809
Japanese Katakana is one component of the Japanese writing system and is used to express English terms, loanwords, and onomatopoeia in Japanese characters based on their phonemes. The main purpose of this research is to find the best entity matching methods between English and Katakana. We pose two research questions to clarify which types of entity matching systems work better than others. The first question is which transliteration should be used for conversion. We need to transliterate English or Katakana terms into the same form in order to compute the string similarity. We consider five conversions: English to Katakana directly, Katakana to English directly, English to Katakana via phonemes, Katakana to English via phonemes, and both English and Katakana to phonemes. The second question is which similarity measure should be used for entity matching. To investigate this, we choose six methods: Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, Levenshtein, and the similarity of the phoneme probabilities predicted by an RNN. Our results show that 1) matching using phonemes and conversion of Katakana to English works better than other methods, and 2) the phoneme similarity outperforms other methods, while the other similarity scores vary depending on the data and models.
inproceedings yamashita-awashima-oiwa:2018:SIGMORPHON

Seq2Seq Models with Dropout can Learn Generalizable Reduplication
Brandon Prickett, Aaron Traylor, Joe Pater
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 93–100 http://www.aclweb.org/anthology/W18-5810
Natural language reduplication can pose a challenge to neural models of language, and has been argued to require variables (Marcus et al., 1999). Sequence-to-sequence neural networks have been shown to perform well at a number of other morphological tasks (Cotterell et al., 2016), and produce results that highly correlate with human behavior (Kirov, 2017; Kirov & Cotterell, 2018) but do not include any explicit variables in their architecture. We find that they can learn a reduplicative pattern that generalizes to novel segments if they are trained with dropout (Srivastava et al., 2014). We argue that this matches the scope of generalization observed in human reduplication.
inproceedings prickett-traylor-pater:2018:SIGMORPHON

A Characterwise Windowed Approach to Hebrew Morphological Segmentation
Amir Zeldes
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 101–110 http://www.aclweb.org/anthology/W18-5811
This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out-of-domain Wikipedia dataset, an improvement of ~4% and 5% over previous state-of-the-art performance.
inproceedings zeldes:2018:SIGMORPHON

Phonetic Vector Representations for Sound Sequence Alignment
Pavel Sofroniev, Çağrı Çöltekin
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 111–116 http://www.aclweb.org/anthology/W18-5812
This study explores a number of data-driven vector representations of the
inproceedings sofroniev-ltekin:2018:SIGMORPHON

Sounds Wilde. Phonetically Extended Embeddings for Author-Stylized Poetry Generation
Aleksey Tikhonov, Ivan Yamshchikov
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 117–124 http://www.aclweb.org/anthology/W18-5813
This paper addresses author-stylized text generation. Using a version of a language model with extended phonetic and semantic embeddings for poetry generation, we show that phonetics contributes to overall model performance comparably to information about the target author. Phonetic information is shown to be important for both English and Russian. Humans tend to attribute machine-generated texts to the target author.
inproceedings tikhonov-yamshchikov:2018:SIGMORPHON

On Hapax Legomena and Morphological Productivity
Janet Pierrehumbert, Ramon Granell
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 125–130 http://www.aclweb.org/anthology/W18-5814
Quantifying and predicting morphological productivity is a long-standing challenge in corpus linguistics and psycholinguistics. The same challenge reappears in natural language processing in the context of handling words that were not seen in the training set (out-of-vocabulary, or OOV, words). Prior research showed that a good indicator of the productivity of a morpheme is the number of words involving it that occur exactly once (the hapax legomena). A technical connection was adduced between this result and Good-Turing smoothing, which assigns probability mass to unseen events on the basis of the simplifying assumption that word frequencies are stationary. In a large-scale study of 133 affixes in Wikipedia, we develop evidence that success in fact depends on tapping the frequency range in which the assumptions of Good-Turing are violated.
inproceedings pierrehumbert-granell:2018:SIGMORPHON

A Morphological Analyzer for Shipibo-Konibo
Ronald Cardenas, Daniel Zeman
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 131–139 http://www.aclweb.org/anthology/W18-5815
We present a morphological analyzer for Shipibo-Konibo, a low-resourced native language spoken in the Amazonian region of Peru.
inproceedings cardenas-zeman:2018:SIGMORPHON

An Arabic Morphological Analyzer and Generator with Copious Features
Dima Taji, Salam Khalifa, Ossama Obeid, Fadhl Eryani, Nizar Habash
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 140–150 http://www.aclweb.org/anthology/W18-5816
We introduce CALIMA-Star, a very rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality and much more. This tool includes a fast engine that can be easily integrated into other systems, as well as an easy-to-use API and a web interface. CALIMA-Star also supports morphological reinflection. We evaluate CALIMA-Star against four commonly used analyzers for Arabic in terms of speed and morphological content.
inproceedings taji-EtAl:2018:SIGMORPHON

Sanskrit n-Retroflexion is Input-Output Tier-Based Strictly Local
Thomas Graf, Connor Mayer
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 151–160 http://www.aclweb.org/anthology/W18-5817
Sanskrit /n/-retroflexion (nati) is one of the most complex segmental processes in phonology. While it is still star-free, it does not fit in any of the subregular classes that are commonly entertained in the literature. We show that when construed as a phonotactic dependency, the process fits into a class we call input-output tier-based strictly local (IO-TSL), a natural extension of the familiar class TSL. IO-TSL increases the power of TSL's tier projection function by making it an input-output strictly local transduction. Assuming that /n/-retroflexion represents the upper bound on the complexity of segmental phonology, this shows that all of segmental phonology can be captured by combining the intuitive notion of tiers with the independently motivated machinery of strictly local mappings.
inproceedings graf-mayer:2018:SIGMORPHON

Phonological Features for Morphological Inflection
Adam Wiemerslage, Miikka Silfverberg, Mans Hulden
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 161–166 http://www.aclweb.org/anthology/W18-5818
Modeling morphological inflection is an important task in Natural Language Processing. In contrast to earlier work that has largely used orthographic representations, we experiment with this task in a phonetic character space, representing inputs as either IPA segments or bundles of phonological distinctive features. We show that both of these inputs, somewhat counterintuitively, achieve similar accuracies on morphological inflection, slightly lower than orthographic models. We conclude that providing detailed phonological representations is largely redundant when compared to IPA segments, and that articulatory distinctions relevant for word inflection are already latently present in the distributional properties of many graphemic writing systems.
inproceedings wiemerslage-silfverberg-hulden:2018:SIGMORPHON

Extracting Morphophonology from Small Corpora
Marina Ermolaeva
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, October 2018

Brussels, Belgium

Association for Computational Linguistics 167–175 http://www.aclweb.org/anthology/W18-5819
Probabilistic approaches have proven effective in learning phonological structure. In contrast, theoretical linguistics usually works with deterministic generalizations. The goal of this paper is to explore possible interactions between information-theoretic methods and deterministic linguistic knowledge and to examine some ways in which both can be used in tandem to extract phonological and morphophonological patterns from a small annotated dataset. Local and nonlocal processes in Mishar Tatar (Turkic/Kipchak) are examined as a case study.
inproceedings ermolaeva:2018:SIGMORPHON