<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="4600">
    <title>Proceedings of the Workshop on Speech-Centric Natural Language Processing</title>
    <editor>Nicholas Ruiz</editor>
    <editor>Srinivas Bangalore</editor>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-46</url>
    <bibtype>book</bibtype>
    <bibkey>Speech-Centric:2017</bibkey>
  </paper>

  <paper id="4601">
    <title>Functions of Silences towards Information Flow in Spoken Conversation</title>
    <author><first>Shammur Absar</first><last>Chowdhury</last></author>
    <author><first>Evgeny</first><last>Stepanov</last></author>
    <author><first>Morena</first><last>Danieli</last></author>
    <author><first>Giuseppe</first><last>Riccardi</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;9</pages>
    <url>http://www.aclweb.org/anthology/W17-4601</url>
    <abstract>Silence is an integral part of the most frequent turn-taking phenomena in
	spoken conversations. Silence is sized and placed within the conversation flow
	and is coordinated by the speakers along with the other speech acts. The
	objective of this analytical study is twofold: to explore the functions of
	silences with a duration of one second or more towards information flow in a
	dyadic conversation, utilizing the sequences of dialog acts present in the turns
	surrounding the silence itself; and to design a feature space useful for
	clustering the silences using a hierarchical concept formation algorithm. The
	resulting clusters are manually grouped into functional categories based on
	their similarities. It is observed that silence plays an important role in
	response preparation and can also indicate speakers' hesitation or
	indecisiveness. It is also observed that long silences can sometimes be used
	deliberately to force a response from another speaker, making silence a
	multi-functional and important catalyst towards information flow.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chowdhury-EtAl:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4602">
    <title>Encoding Word Confusion Networks with Recurrent Neural Networks for Dialog State Tracking</title>
    <author><first>Glorianna</first><last>Jagfeld</last></author>
    <author><first>Ngoc Thang</first><last>Vu</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>10&#8211;17</pages>
    <url>http://www.aclweb.org/anthology/W17-4602</url>
    <abstract>This paper presents our novel method to encode word confusion networks, which
	can represent a rich hypothesis space of automatic speech recognition systems,
	via recurrent neural networks.
	We demonstrate the utility of our approach for the task of dialog state
	tracking in spoken dialog systems that relies on automatic speech recognition
	output.
	Encoding confusion networks outperforms encoding the best automatic speech
	recognition hypothesis in a neural system for dialog state tracking on the
	well-known second Dialog State Tracking Challenge dataset.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jagfeld-vu:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4603">
    <title>Analyzing Human and Machine Performance In Resolving Ambiguous Spoken Sentences</title>
    <author><first>Hussein</first><last>Ghaly</last></author>
    <author><first>Michael</first><last>Mandel</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>18&#8211;26</pages>
    <url>http://www.aclweb.org/anthology/W17-4603</url>
    <attachment type="attachment">W17-4603.Attachment.zip</attachment>
    <abstract>Written sentences can be more ambiguous than spoken sentences. We investigate
	this difference for two different types of ambiguity: prepositional phrase (PP)
	attachment and sentences where the addition of commas changes the meaning. We
	recorded a native English speaker saying several of each type of sentence both
	with and without disambiguating contextual information.  These sentences were
	then presented either as text or audio and either with or without context to
	subjects who were asked to select the proper interpretation of the sentence.
	Results suggest that comma-ambiguous sentences are easier to disambiguate than
	PP-attachment-ambiguous sentences, possibly due to the presence of clear
	prosodic boundaries, namely silent pauses. Subject performance for sentences
	with PP-attachment ambiguity without context was 52% for text only while it was
	72.4% for audio only, suggesting that audio has more disambiguating information
	than text. Using an analysis of acoustic features of two PP-attachment
	sentences, a simple classifier was implemented to resolve the PP-attachment
	ambiguity as early or late closure, achieving a mean accuracy of 80%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ghaly-mandel:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4604">
    <title>Parsing transcripts of speech</title>
    <author><first>Andrew</first><last>Caines</last></author>
    <author><first>Michael</first><last>McCarthy</last></author>
    <author><first>Paula</first><last>Buttery</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>27&#8211;36</pages>
    <url>http://www.aclweb.org/anthology/W17-4604</url>
    <abstract>We present an analysis of parser performance on speech data, comparing word
	type and token frequency distributions with written data, and evaluating parse
	accuracy by length of input string. We find that parser performance tends to
	deteriorate with increasing length of string, more so for spoken than for
	written texts. We train an alternative parsing model with added speech data and
	demonstrate improvements in accuracy on speech-units, with no deterioration in
	performance on written text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>caines-mccarthy-buttery:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4605">
    <title>Enriching ASR Lattices with POS Tags for Dependency Parsing</title>
    <author><first>Moritz</first><last>Stiefel</last></author>
    <author><first>Ngoc Thang</first><last>Vu</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>37&#8211;47</pages>
    <url>http://www.aclweb.org/anthology/W17-4605</url>
    <abstract>Parsing speech requires a richer representation than 1-best or n-best
	hypotheses, e.g. lattices. Moreover, previous work shows that part-of-speech
	(POS) tags are a valuable resource for parsing. In this paper, we therefore
	explore a joint modeling approach of automatic speech recognition (ASR) and POS
	tagging to enrich ASR word lattices. To that end, we manipulate the ASR process
	from the pronouncing dictionary onward to use word-POS pairs instead of words.
	We evaluate ASR, POS tagging and dependency parsing (DP) performance,
	demonstrating a successful lattice-based integration of ASR and POS tagging.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>stiefel-vu:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4606">
    <title>End-to-End Information Extraction without Token-Level Supervision</title>
    <author><first>Rasmus Berg</first><last>Palm</last></author>
    <author><first>Dirk</first><last>Hovy</last></author>
    <author><first>Florian</first><last>Laws</last></author>
    <author><first>Ole</first><last>Winther</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>48&#8211;52</pages>
    <url>http://www.aclweb.org/anthology/W17-4606</url>
    <abstract>Most state-of-the-art information extraction approaches rely on token-level
	labels to find the areas of interest in text. Unfortunately, these labels are
	time-consuming and costly to create, and consequently, not available for many
	real-life IE tasks. To make matters worse, token-level labels are usually not
	the desired output, but just an intermediary step. 
	End-to-end (E2E) models, which take raw text as input and produce the desired
	output directly, need not depend on token-level labels. 
	We propose an E2E model based on pointer networks, which can be trained
	directly on pairs of raw input and output text.
	We evaluate our model on the ATIS data set, MIT restaurant corpus and the MIT
	movie corpus and compare to neural baselines that do use token-level labels. We
	achieve competitive results, within a few percentage points of the baselines,
	showing the feasibility of E2E information extraction without the need for
	token-level labels.
	This opens up new possibilities, as for many tasks currently addressed by human
	extractors, raw input and output data are available, but not token-level
	labels.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>palm-EtAl:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4607">
    <title>Spoken Term Discovery for Language Documentation using Translations</title>
    <author><first>Antonios</first><last>Anastasopoulos</last></author>
    <author><first>Sameer</first><last>Bansal</last></author>
    <author><first>David</first><last>Chiang</last></author>
    <author><first>Sharon</first><last>Goldwater</last></author>
    <author><first>Adam</first><last>Lopez</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>53&#8211;58</pages>
    <url>http://www.aclweb.org/anthology/W17-4607</url>
    <abstract>Vast amounts of speech data collected for language documentation and research
	remain untranscribed and unsearchable, but often a small amount of speech may
	have text translations available. We present a method for partially labeling
	additional speech with translations in this scenario. We modify an unsupervised
	speech-to-translation alignment model and obtain prototype speech segments that
	match the translation words, which are in turn used to discover terms in the
	unlabeled data. We evaluate our method on a Spanish-English speech translation
	corpus and on two corpora of endangered languages, Arapaho and Ainu,
	demonstrating its appropriateness and applicability in an actual
	very-low-resource scenario.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>anastasopoulos-EtAl:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4608">
    <title>Amharic-English Speech Translation in Tourism Domain</title>
    <author><first>Michael</first><last>Melese</last></author>
    <author><first>Laurent</first><last>Besacier</last></author>
    <author><first>Million</first><last>Meshesha</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>59&#8211;66</pages>
    <url>http://www.aclweb.org/anthology/W17-4608</url>
    <abstract>This paper describes speech translation from Amharic to English, particularly
	Automatic Speech Recognition (ASR) with a post-editing feature and
	Amharic-English Statistical Machine Translation (SMT). The ASR experiment is
	conducted using a morpheme language model (LM) and a phoneme acoustic model
	(AM). Likewise, SMT is conducted using words and morphemes as units.
	Morpheme-based translation shows a 6.29 BLEU score at 76.4% recognition
	accuracy, while word-based translation shows a 12.83 BLEU score at 77.4% word
	recognition accuracy. Further, after post-editing the Amharic ASR output using
	a corpus-based n-gram model, the word recognition accuracy increased by 1.42%.
	Since the post-editing approach reduces error propagation, the word-based
	translation accuracy improved by 0.25 BLEU (1.95%). We are now working towards
	further reducing propagated errors through different algorithms at each
	component of the speech translation cascade.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>melese-besacier-meshesha:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4609">
    <title>Speech- and Text-driven Features for Automated Scoring of English Speaking Tasks</title>
    <author><first>Anastassia</first><last>Loukina</last></author>
    <author><first>Nitin</first><last>Madnani</last></author>
    <author><first>Aoife</first><last>Cahill</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>67&#8211;77</pages>
    <url>http://www.aclweb.org/anthology/W17-4609</url>
    <abstract>We consider the automatic scoring of a task for which both the content of the
	response as well as its spoken fluency are important. We combine features from a
	text-only content scoring system originally designed for written responses with
	several categories of acoustic features. Although adding any single category of
	acoustic features to the text-only system on its own does not significantly
	improve performance, adding all acoustic features together does yield a small
	but significant improvement. These results are consistent for responses to
	open-ended questions and to questions focused on some given source material.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>loukina-madnani-cahill:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4610">
    <title>Improving coreference resolution with automatically predicted prosodic information</title>
    <author><first>Ina</first><last>Roesiger</last></author>
    <author><first>Sabrina</first><last>Stehwien</last></author>
    <author><first>Arndt</first><last>Riester</last></author>
    <author><first>Ngoc Thang</first><last>Vu</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>78&#8211;83</pages>
    <url>http://www.aclweb.org/anthology/W17-4610</url>
    <abstract>Adding manually annotated prosodic information, specifically pitch accents and
	phrasing, to the typical text-based feature set for coreference resolution has
	previously been shown to have a positive effect on German data. Practical
	applications on spoken language, however, would rely on automatically predicted
	prosodic information. In this paper we predict pitch accents (and phrase
	boundaries) using a convolutional neural network (CNN) model on acoustic
	features extracted from the speech signal. After an assessment of the quality
	of these automatic prosodic annotations, we show that they also significantly
	improve coreference resolution.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>roesiger-EtAl:2017:Speech-Centric</bibkey>
  </paper>

</volume>