<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="4600">
    <title>Proceedings of the Workshop on Speech-Centric Natural Language Processing</title>
    <editor>Nicholas Ruiz</editor>
    <editor>Srinivas Bangalore</editor>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-46</url>
    <bibtype>book</bibtype>
    <bibkey>Speech-Centric:2017</bibkey>
  </paper>

  <paper id="4601">
    <title>Functions of Silences towards Information Flow in Spoken Conversation</title>
    <author><first>Shammur Absar</first><last>Chowdhury</last></author>
    <author><first>Evgeny</first><last>Stepanov</last></author>
    <author><first>Morena</first><last>Danieli</last></author>
    <author><first>Giuseppe</first><last>Riccardi</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;9</pages>
    <url>http://www.aclweb.org/anthology/W17-4601</url>
    <abstract>Silence is an integral part of the most frequent turn-taking phenomena in
	spoken conversations. Silence is sized and placed within the conversation flow
	and is coordinated by the speakers along with the other speech acts. The
	objective of this analytical study is twofold: to explore the functions of
	silences with a duration of one second or more towards information flow in a
	dyadic conversation, utilizing the sequences of dialog acts present in the turns
	surrounding the silence itself; and to design a feature space useful for
	clustering the silences using a hierarchical concept formation algorithm. The
	resulting clusters are manually grouped into functional categories based on
	their similarities. It is observed that silence plays an important role in
	response preparation and can also indicate speakers' hesitation or
	indecisiveness. It is also observed that long silences can sometimes be used
	deliberately to force a response from another speaker, making silence a
	multi-functional and important catalyst towards information flow.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>chowdhury-EtAl:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4602">
    <title>Encoding Word Confusion Networks with Recurrent Neural Networks for Dialog State Tracking</title>
    <author><first>Glorianna</first><last>Jagfeld</last></author>
    <author><first>Ngoc Thang</first><last>Vu</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>10&#8211;17</pages>
    <url>http://www.aclweb.org/anthology/W17-4602</url>
    <abstract>This paper presents our novel method to encode word confusion networks, which
	can represent a rich hypothesis space of automatic speech recognition systems,
	via recurrent neural networks.
	We demonstrate the utility of our approach for the task of dialog state
	tracking in spoken dialog systems that relies on automatic speech recognition
	output.
	Encoding confusion networks outperforms encoding the best automatic speech
	recognition hypothesis in a neural system for dialog state tracking on the
	well-known second Dialog State Tracking Challenge dataset.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jagfeld-vu:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4603">
    <title>Analyzing Human and Machine Performance In Resolving Ambiguous Spoken Sentences</title>
    <author><first>Hussein</first><last>Ghaly</last></author>
    <author><first>Michael</first><last>Mandel</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>18&#8211;26</pages>
    <url>http://www.aclweb.org/anthology/W17-4603</url>
    <attachment type="attachment">W17-4603.Attachment.zip</attachment>
    <abstract>Written sentences can be more ambiguous than spoken sentences. We investigate
	this difference for two different types of ambiguity: prepositional phrase (PP)
	attachment and sentences where the addition of commas changes the meaning. We
	recorded a native English speaker saying several of each type of sentence both
	with and without disambiguating contextual information.  These sentences were
	then presented either as text or audio and either with or without context to
	subjects who were asked to select the proper interpretation of the sentence.
	Results suggest that comma-ambiguous sentences are easier to disambiguate than
	PP-attachment-ambiguous sentences, possibly due to the presence of clear
	prosodic boundaries, namely silent pauses. Subject performance for sentences
	with PP-attachment ambiguity without context was 52% for text only while it was
	72.4% for audio only, suggesting that audio has more disambiguating information
	than text. Using an analysis of acoustic features of two PP-attachment
	sentences, a simple classifier was implemented to resolve the PP-attachment
	ambiguity as early or late closure, achieving a mean accuracy of 80%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ghaly-mandel:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4604">
    <title>Parsing transcripts of speech</title>
    <author><first>Andrew</first><last>Caines</last></author>
    <author><first>Michael</first><last>McCarthy</last></author>
    <author><first>Paula</first><last>Buttery</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>27&#8211;36</pages>
    <url>http://www.aclweb.org/anthology/W17-4604</url>
    <abstract>We present an analysis of parser performance on speech data, comparing word
	type and token frequency distributions with written data, and evaluating parse
	accuracy by length of input string. We find that parser performance tends to
	deteriorate with increasing length of string, more so for spoken than for
	written texts. We train an alternative parsing model with added speech data and
	demonstrate improvements in accuracy on speech-units, with no deterioration in
	performance on written text.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>caines-mccarthy-buttery:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4605">
    <title>Enriching ASR Lattices with POS Tags for Dependency Parsing</title>
    <author><first>Moritz</first><last>Stiefel</last></author>
    <author><first>Ngoc Thang</first><last>Vu</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>37&#8211;47</pages>
    <url>http://www.aclweb.org/anthology/W17-4605</url>
    <abstract>Parsing speech requires a richer representation than 1-best or n-best
	hypotheses, e.g. lattices. Moreover, previous work shows that part-of-speech
	(POS) tags are a valuable resource for parsing. In this paper, we therefore
	explore a joint modeling approach of automatic speech recognition (ASR) and POS
	tagging to enrich ASR word lattices. To that end, we manipulate the ASR process
	from the pronouncing dictionary onward to use word-POS pairs instead of words.
	We evaluate ASR, POS tagging and dependency parsing (DP) performance,
	demonstrating a successful lattice-based integration of ASR and POS tagging.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>stiefel-vu:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4606">
    <title>End-to-End Information Extraction without Token-Level Supervision</title>
    <author><first>Rasmus Berg</first><last>Palm</last></author>
    <author><first>Dirk</first><last>Hovy</last></author>
    <author><first>Florian</first><last>Laws</last></author>
    <author><first>Ole</first><last>Winther</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>48&#8211;52</pages>
    <url>http://www.aclweb.org/anthology/W17-4606</url>
    <abstract>Most state-of-the-art information extraction approaches rely on token-level
	labels to find the areas of interest in text. Unfortunately, these labels are
	time-consuming and costly to create, and consequently, not available for many
	real-life IE tasks. To make matters worse, token-level labels are usually not
	the desired output, but just an intermediary step. 
	End-to-end (E2E) models, which take raw text as input and produce the desired
	output directly, need not depend on token-level labels. 
	We propose an E2E model based on pointer networks, which can be trained
	directly on pairs of raw input and output text.
	We evaluate our model on the ATIS data set, MIT restaurant corpus and the MIT
	movie corpus and compare to neural baselines that do use token-level labels. We
	achieve competitive results, within a few percentage points of the baselines,
	showing the feasibility of E2E information extraction without the need for
	token-level labels.
	This opens up new possibilities, as for many tasks currently addressed by human
	extractors, raw input and output data are available, but not token-level
	labels.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>palm-EtAl:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4607">
    <title>Spoken Term Discovery for Language Documentation using Translations</title>
    <author><first>Antonios</first><last>Anastasopoulos</last></author>
    <author><first>Sameer</first><last>Bansal</last></author>
    <author><first>David</first><last>Chiang</last></author>
    <author><first>Sharon</first><last>Goldwater</last></author>
    <author><first>Adam</first><last>Lopez</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>53&#8211;58</pages>
    <url>http://www.aclweb.org/anthology/W17-4607</url>
    <abstract>Vast amounts of speech data collected for language documentation and research
	remain untranscribed and unsearchable, but often a small amount of speech may
	have text translations available. We present a method for partially labeling
	additional speech with translations in this scenario. We modify an unsupervised
	speech-to-translation alignment model and obtain prototype speech segments that
	match the translation words, which are in turn used to discover terms in the
	unlabeled data. We evaluate our method on a Spanish-English speech translation
	corpus and on two corpora of endangered languages, Arapaho and Ainu,
	demonstrating its appropriateness and applicability in an actual
	very-low-resource scenario.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>anastasopoulos-EtAl:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4608">
    <title>Amharic-English Speech Translation in Tourism Domain</title>
    <author><first>Michael</first><last>Melese</last></author>
    <author><first>Laurent</first><last>Besacier</last></author>
    <author><first>Million</first><last>Meshesha</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>59&#8211;66</pages>
    <url>http://www.aclweb.org/anthology/W17-4608</url>
    <abstract>This paper describes speech translation from Amharic to English, particularly
	Automatic Speech Recognition (ASR) with a post-editing feature and
	Amharic-English Statistical Machine Translation (SMT). The ASR experiment is
	conducted using a morpheme language model (LM) and a phoneme acoustic model
	(AM). Likewise, SMT is conducted using words and morphemes as units.
	Morpheme-based translation shows a 6.29 BLEU score at 76.4% recognition
	accuracy, while word-based translation shows a 12.83 BLEU score at 77.4% word
	recognition accuracy. Further, after post-editing the Amharic ASR output using
	a corpus-based n-gram model, the word recognition accuracy increased by 1.42%.
	Since the post-editing approach reduces error propagation, the word-based
	translation accuracy improved by 0.25 BLEU (1.95%). We are now working towards
	further reducing propagated errors through different algorithms at each
	component of the speech translation cascade.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>melese-besacier-meshesha:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4609">
    <title>Speech- and Text-driven Features for Automated Scoring of English Speaking Tasks</title>
    <author><first>Anastassia</first><last>Loukina</last></author>
    <author><first>Nitin</first><last>Madnani</last></author>
    <author><first>Aoife</first><last>Cahill</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>67&#8211;77</pages>
    <url>http://www.aclweb.org/anthology/W17-4609</url>
    <abstract>We consider the automatic scoring of a task for which both the content of the
	response as well as its spoken fluency are important. We combine features from a
	text-only content scoring system originally designed for written responses with
	several categories of acoustic features. Although adding any single category of
	acoustic features to the text-only system on its own does not significantly
	improve performance, adding all acoustic features together does yield a small
	but significant improvement. These results are consistent for responses to
	open-ended questions and to questions focused on some given source material.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>loukina-madnani-cahill:2017:Speech-Centric</bibkey>
  </paper>

  <paper id="4610">
    <title>Improving coreference resolution with automatically predicted prosodic information</title>
    <author><first>Ina</first><last>Roesiger</last></author>
    <author><first>Sabrina</first><last>Stehwien</last></author>
    <author><first>Arndt</first><last>Riester</last></author>
    <author><first>Ngoc Thang</first><last>Vu</last></author>
    <booktitle>Proceedings of the Workshop on Speech-Centric Natural Language Processing</booktitle>
    <month>September</month>
    <year>2017</year>
    <address>Copenhagen, Denmark</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>78&#8211;83</pages>
    <url>http://www.aclweb.org/anthology/W17-4610</url>
    <abstract>Adding manually annotated prosodic information, specifically pitch accents and
	phrasing, to the typical text-based feature set for coreference resolution has
	previously been shown to have a positive effect on German data. Practical
	applications on spoken language, however, would rely on automatically predicted
	prosodic information. In this paper we predict pitch accents (and phrase
	boundaries) using a convolutional neural network (CNN) model on acoustic
	features extracted from the speech signal. After an assessment of the quality
	of these automatic prosodic annotations, we show that they also significantly
	improve coreference resolution.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>roesiger-EtAl:2017:Speech-Centric</bibkey>
  </paper>

</volume>