<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="2500">
    <title>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</title>
    <editor><first>Serge</first><last>Sharoff</last></editor>
    <editor><first>Pierre</first><last>Zweigenbaum</last></editor>
    <editor><first>Reinhard</first><last>Rapp</last></editor>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-25</url>
    <bibtype>book</bibtype>
    <bibkey>BUCC:2017</bibkey>
  </paper>

  <paper id="2501">
    <title>Users and Data: The Two Neglected Children of Bilingual Natural Language Processing Research</title>
    <author><first>Philippe</first><last>Langlais</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;5</pages>
    <url>http://www.aclweb.org/anthology/W17-2501</url>
    <abstract>Despite numerous studies devoted to mining parallel material from bilingual
	data, we have yet to see the resulting technologies wholeheartedly adopted by
	professional translators and terminologists alike. I argue that this state of
	affairs is mainly due to two factors: the emphasis published authors put on
	models (even though data is as important), and the conspicuous lack of concern
	for actual end-users.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>langlais:2017:BUCC</bibkey>
  </paper>

  <paper id="2502">
    <title>Deep Investigation of Cross-Language Plagiarism Detection Methods</title>
    <author><first>J&#233;r&#233;my</first><last>Ferrero</last></author>
    <author><first>Laurent</first><last>Besacier</last></author>
    <author><first>Didier</first><last>Schwab</last></author>
    <author><first>Fr&#233;d&#233;ric</first><last>Agn&#232;s</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>6&#8211;15</pages>
    <url>http://www.aclweb.org/anthology/W17-2502</url>
    <attachment type="presentation">W17-2502.Presentation.pdf</attachment>
    <abstract>This paper is a deep investigation of cross-language plagiarism detection
	methods on a new recently introduced open dataset, which contains parallel and
	comparable collections of documents with multiple characteristics (different
	genres, languages and sizes of texts). We investigate cross-language plagiarism
	detection methods for 6 language pairs on 2 granularities of text units in
	order to draw robust conclusions on the best methods while deeply analyzing
	correlations across document styles and languages.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ferrero-EtAl:2017:BUCC</bibkey>
  </paper>

  <paper id="2503">
    <title>Sentence Alignment using Unfolding Recursive Autoencoders</title>
    <author><first>Jeenu</first><last>Grover</last></author>
    <author><first>Pabitra</first><last>Mitra</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>16&#8211;20</pages>
    <url>http://www.aclweb.org/anthology/W17-2503</url>
    <abstract>In this paper, we propose a novel two step algorithm for sentence alignment in
	monolingual corpora using Unfolding Recursive Autoencoders. First, we use
	unfolding recursive auto-encoders (RAE) to learn feature vectors for phrases in
	the syntactical tree of the sentence. To compare two sentences we use a similarity
	matrix which has dimensions proportional to the size of the two sentences.
	Since the similarity matrix generated to compare two sentences has varying
	dimension due to different sentence lengths, a dynamic pooling layer is used to
	map it to a matrix of fixed dimension. The resulting matrix is used to
	calculate the similarity scores between the two sentences. The second step of
	the algorithm captures the contexts in which the sentences occur in the
	document by using a dynamic programming algorithm for global alignment.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>grover-mitra:2017:BUCC</bibkey>
  </paper>

  <paper id="2504">
    <title>Acquisition of Translation Lexicons for Historically Unwritten Languages via Bridging Loanwords</title>
    <author><first>Michael</first><last>Bloodgood</last></author>
    <author><first>Benjamin</first><last>Strauss</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>21&#8211;25</pages>
    <url>http://www.aclweb.org/anthology/W17-2504</url>
    <attachment type="presentation">W17-2504.Presentation.pdf</attachment>
    <abstract>With the advent of informal electronic communications such as social media,
	colloquial languages that were historically unwritten are being written for the
	first time in heavily code-switched environments. We present a method for
	inducing portions of translation lexicons through the use of expert knowledge
	in these settings where there are approximately zero resources available other
	than a language informant, potentially not even large amounts of monolingual
	data. We investigate inducing a Moroccan Darija-English translation lexicon via
	French loanwords bridging into English and find that a useful lexicon is
	induced for human-assisted translation and statistical machine translation.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bloodgood-strauss:2017:BUCC</bibkey>
  </paper>

  <paper id="2505">
    <title>Toward a Comparable Corpus of Latvian, Russian and English Tweets</title>
    <author><first>Dmitrijs</first><last>Milajevs</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>26&#8211;30</pages>
    <url>http://www.aclweb.org/anthology/W17-2505</url>
    <abstract>Twitter has become a rich source for linguistic data. Here, a possibility of
	building a trilingual Latvian-Russian-English corpus of tweets from Riga,
	Latvia is investigated. Such a corpus, once constructed, might be of great use
	for multiple purposes including training machine translation models, examining
	cross-lingual phenomena and studying the population of Riga. This pilot study
	shows that it is feasible to build such a resource by collecting and analysing
	a pilot corpus, which is made publicly available and can be used to construct a
	large comparable corpus.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>milajevs:2017:BUCC</bibkey>
  </paper>

  <paper id="2506">
    <title>Automatic Extraction of Parallel Speech Corpora from Dubbed Movies</title>
    <author><first>Alp</first><last>&#214;ktem</last></author>
    <author><first>Mireia</first><last>Farr&#250;s</last></author>
    <author><first>Leo</first><last>Wanner</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>31&#8211;35</pages>
    <url>http://www.aclweb.org/anthology/W17-2506</url>
    <attachment type="presentation">W17-2506.Presentation.pdf</attachment>
    <abstract>This paper presents a methodology to extract parallel speech corpora based on
	any language pair from dubbed movies, together with an application framework in
	which some corresponding prosodic parameters are extracted. The obtained
	parallel corpora are especially suitable for speech-to-speech translation
	applications when a prosody transfer between source and target languages is
	desired.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>oktem-farrus-wanner:2017:BUCC</bibkey>
  </paper>

  <paper id="2507">
    <title>A parallel collection of clinical trials in Portuguese and English</title>
    <author><first>Mariana</first><last>Neves</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>36&#8211;40</pages>
    <url>http://www.aclweb.org/anthology/W17-2507</url>
    <attachment type="presentation">W17-2507.Presentation.pdf</attachment>
    <abstract>Parallel collections of documents are crucial resources for training and
	evaluating machine translation (MT) systems. Even though large collections are
	available for certain domains and language pairs, these are still scarce in the
	biomedical domain. We developed a parallel corpus of clinical trials in Portuguese and
	English. The documents are derived from the Brazilian Clinical Trials Registry
	and the corpus currently contains a total of 1188 documents. In this paper, we
	describe the corpus construction and discuss the quality of the translation and
	the sentence alignment that we obtained.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>neves:2017:BUCC</bibkey>
  </paper>

  <paper id="2508">
    <title>Weighted Set-Theoretic Alignment of Comparable Sentences</title>
    <author><first>Andoni</first><last>Azpeitia</last></author>
    <author><first>Thierry</first><last>Etchegoyhen</last></author>
    <author><first>Eva</first><last>Mart&#237;nez Garcia</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>41&#8211;45</pages>
    <url>http://www.aclweb.org/anthology/W17-2508</url>
    <abstract>This article presents the STACCw system for the BUCC 2017 shared task on
	parallel sentence extraction from comparable corpora. The original STACC
	approach, based on set-theoretic operations over bags of words, had been
	previously shown to be efficient and portable across domains and alignment
	scenarios. We describe an extension of this approach with a new weighting scheme
	and show that it provides significant improvements on the datasets provided for
	the shared task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>azpeitia-etchegoyhen-martinezgarcia:2017:BUCC</bibkey>
  </paper>

  <paper id="2509">
    <title>BUCC 2017 Shared Task: a First Attempt Toward a Deep Learning Framework for Identifying Parallel Sentences in Comparable Corpora</title>
    <author><first>Francis</first><last>Gr&#233;goire</last></author>
    <author><first>Philippe</first><last>Langlais</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>46&#8211;50</pages>
    <url>http://www.aclweb.org/anthology/W17-2509</url>
    <abstract>This paper describes our participation in the BUCC 2017 shared task:
	identifying parallel sentences in comparable corpora. Our goal is to leverage
	continuous vector representations and distributional semantics with a minimal
	use of external preprocessing and postprocessing tools. We report experiments
	that were conducted after transmitting our results.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gregoire-langlais:2017:BUCC</bibkey>
  </paper>

  <paper id="2510">
    <title>zNLP: Identifying Parallel Sentences in Chinese-English Comparable Corpora</title>
    <author><first>Zheng</first><last>Zhang</last></author>
    <author><first>Pierre</first><last>Zweigenbaum</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>51&#8211;55</pages>
    <url>http://www.aclweb.org/anthology/W17-2510</url>
    <abstract>This paper describes the zNLP system for the BUCC 2017 shared task. Our system
	identifies parallel sentence pairs in Chinese-English comparable corpora by
	translating word-by-word Chinese sentences into English,
	using the search engine Solr to select near-parallel sentences and then by
	using an SVM classifier to identify true parallel sentences from the previous
	results. It obtains an F1-score of 45% (resp. 32%) on the test (training) set.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zhang-zweigenbaum:2017:BUCC</bibkey>
  </paper>

  <paper id="2511">
    <title>BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora</title>
    <author><first>Sainik</first><last>Mahata</last></author>
    <author><first>Dipankar</first><last>Das</last></author>
    <author><first>Sivaji</first><last>Bandyopadhyay</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>56&#8211;59</pages>
    <url>http://www.aclweb.org/anthology/W17-2511</url>
    <abstract>A Statistical Machine Translation (SMT) system is always trained using a large
	parallel corpus to produce effective translation. Not only is the corpus
	scarce, it also involves a lot of manual labor and cost. Parallel corpus can be
	prepared by employing comparable corpora where a pair of corpora is in two
	different languages pointing to the same domain. In the present work, we try to
	build a parallel corpus for French-English language pair from a given
	comparable corpus. The data and the problem set are provided as part of the
	shared task organized by BUCC 2017. We have proposed a system that first
	translates the sentences by heavily relying on Moses and then groups the
	sentences based on sentence length similarity. Finally, the one to one sentence
	selection was done based on Cosine Similarity algorithm.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>mahata-das-bandyopadhyay:2017:BUCC</bibkey>
  </paper>

  <paper id="2512">
    <title>Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora</title>
    <author><first>Pierre</first><last>Zweigenbaum</last></author>
    <author><first>Serge</first><last>Sharoff</last></author>
    <author><first>Reinhard</first><last>Rapp</last></author>
    <booktitle>Proceedings of the 10th Workshop on Building and Using Comparable Corpora</booktitle>
    <month>August</month>
    <year>2017</year>
    <address>Vancouver, Canada</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>60&#8211;67</pages>
    <url>http://www.aclweb.org/anthology/W17-2512</url>
    <abstract>This paper presents the BUCC 2017 shared task on parallel sentence extraction
	from comparable corpora. It recalls the design of the datasets, presents their
	final construction and statistics and the methods used to evaluate system
	results. 13 runs were submitted to the shared task by 4 teams, covering three of
	the four proposed language pairs: French-English (7 runs), German-English (3
	runs), and Chinese-English (3 runs). The best F-scores as measured against the
	gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43
	(Chinese-English). Because of the design of the dataset, in which not all gold
	parallel sentence pairs are known, these are only minimum values. We examined
	manually a small sample of the false negative sentence pairs for the most
	precise French-English runs and estimated the number of parallel sentence pairs
	not yet in the provided gold standard. Adding them to the gold standard leads to
	revised estimates for the French-English F-scores of at most +1.5pt. This
	suggests that the BUCC 2017 datasets provide a reasonable approximate
	evaluation of the parallel sentence spotting task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zweigenbaum-sharoff-rapp:2017:BUCC</bibkey>
  </paper>

</volume>

