Proceedings of the 10th Workshop on Building and Using Comparable Corpora

Proceedings of the 10th Workshop on Building and Using Comparable Corpora Serge Sharoff Pierre Zweigenbaum Reinhard Rapp August 2017

Vancouver, Canada

Association for Computational Linguistics http://www.aclweb.org/anthology/W17-25 book BUCC:2017 Users and Data: The Two Neglected Children of Bilingual Natural Language Processing Research Philippe Langlais Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 1–5 http://www.aclweb.org/anthology/W17-2501 Despite numerous studies devoted to mining parallel material from bilingual data, we have yet to see the resulting technologies wholeheartedly adopted by professional translators and terminologists alike. I argue that this state of affairs is mainly due to two factors: the emphasis published authors put on models (even though data is as important), and the conspicuous lack of concern for actual end-users. inproceedings langlais:2017:BUCC Deep Investigation of Cross-Language Plagiarism Detection Methods Jérémy Ferrero Laurent Besacier Didier Schwab Frédéric Agnès Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 6–15 http://www.aclweb.org/anthology/W17-2502 W17-2502.Presentation.pdf This paper is a deep investigation of cross-language plagiarism detection methods on a recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs on 2 granularities of text units in order to draw robust conclusions on the best methods while deeply analyzing correlations across document styles and languages. inproceedings ferrero-EtAl:2017:BUCC Sentence Alignment using Unfolding Recursive Autoencoders Jeenu Grover Pabitra Mitra Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 16–20 http://www.aclweb.org/anthology/W17-2503 In this paper, we propose a novel two-step algorithm for sentence alignment in monolingual corpora using Unfolding Recursive Autoencoders. First, we use unfolding recursive autoencoders (RAE) to learn feature vectors for phrases in the syntactic tree of the sentence. To compare two sentences we use a similarity matrix whose dimensions are proportional to the sizes of the two sentences. Since the dimensions of this matrix vary with sentence length, a dynamic pooling layer is used to map it to a matrix of fixed dimension. The resulting matrix is used to calculate the similarity scores between the two sentences. The second step of the algorithm captures the contexts in which the sentences occur in the document by using a dynamic programming algorithm for global alignment. inproceedings grover-mitra:2017:BUCC Acquisition of Translation Lexicons for Historically Unwritten Languages via Bridging Loanwords Michael Bloodgood Benjamin Strauss Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 21–25 http://www.aclweb.org/anthology/W17-2504 W17-2504.Presentation.pdf With the advent of informal electronic communications such as social media, colloquial languages that were historically unwritten are being written for the first time in heavily code-switched environments. We present a method for inducing portions of translation lexicons through the use of expert knowledge in these settings where there are approximately zero resources available other than a language informant, potentially not even large amounts of monolingual data. We investigate inducing a Moroccan Darija-English translation lexicon via French loanwords bridging into English and find that a useful lexicon is induced for human-assisted translation and statistical machine translation. inproceedings bloodgood-strauss:2017:BUCC Toward a Comparable Corpus of Latvian, Russian and English Tweets Dmitrijs Milajevs Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 26–30 http://www.aclweb.org/anthology/W17-2505 Twitter has become a rich source of linguistic data. Here, the possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia is investigated. Such a corpus, once constructed, might be of great use for multiple purposes including training machine translation models, examining cross-lingual phenomena and studying the population of Riga. This pilot study shows that it is feasible to build such a resource by collecting and analysing a pilot corpus, which is made publicly available and can be used to construct a large comparable corpus. inproceedings milajevs:2017:BUCC Automatic Extraction of Parallel Speech Corpora from Dubbed Movies Alp Öktem Mireia Farrús Leo Wanner Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 31–35 http://www.aclweb.org/anthology/W17-2506 W17-2506.Presentation.pdf This paper presents a methodology to extract parallel speech corpora based on any language pair from dubbed movies, together with an application framework in which some corresponding prosodic parameters are extracted. The obtained parallel corpora are especially suitable for speech-to-speech translation applications when a prosody transfer between source and target languages is desired. inproceedings oktem-farrus-wanner:2017:BUCC A parallel collection of clinical trials in Portuguese and English Mariana Neves Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 36–40 http://www.aclweb.org/anthology/W17-2507 W17-2507.Presentation.pdf Parallel collections of documents are crucial resources for training and evaluating machine translation (MT) systems. Even though large collections are available for certain domains and language pairs, these are still scarce in the biomedical domain. We developed a parallel corpus of clinical trials in Portuguese and English. The documents are derived from the Brazilian Clinical Trials Registry and the corpus currently contains a total of 1188 documents. In this paper, we describe the corpus construction and discuss the quality of the translation and the sentence alignment that we obtained. inproceedings neves:2017:BUCC Weighted Set-Theoretic Alignment of Comparable Sentences Andoni Azpeitia Thierry Etchegoyhen Eva Martínez Garcia Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 41–45 http://www.aclweb.org/anthology/W17-2508 This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had been previously shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task. inproceedings azpeitia-etchegoyhen-martinezgarcia:2017:BUCC BUCC 2017 Shared Task: a First Attempt Toward a Deep Learning Framework for Identifying Parallel Sentences in Comparable Corpora Francis Grégoire Philippe Langlais Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 46–50 http://www.aclweb.org/anthology/W17-2509 This paper describes our participation in the BUCC 2017 shared task: identifying parallel sentences in comparable corpora. Our goal is to leverage continuous vector representations and distributional semantics with a minimal use of external preprocessing and postprocessing tools. We report experiments that were conducted after transmitting our results. inproceedings gregoire-langlais:2017:BUCC zNLP: Identifying Parallel Sentences in Chinese-English Comparable Corpora Zheng Zhang Pierre Zweigenbaum Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 51–55 http://www.aclweb.org/anthology/W17-2510 This paper describes the zNLP system for the BUCC 2017 shared task. Our system identifies parallel sentence pairs in Chinese-English comparable corpora by translating Chinese sentences word-by-word into English, using the search engine Solr to select near-parallel sentences, and then using an SVM classifier to identify true parallel sentences from the previous results. It obtains an F1-score of 45% (resp. 32%) on the test (training) set. inproceedings zhang-zweigenbaum:2017:BUCC BUCC2017: A Hybrid Approach for Identifying Parallel Sentences in Comparable Corpora Sainik Mahata Dipankar Das Sivaji Bandyopadhyay Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 56–59 http://www.aclweb.org/anthology/W17-2511 A Statistical Machine Translation (SMT) system is always trained using a large parallel corpus to produce effective translation. Not only is such a corpus scarce, it also involves a lot of manual labor and cost. A parallel corpus can be prepared by employing comparable corpora, where a pair of corpora is in two different languages pointing to the same domain. In the present work, we try to build a parallel corpus for the French-English language pair from a given comparable corpus. The data and the problem set are provided as part of the shared task organized by BUCC 2017. We have proposed a system that first translates the sentences by heavily relying on Moses and then groups the sentences based on sentence length similarity. Finally, the one-to-one sentence selection was done based on cosine similarity. inproceedings mahata-das-bandyopadhyay:2017:BUCC Overview of the Second BUCC Shared Task: Spotting Parallel Sentences in Comparable Corpora Pierre Zweigenbaum Serge Sharoff Reinhard Rapp Proceedings of the 10th Workshop on Building and Using Comparable Corpora August 2017

Vancouver, Canada

Association for Computational Linguistics 60–67 http://www.aclweb.org/anthology/W17-2512 This paper presents the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. It recalls the design of the datasets, presents their final construction and statistics and the methods used to evaluate system results. 13 runs were submitted to the shared task by 4 teams, covering three of the four proposed language pairs: French-English (7 runs), German-English (3 runs), and Chinese-English (3 runs). The best F-scores as measured against the gold standard were 0.84 (German-English), 0.80 (French-English), and 0.43 (Chinese-English). Because of the design of the dataset, in which not all gold parallel sentence pairs are known, these are only minimum values. We examined manually a small sample of the false negative sentence pairs for the most precise French-English runs and estimated the number of parallel sentence pairs not yet in the provided gold standard. Adding them to the gold standard leads to revised estimates for the French-English F-scores of at most +1.5pt. This suggests that the BUCC 2017 datasets provide a reasonable approximate evaluation of the parallel sentence spotting task. inproceedings zweigenbaum-sharoff-rapp:2017:BUCC