2024
Aligning Speech Segments Beyond Pure Semantics
Kevin Heffernan | Artyom Kozhevnikov | Loic Barrault | Alexandre Mourachko | Holger Schwenk
Findings of the Association for Computational Linguistics: ACL 2024
Multilingual parallel data for speech-to-speech translation is scarce and expensive to create from scratch. This is all the more true for expressive speech translation, which aims at preserving not only the semantics, but also the overall prosody (e.g. style, emotion, rate-of-speech). Existing corpora contain speech utterances with the same meaning, yet the overall prosody is typically different, as human annotators are not tasked with reproducing these aspects, and crowd-sourced efforts do not specifically target this kind of alignment as a priority. In this paper, we propose a novel alignment algorithm, which automatically forms pairs of speech segments aligned not only in meaning, but also in expressivity. In order to validate our approach, we train an expressive multilingual speech-to-speech translation system on the automatically aligned data. Our experiments show that in comparison to semantic-only approaches, expressively aligned data yields large improvements in source expressivity preservation (e.g. a 43% uplift in speech rate preservation on average), while still maintaining content translation quality. In some scenarios, results also indicate that this alignment algorithm can outperform standard, semantic-focused approaches even on content translation quality.
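The combined alignment criterion can be pictured with a minimal sketch. Assuming segment-level semantic embeddings (e.g. from a multilingual speech encoder) and a simple prosodic feature such as speech rate are already available, a hypothetical pairing score might blend the two; the feature set, weighting, and function names below are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expressive_score(src_emb, tgt_emb, src_rate, tgt_rate, alpha=0.8):
    """Hypothetical combined score: semantic similarity of two segment
    embeddings, blended with a prosody term rewarding similar speech
    rates (e.g. syllables per second)."""
    semantic = cosine(src_emb, tgt_emb)
    # Prosody term in [0, 1]: 1.0 when the speech rates match exactly.
    prosody = 1.0 - abs(src_rate - tgt_rate) / max(src_rate, tgt_rate)
    return alpha * semantic + (1.0 - alpha) * prosody
```

Candidate pairs scoring highest under such a criterion would be aligned both semantically and expressively, rather than on meaning alone.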
2023
xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
Mingda Chen | Kevin Heffernan | Onur Çelebi | Alexandre Mourachko | Holger Schwenk
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xsim++. In comparison to xsim, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xsim, we show that xsim++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xsim++ also reports performance for different error types, offering more fine-grained feedback for model development.
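To illustrate what such a proxy measures, the sketch below computes a plain xsim-style retrieval error rate: for each source embedding, the nearest target embedding by cosine similarity is retrieved, and mismatches against the gold alignment are counted. In xsim++, the target side would additionally contain the synthetic hard-to-distinguish examples; this is a simplified reconstruction, not the released implementation.

```python
import numpy as np

def xsim_error_rate(src: np.ndarray, tgt: np.ndarray) -> float:
    """Nearest-neighbour retrieval error rate between two embedding
    matrices where row i of src is aligned with row i of tgt; any
    extra rows in tgt act as distractors (e.g. synthetic negatives)."""
    # L2-normalise so that the dot product equals cosine similarity.
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)
    return float((nearest != np.arange(len(src))).mean())
```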
Multilingual Representation Distillation with Contrastive Learning
Weiting Tan | Kevin Heffernan | Holger Schwenk | Philipp Koehn
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Multilingual sentence representations from large models encode semantic information from two or more languages and can be used for different cross-lingual information retrieval and matching tasks. In this paper, we integrate contrastive learning into multilingual representation distillation and use it for quality estimation of parallel sentences (i.e., finding semantically similar sentences that can be used as translations of each other). We validate our approach with multilingual similarity search and corpus filtering tasks. Experiments across different low-resource languages show that our method greatly outperforms previous sentence encoders such as LASER, LASER3, and LaBSE.
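A minimal sketch of the general recipe, under the assumption that distillation pulls student embeddings of non-English sentences towards a frozen teacher's embeddings of their English translations, with the other sentences in the batch acting as negatives (an InfoNCE-style objective); the temperature and shapes below are placeholder choices, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb: torch.Tensor,
                                  teacher_emb: torch.Tensor,
                                  temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: each student embedding should be
    closest to the teacher embedding of its own translation, with all
    other batch items serving as negatives."""
    student = F.normalize(student_emb, dim=-1)
    teacher = F.normalize(teacher_emb, dim=-1)
    logits = student @ teacher.T / temperature  # (batch, batch) similarities
    labels = torch.arange(len(student), device=student.device)
    return F.cross_entropy(logits, labels)
```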
2022
stopes - Modular Machine Translation Pipelines
Pierre Andrews | Guillaume Wenzek | Kevin Heffernan | Onur Çelebi | Anna Sun | Ammar Kamran | Yingzhe Guo | Alexandre Mourachko | Holger Schwenk | Angela Fan
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Neural machine translation, like other natural language deep learning applications, is hungry for data. As research evolves, the data pipelines supporting that research evolve too, oftentimes re-implementing the same core components. Despite the potential of modular codebases, researchers have little time to put code structure and reusability first. Unfortunately, this makes it very hard to publish clean, reproducible code to benefit a wider audience. In this paper, we motivate and describe stopes, a framework that addresses these issues while empowering scalability and versatility for research use cases. This library was a key enabler of the No Language Left Behind project, establishing new state-of-the-art performance for a multilingual machine translation model covering 200 languages. stopes and the pipelines described are released under the MIT license at https://github.com/facebookresearch/stopes.
Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages
Kevin Heffernan | Onur Çelebi | Holger Schwenk
Findings of the Association for Computational Linguistics: EMNLP 2022
Scaling multilingual representation learning beyond the hundred most frequent languages is challenging, in particular to cover the long tail of low-resource languages. We move away from the popular one-for-all multilingual models and focus on training multiple language (family) specific representations, but most prominently enable all languages to still be encoded in the same representational space. We focus on teacher-student training, allowing all encoders to be mutually compatible for bitext mining, and enabling fast learning of new languages. We also combine supervised and self-supervised training, allowing encoders to take advantage of monolingual training data. Our approach significantly outperforms the original LASER encoder. We study very low-resource languages and handle 44 African languages, many of which are not covered by any other model. For these languages, we train sentence encoders and mine bitexts. Adding these mined bitexts yielded an improvement of 3.8 BLEU for NMT into English.
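Mining with such encoders typically scores candidate pairs with a margin criterion rather than raw cosine similarity, normalising each similarity by the average similarity of both sides to their k nearest cross-lingual neighbours (the ratio margin commonly used with LASER-style encoders). The sketch below is a brute-force version for small corpora; large-scale mining would use an approximate nearest-neighbour index such as FAISS instead.

```python
import numpy as np

def margin_scores(x: np.ndarray, y: np.ndarray, k: int = 4) -> np.ndarray:
    """Ratio-margin scores between all rows of x and y, which are
    assumed to be L2-normalised sentence embeddings. score(i, j) is
    cos(x_i, y_j) divided by the mean cosine of each vector to its
    k nearest neighbours on the other side."""
    sim = x @ y.T
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per row of x
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per row of y
    return sim / ((knn_x[:, None] + knn_y[None, :]) / 2)
```

Pairs whose margin score exceeds a tuned threshold are kept as mined bitext.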
Problem-solving Recognition in Scientific Text
Kevin Heffernan | Simone Teufel
Proceedings of the Thirteenth Language Resources and Evaluation Conference
As far back as Aristotle, problems and solutions have been recognised as a core pattern of thought, and in particular of the scientific method. In this work, we present the novel task of problem-solving recognition in scientific text. Previous work on problem-solving either is not computational, is not adapted to scientific text, or has been narrow in scope. This work provides a new annotation scheme of problem-solving tailored to the scientific domain. We validate the scheme with an annotation study, and model the task using state-of-the-art baselines such as a Neural Relational Topic Model. The agreement study indicates that our annotation is reliable, and results from modelling show that problem-solving expressions in text can be recognised to a high degree of accuracy.
2020
Homonym normalisation by word sense clustering: a case in Japanese
Yo Sato | Kevin Heffernan
Proceedings of the 28th International Conference on Computational Linguistics
This work presents a method of word sense clustering that differentiates homonyms and merges homophones, taking Japanese as an example, where orthographical variation causes problems for language processing. It uses contextualised embeddings (BERT) to cluster tokens into distinct sense groups, and we use these groups to normalise synonymous instances to a single representative form. We see the benefit of this normalisation in language modelling, as well as in transliteration.
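A minimal sketch of the clustering step, assuming contextualised embeddings for every occurrence of one ambiguous surface form have already been extracted from BERT; the use of k-means and the fixed number of senses are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_senses(token_embeddings: np.ndarray, n_senses: int = 2) -> np.ndarray:
    """Cluster contextualised token embeddings of one surface form
    into sense groups; tokens in the same cluster are treated as one
    sense and can be normalised to a single representative form."""
    km = KMeans(n_clusters=n_senses, n_init=10, random_state=0)
    return km.fit_predict(token_embeddings)  # cluster id per occurrence

# Usage: embeddings of every occurrence of an ambiguous form, e.g. the
# Japanese reading "hashi" (bridge vs. chopsticks), one row per token.
# labels = cluster_senses(emb_matrix, n_senses=2)
```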
Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect
Yo Sato | Kevin Heffernan
Proceedings of the Twelfth Language Resources and Evaluation Conference
We present in this work a universal, character-based method for representing sentences so that one can calculate the distance between any two sentences. With a small alphabet, it can function as a proxy for phonemes, and as one of its main uses, we carry out dialect clustering: clustering a dialect/sub-language mixed corpus into sub-groups and seeing if they coincide with the conventional boundaries of dialects and sub-languages. By using data with multiple Japanese dialects and multiple Slavic languages, we report how well each group clusters, partially addressing the question of what separates languages from dialects.
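A simplified sketch of the pipeline: treat each sentence as a string over a small alphabet (serving as a proxy for phonemes), compute pairwise normalised edit distances, and cluster the resulting distance matrix. The choice of average-linkage agglomerative clustering here is an assumption for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cluster_sentences(sentences, n_clusters=2):
    """Cluster sentences by normalised pairwise edit distance."""
    n = len(sentences)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = edit_distance(sentences[i], sentences[j])
            dist[i, j] = dist[j, i] = d / max(len(sentences[i]), len(sentences[j]))
    tree = linkage(squareform(dist), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```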
2018
Creating dialect sub-corpora by clustering: a case in Japanese for an adaptive method
Yo Sato | Kevin Heffernan
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)