Tatiana Likhomanenko
2024
Generating Gender Alternatives in Machine Translation
Sarthak Garg | Mozhdeh Gheini | Clara Emmanuel | Tatiana Likhomanenko | Qin Gao | Matthias Paulik
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Machine translation (MT) systems often translate terms with ambiguous gender (e.g., English term “the nurse”) into the gendered form that is most prevalent in the systems’ training data (e.g., “enfermera”, the Spanish term for a female nurse). This often reflects and perpetuates harmful stereotypes present in society. With a view to MT user interfaces that allow for resolving gender ambiguity in a frictionless manner, we study the problem of generating all grammatically correct gendered translation alternatives. We open-source training and test datasets for five language pairs and establish benchmarks for this task. Our key technical contribution is a novel semi-supervised solution for generating alternatives that integrates seamlessly with standard MT models and maintains high performance without requiring additional components or increasing inference overhead.
2023
Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data
Mozhdeh Gheini | Tatiana Likhomanenko | Matthias Sperber | Hendra Setiawan
Findings of the Association for Computational Linguistics: ACL 2023
Self-training has been shown to be helpful in addressing data scarcity for many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unlabeled data and adds it to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed setup: joint transcription and translation of speech, which suffers from an absence of sufficient parallel data resources. We show that under such data-deficient circumstances, the unlabeled data can significantly vary in domain from the supervised data, which results in pseudo-label quality degradation. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that pseudo-label analysis and processing in this way results in additional gains on top of the vanilla pseudo-labeling setup, providing a total improvement of up to 0.4% absolute WER and 2.1 BLEU points for En–De and 0.6% absolute WER and 2.2 BLEU points for En–Zh.
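The generic pseudo-labeling loop with a filtering step, as described in the abstract above, might look roughly like the sketch below. This is an illustration only, not the authors' method: the abstract does not state which filtering criterion is used, so the score threshold here is an assumption, and `PseudoLabel`, `pseudo_label`, `filter_by_score`, `extend_training_pool`, and `toy_predict` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PseudoLabel:
    source: str        # identifier of the unlabeled input (e.g. an audio clip)
    hypothesis: str    # the model's predicted output for that input
    score: float       # assumed confidence score (higher means more confident)


def pseudo_label(
    unlabeled: List[str],
    predict: Callable[[str], Tuple[str, float]],
) -> List[PseudoLabel]:
    """Run the current model on every unlabeled input."""
    return [PseudoLabel(x, *predict(x)) for x in unlabeled]


def filter_by_score(labels: List[PseudoLabel], threshold: float) -> List[PseudoLabel]:
    """Keep only pseudo-labels whose score reaches the threshold (assumed criterion)."""
    return [p for p in labels if p.score >= threshold]


def extend_training_pool(
    labeled: List[Tuple[str, str]],
    kept: List[PseudoLabel],
) -> List[Tuple[str, str]]:
    """Add the surviving pseudo-labeled pairs to the supervised training pool."""
    return labeled + [(p.source, p.hypothesis) for p in kept]


def toy_predict(x: str) -> Tuple[str, float]:
    """Stand-in for a real model: returns a fake hypothesis and a fixed score."""
    scores = {"audio_2": -0.2, "audio_3": -1.5}
    return f"hypothesis for {x}", scores[x]


labeled = [("audio_1", "hello world")]
unlabeled = ["audio_2", "audio_3"]

candidates = pseudo_label(unlabeled, toy_predict)
kept = filter_by_score(candidates, threshold=-0.5)
pool = extend_training_pool(labeled, kept)
```

In a real self-training setup the model would then be retrained on `pool` and the loop repeated; the threshold controls the trade-off between adding more data and admitting lower-quality pseudo-labels.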
Co-authors
- Mozhdeh Gheini 2
- Clara Emmanuel 1
- Qin Gao 1
- Sarthak Garg 1
- Matthias Paulik 1