Formality Style Transfer for Noisy, User-generated Conversations: Extracting Labeled, Parallel Data from Unlabeled Corpora

Isak Czeresnia Etinger; Alan W. Black

doi:10.18653/v1/D19-5502

Abstract

Typical datasets used for style transfer in NLP contain aligned pairs of two opposite extremes of a style. As each existing dataset is sourced from a specific domain and context, most use cases will have a sizable mismatch with the vocabulary and sentence structures of any available dataset. This mismatch reduces style transfer performance and is particularly significant for noisy, user-generated text. To solve this problem, we show a technique to derive a dataset of aligned pairs (style-agnostic vs. stylistic sentences) from an unlabeled corpus by using an auxiliary dataset, allowing for in-domain training. We test the technique with the Yahoo Formality Dataset and 6 novel datasets we produced, which consist of scripts from 5 popular TV shows (Friends, Futurama, Seinfeld, Southpark, Stargate SG-1) and the Slate Star Codex online forum. We gather 1080 human evaluations, which show that our method produces a sizable change in formality while maintaining fluency and context, and that it considerably outperforms OpenNMT’s Seq2Seq model trained directly on the Yahoo Formality Dataset. Additionally, we publish the full pipeline code and our novel datasets.

Anthology ID: D19-5502
Volume: Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue: WNUT
Publisher: Association for Computational Linguistics
Pages: 11–16
URL: https://aclanthology.org/D19-5502/
DOI: 10.18653/v1/D19-5502
Cite (ACL): Isak Czeresnia Etinger and Alan W Black. 2019. Formality Style Transfer for Noisy, User-generated Conversations: Extracting Labeled, Parallel Data from Unlabeled Corpora. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 11–16, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal): Formality Style Transfer for Noisy, User-generated Conversations: Extracting Labeled, Parallel Data from Unlabeled Corpora (Czeresnia Etinger & Black, WNUT 2019)
PDF: https://aclanthology.org/D19-5502.pdf
Attachment: D19-5502.Attachment.zip
