Author stylized rewriting is the task of rewriting an input text in a particular author’s style. Recent works in this area have leveraged Transformer-based language models in a denoising autoencoder setup to generate author stylized text without relying on a parallel corpus of data. However, these approaches are limited by the lack of explicit control of target attributes and being entirely data-driven. In this paper, we propose a Director-Generator framework to rewrite content in the target author’s style, specifically focusing on certain target attributes. We show that our proposed framework works well even with a limited-sized target author corpus. Our experiments on corpora consisting of relatively small-sized text authored by three distinct authors show significant improvements upon existing works to rewrite input texts in target author’s style. Our quantitative and qualitative analyses further show that our model has better meaning retention and results in more fluent generations.
While recent advances in language modeling has resulted in powerful generation models, their generation style remains implicitly dependent on the training data and can not emulate a specific target style. Leveraging the generative capabilities of a transformer-based language models, we present an approach to induce certain target-author attributes by incorporating continuous multi-dimensional lexical preferences of an author into generative language models. We introduce rewarding strategies in a reinforcement learning framework that encourages the use of words across multiple categorical dimensions, to varying extents. Our experiments demonstrate that the proposed approach can generate text that distinctively aligns with a given target author’s lexical style. We conduct quantitative and qualitative comparisons with competitive and relevant baselines to illustrate the benefits of the proposed approach.
In this work, we describe our system for WNUT-2020 shared task on the identification of informative COVID-19 English tweets. Our system is an ensemble of various machine learning methods, leveraging both traditional feature-based classifiers as well as recent advances in pre-trained language models that help in capturing the syntactic, semantic, and contextual features from the tweets. We further employ pseudo-labelling to incorporate the unlabelled Twitter data released on the pandemic. Our best performing model achieves an F1-score of 0.9179 on the provided validation set and 0.8805 on the blind test-set.