Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2

Richard He Bai; Peng Shi; Jimmy Lin; Luchen Tan; Kun Xiong; Wen Gao; Jie Liu; Ming Li

doi:10.18653/v1/2021.acl-srw.16

Abstract

The semantics of a text is manifested not only by what is read but also by what is not read. In this article, we study how implicit “not read” information, such as end-of-paragraph (EOP) and end-of-sequence (EOS) tokens, affects the quality of text generation. Specifically, we find that the pre-trained language model GPT2 can generate better continuations by learning to generate the EOP in the fine-tuning stage. Experimental results on English story generation show that EOP can lead to higher BLEU scores and lower perplexity. We also conduct experiments on a self-collected Chinese essay dataset with Chinese-GPT2, a character-level LM without EOP and EOS during pre-training. Experimental results show that the Chinese GPT2 can generate better essay endings with EOP.
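As a rough illustration of what making end-of-paragraph and end-of-sequence boundaries explicit in fine-tuning text could look like, the sketch below appends a marker after every paragraph and one at the end of the document. The marker strings, the blank-line paragraph convention, and the function name `mark_boundaries` are assumptions made for this example; they are not taken from the paper's preprocessing.

```python
EOP = "<eop>"  # hypothetical end-of-paragraph marker string (assumption)
EOS = "<eos>"  # hypothetical end-of-sequence marker string (assumption)


def mark_boundaries(text: str, eop: str = EOP, eos: str = EOS) -> str:
    """Append `eop` after every paragraph and `eos` at the end of the text."""
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    marked = [f"{p} {eop}" for p in paragraphs]
    return " ".join(marked) + f" {eos}"


if __name__ == "__main__":
    story = "Once upon a time.\n\nThe end."
    print(mark_boundaries(story))
```

In an actual fine-tuning setup the markers would also have to be registered with the model's tokenizer so that each one maps to a single token; that step depends on the toolkit and is left out here.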

Anthology ID: 2021.acl-srw.16
Volume: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Month: August
Year: 2021
Address: Online
Editors: Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
Venues: ACL | IJCNLP
Publisher: Association for Computational Linguistics
Pages: 148–162
URL: https://aclanthology.org/2021.acl-srw.16/
DOI: 10.18653/v1/2021.acl-srw.16
Cite (ACL): He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Jie Liu, and Ming Li. 2021. Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 148–162, Online. Association for Computational Linguistics.
Cite (Informal): Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2 (Bai et al., ACL-IJCNLP 2021)
PDF: https://aclanthology.org/2021.acl-srw.16.pdf
Optional supplementary material: 2021.acl-srw.16.OptionalSupplementaryMaterial.zip
Video: https://aclanthology.org/2021.acl-srw.16.mp4
