Pengyu Ji
2024
Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale
Xiang Hu | Pengyu Ji | Qingyang Zhu | Wei Wu | Kewei Tu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A syntactic language model (SLM) incrementally generates a sentence with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs, such as relying on gold trees and sequential training. It consists of two components: a standard SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with billions of tokens, and demonstrate the superiority of GPST over GPT-2 of comparable size on numerous tasks covering both language understanding and language generation. GPST also significantly outperforms existing unsupervised SLMs on left-to-right grammar induction, while training substantially faster.
2023
Improving Span Representation by Efficient Span-Level Attention
Pengyu Ji | Songlin Yang | Kewei Tu
Findings of the Association for Computational Linguistics: EMNLP 2023
High-quality span representations are crucial to natural language processing tasks involving span prediction and classification. Most existing methods derive a span representation by aggregation of token representations within the span. In contrast, we aim to improve span representations by considering span-span interactions as well as more comprehensive span-token interactions. Specifically, we introduce layers of span-level attention on top of a normal token-level transformer encoder. Given that attention between all span pairs results in O(n^4) complexity (n being the sentence length) and not all span interactions are intuitively meaningful, we restrict the range of spans that a given span could attend to, thereby reducing overall complexity to O(n^3). We conduct experiments on various span-related tasks and show superior performance of our model surpassing baseline models. Our code is publicly available at https://github.com/jipy0222/Span-Level-Attention.
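The complexity argument in this abstract can be checked by counting. A sentence of length n has O(n^2) spans, so letting every span attend to every other span gives O(n^4) pairs, while limiting each span to O(n) targets gives O(n^3). The sketch below is only an illustration: the restriction it uses (a span attends only to spans sharing its start or its end position) is an assumption chosen for the example, not necessarily the pattern used in the paper, and the function names are hypothetical.

```python
def all_spans(n):
    """Enumerate all spans (i, j) with 0 <= i <= j < n, endpoints inclusive."""
    return [(i, j) for i in range(n) for j in range(i, n)]


def full_pair_count(n):
    """Number of (query span, key span) pairs when every span attends to every span."""
    num_spans = len(all_spans(n))  # n * (n + 1) / 2, i.e. O(n^2)
    return num_spans * num_spans   # O(n^4)


def restricted_pair_count(n):
    """Number of pairs under an ASSUMED restriction: a span attends only to
    spans that share its start position or its end position."""
    total = 0
    for i, j in all_spans(n):
        same_start = n - i  # spans (i, k) for k in i..n-1
        same_end = j + 1    # spans (k, j) for k in 0..j
        # The span (i, j) itself appears in both groups, so subtract it once.
        total += same_start + same_end - 1
    return total            # O(n) targets per span, O(n^3) overall


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, full_pair_count(n), restricted_pair_count(n))
```

Doubling n multiplies the full count by roughly 16 and the restricted count by roughly 8, which matches the quartic versus cubic growth described in the abstract.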