Bowen Chen


2024

pdf bib
A Multi-Perspective Analysis of Memorization in Large Language Models
Bowen Chen | Namgi Han | Yusuke Miyao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) can generate the same sequences contained in the pre-train corpora, known as memorization.Previous research studied it at a macro level, leaving micro yet important questions under-explored, e.g., what makes sentences memorized, the dynamics when generating memorized sequence, its connection to unmemorized sequence, and its predictability.We answer the above questions by analyzing the relationship of memorization with outputs from LLM, namely, embeddings, probability distributions, and generated tokens.A memorization score is calculated as the overlap between generated tokens and actual continuations when the LLM is prompted with a context sequence from the pre-train corpora.Our findings reveal:(1) The inter-correlation between memorized/unmemorized sentences, model size, continuation size, and context size, as well as the transition dynamics between sentences of different memorization scores,(2) A sudden drop and increase in the frequency of input tokens when generating memorized/unmemorized sequences (boundary effect),(3) Cluster of sentences with different memorization scores in the embedding space,(4) An inverse boundary effect in the entropy of probability distributions for generated memorized/unmemorized sequences,(5) The predictability of memorization is related to model size and continuation length. In addition, we show a Transformer model trained by the hidden states of LLM can predict unmemorized tokens.

2022

pdf bib
Syntactic and Semantic Uniformity for Semantic Parsing and Task-Oriented Dialogue Systems
Bowen Chen | Yusuke Miyao
Findings of the Association for Computational Linguistics: EMNLP 2022

This paper proposes a data representation framework for semantic parsing and task-oriented dialogue systems, aiming to achieve a uniform representation for syntactically and semantically diverse machine-readable formats.Current NLP systems heavily rely on adapting pre-trained language models to specific tasks, and this approach has been proven effective for modeling natural language texts.However, little attention has been paid to the representation of machine-readable formats, such as database queries and dialogue states.We present a method for converting original machine-readable formats of semantic parsing and task-oriented dialogue datasets into a syntactically and semantically uniform representation.We define a meta grammar for syntactically uniform representations and translate semantically equivalent functions into a uniform vocabulary.Empirical experiments on 13 datasets show that accuracy consistently improves over original formats, revealing the advantage of the proposed representation.Additionally, we show that the proposed representation allows for transfer learning across datasets.

pdf bib
CogBERT: Cognition-Guided Pre-trained Language Models
Xiao Ding | Bowen Chen | Li Du | Bing Qin | Ting Liu
Proceedings of the 29th International Conference on Computational Linguistics

We study the problem of integrating cognitive language processing signals (e.g., eye-tracking or EEG data) into pre-trained language models like BERT. Existing methods typically fine-tune pre-trained models on cognitive data, ignoring the semantic gap between the texts and cognitive signals. To fill the gap, we propose CogBERT, a framework that can induce fine-grained cognitive features from cognitive data and incorporate cognitive features into BERT by adaptively adjusting the weight of cognitive features for different NLP tasks. Extensive experiments show that: (1) Cognition-guided pre-trained models can consistently perform better than basic pre-trained models on ten NLP tasks. (2) Different cognitive features contribute differently to different NLP tasks. Based on this observation, we give a fine-grained explanation of why cognitive data is helpful for NLP. (3) Different transformer layers of pre-trained models should encode different cognitive features, with word-level cognitive features at the bottom and semantic-level cognitive features at the top. (4) Attention visualization demonstrates that CogBERT aligns with human gaze patterns and improves its natural language comprehension ability.