Ting-Rui Chiang

2026

Pelican Soup Framework: A Theoretical Framework for Language Model Capabilities
Ting-Rui Chiang | Dani Yogatama
Findings of the Association for Computational Linguistics: EACL 2026

In this work, we propose a simple theoretical framework, Pelican Soup, aiming to better understand how pretraining allows LLMs to (1) generalize to unseen instructions and (2) perform in-context learning, even when the verbalizers are irrelevant to the task. To this end, in our framework, we introduce the notion of "knowledge base" and "reference-sense association" and a simple formalism for natural language processing tasks. Our framework demonstrates how linguistic, psychology, and philosophy studies can inform our understanding of the language model and is connected to several other existing theoretical results. As an illustration of the usage of our framework, we derive a bound on in-context learning loss with our framework. Finally, we support our framework with empirical experiments and provide possible future research directions.

2025

The Rotary Position Embedding May Cause Dimension Inefficiency in Attention Heads for Long-Distance Retrieval
Ting-Rui Chiang | Dani Yogatama
Findings of the Association for Computational Linguistics: ACL 2025

The Rotary Position Embedding (RoPE) is widely used in the attention heads of many large language models (LLM). It rotates dimensions in the query and the key vectors by different angles according to their positions in the input sequence. For long context modeling, the range of positions may vary a lot, and thus RoPE rotates some dimensions by a great range of angles. We hypothesize that the wide range of rotation angles may prevent LLMs from utilizing those dimensions. To validate this hypothesis, we present a controlled experiment showing that applying RoPE causes low utility of certain dimensions. Our analyses on three LLMs also indicate that these dimensions do not help LLMs do long-context question answering.

2024

On Retrieval Augmentation and the Limitations of Language Model Training
Ting-Rui Chiang | Xinyan Yu | Joshua Robinson | Ollie Liu | Isabelle Lee | Dani Yogatama
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Augmenting a language model (LM) with k-nearest neighbors (kNN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited possibility: the “softmax bottleneck.” We then create a new dataset to evaluate LM generalization ability in the setting where training data contains additional information that is not causally relevant. This task is challenging even for GPT-3.5 Turbo. We show that, for both GPT-2 and Mistral 7B, kNN retrieval augmentation consistently improves performance in this setting. Finally, to make kNN retrieval more accessible, we propose using a multi-layer perceptron model that maps datastore keys to values as a drop-in replacement for traditional retrieval. This reduces storage costs by over 25x.

2023

On a Benefit of Masked Language Model Pretraining: Robustness to Simplicity Bias
Ting-Rui Chiang
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The Distributional Hypothesis Does Not Fully Explain the Benefits of Masked Language Model Pretraining
Ting-Rui Chiang | Dani Yogatama
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We analyze the masked language modeling pretraining objective function from the perspective of the Distributional Hypothesis. We investigate whether the better sample efficiency and the better generalization capability of models pretrained with masked language modeling can be attributed to the semantic similarity encoded in the pretraining data’s distributional property. Via a synthetic dataset, our analysis suggests that distributional property indeed leads to the better sample efficiency of pretrained masked language models, but does not fully explain the generalization capability. We also conduct an analysis over two real-world datasets and demonstrate that the distributional property does not explain the generalization ability of pretrained natural language models either. Our results illustrate our limited understanding of model pretraining and provide future research directions.

2022

DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit
Jessica Huynh | Ting-Rui Chiang | Jeffrey Bigham | Maxine Eskenazi
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.

Breaking Down Multilingual Machine Translation
Ting-Rui Chiang | Yi-Pei Chen | Yi-Ting Yeh | Graham Neubig
Findings of the Association for Computational Linguistics: ACL 2022

While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model with different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRL outperform the best results reported by Aharoni et al. (2019).

2021

Relating Neural Text Degeneration to Exposure Bias
Ting-Rui Chiang | Yun-Nung Chen
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

This work focuses on relating two mysteries in neural-based text generation: exposure bias, and text degeneration. Despite the long time since exposure bias was mentioned and the numerous studies for its remedy, to our knowledge, its impact on text generation has not yet been verified. Text degeneration is a problem that the widely-used pre-trained language model GPT-2 was recently found to suffer from (Holtzman et al., 2020). Motivated by the unknown causation of the text degeneration, in this paper we attempt to relate these two mysteries. Specifically, we first qualitatively and quantitatively identify mistakes made before text degeneration occurs. Then we investigate the significance of the mistakes by inspecting the hidden states in GPT-2. Our results show that text degeneration is likely to be partly caused by exposure bias. We also study the self-reinforcing mechanism of text degeneration, explaining why the mistakes amplify. In sum, our study provides a more concrete foundation for further investigation on exposure bias and text degeneration problems.

Improving Dialogue State Tracking by Joint Slot Modeling
Ting-Rui Chiang | Yi-Ting Yeh
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Dialogue state tracking models play an important role in a task-oriented dialogue system. However, most of them model the slot types conditionally independently given the input. We discover that this may cause the model to be confused by slot types that share the same data type. To mitigate this issue, we propose TripPy-MRF and TripPy-LSTM, which model the slots jointly. Our results show that they are able to alleviate the confusion mentioned above, and they push the state-of-the-art on the MultiWoz 2.1 dataset from 58.7 to 61.3.

2019

Semantically-Aligned Equation Generation for Solving and Reasoning Math Word Problems
Ting-Rui Chiang | Yun-Nung Chen
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Solving math word problems is a challenging task that requires accurate natural language understanding to bridge natural language texts and math expressions. Motivated by the intuition about how humans generate equations given the problem texts, this paper presents a neural approach to automatically solve math word problems by operating symbols according to their semantic meanings in texts. This paper views the process of generating equations as a bridge between the semantic world and the symbolic world, where the proposed neural math solver is based on an encoder-decoder framework. In the proposed model, the encoder is designed to understand the semantics of problems, and the decoder focuses on tracking semantic meanings of the generated symbols and then deciding which symbol to generate next. The preliminary experiments are conducted on the Math23K dataset, and our model significantly outperforms both the state-of-the-art single model and the best non-retrieval-based model by about 10% in accuracy, demonstrating the effectiveness of bridging the symbolic and semantic worlds from math word problems.
