Yu Bai (白宇) - ACL Anthology

Yu Bai

Also published as: 宇白

2026

Current multi-modal benchmarks primarily focus on facts within individual images. However, they overlook the associative relations among multiple images, which necessitate conducting commonsense reasoning grounded in associated knowledge at different granularities (i.e., image-level and entity-level) as well as the ability to perceive the order of images. Therefore, we propose a multi-image relational association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark, comprising 1,024 samples. To systematically evaluate current LVLMs, we establish a system of associative relations among images that contains 11 subtasks (e.g., UsageSimilarity, SubEvent, etc.) at two granularity levels (i.e., image-level and entity-level), based on relations in ConceptNet. Our experiments reveal that entity-level multi-image perception tasks pose greater challenges for LVLMs than image-level tasks. Moreover, LVLMs perform poorly on spatial-related tasks, indicating limited spatial awareness. Furthermore, we find that LVLMs exhibit weak image order perception capabilities, and we design a method to significantly improve this ability, demonstrating that most current LVLMs do not adequately consider image order perception during pre-training.

2025

pdf bib abs

Logical table-to-text generation (LT2T) seeks to produce logically faithful textual descriptions base on tables. Current end-to-end LT2T models, which use descriptions directly as learning objectives, frequently face challenges in maintaining logical faithfulness due to the lack of a reasoning knowledge. Recent research have introduced reasoning knowledge generated by models for LT2T task, but the noise along with it limited its performance. We therefore propose a framework reasoning knowledge filter that leverages the collaboration between large language models and smaller models to filter data points with high-quality reasoning knowledge. This framework aims to provide highly matched table, description and reasoning knowledge triplets for LT2T. The results obtained on LogicNLG database demonstrate that the efficiencies of the method in this paper has achieved optimal performance with a reduced amount of data. Specifically, it enhances SP-Acc by 1.4 points and NLI-Acc by 0.7 points compared to the current state-of-the-art model.

pdf bib abs

VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts
Yu Bai | Lianji Wang | Xiang Liu | Haifeng Chi | Guiping Zhang
Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025

With the continuous growth of multi-modal data on social media platforms, traditional Named Entity Recognition has rendered insufficient for handling contemporary data formats. Consequently, researchers proposed Multi-modal Named Entity Recognition (MNER). Existing studies focus on capturing the visual regions corresponding to entities to assist in entity recognition. However, these approaches still struggle to mitigate interference from visual regions that are irrelevant to the entities. To address this issue, we propose an innovative framework, Visual Cue Refinement in MNER(VCRMNER) using CLIP Prompts, to accurately capture visual cues (object-level visual regions) associated with entities. We leverage prompts to represent the semantic information of entity categories, which helps us assess visual cues and minimize interference from those irrelevant to the entities. Furthermore, we designed an interaction transformer that operates in two stages—first within each modality and then between modalities—to refine visual cues by learning from a frozen image encoder, thereby reducing differences between text and visual modalities. Comprehensive experiments were conducted on two public datasets, Twitter15 and Twitter17. The results and detailed analyses demonstrate that our method exhibits robust and competitive performance.

2024

Large Language Models (LLMs) demonstrate significant value in domain-specific applications, benefiting from their fundamental capabilities. Nevertheless, it is still unclear which fundamental capabilities contribute to success in specific domains. Moreover, the existing benchmark-based evaluation cannot effectively reflect the performance of real-world applications. In this survey, we review recent advances of LLMs in domain applications, aiming to summarize the fundamental capabilities and their collaboration. Furthermore, we establish connections between fundamental capabilities and specific domains, evaluating the varying importance of different capabilities. Based on our findings, we propose a reliable strategy for domains to choose more robust backbone LLMs for real-world applications.

pdf bib abs

基于交互行为语义模式增强的ID推荐方法(Enhanced ID Recommendation Method Utilizing Semantic Patterns of Interactive Behaviors)
Yuanlai Wang (王远来) | Yu Bai (白宇) | Peng Lian (廉鹏)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“基于ID的推荐是一种依赖用户或物品的唯一标识符进行推荐的经典推荐方法,这种方法经常面临用户物品交互数据稀疏、符号ID缺失语义信息等问题。该文针对上述问题,假设不同领域的用户-物品交互行为之间存在潜在的模式关联,提出了一种基于交互行为语义模式增强的ID推荐方法。该方法在目标域推荐任务中引入辅助域信息,基于图神经网络对辅助域和目标域信息进行联合编码表示,通过引入交互行为语义模式,将辅助域的用户-物品交互信息以及物品描述信息迁移至目标域,从而实现目标域ID推荐中的交互行为语义增强。在8个公开数据集上的实验结果表明,相比目前的SOTA模型,本文方法表现出更好的推荐效果,其Recall@20与NDCG@20分别具有3% ∼ 30%、1% ∼ 40%的提升。”

pdf bib abs

面向小规模大语言模型推理优化的推理路径排序方法(A Reasoning Paths Ranking Method for Reasoning Optimization of Small-scale Large Language Models)
Jun Li (李俊) | Yu Bai (白宇) | Yuting Liu (刘雨婷)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“尽管大语言模型(LLM)在自然语言处理领域取得巨大成功,但是伴随其千亿级参数规模的训练也产生了巨大的计算成本。小规模大语言模型(SLLM)作为低资源场景下实现LLM部署的可替代方案,任务处理能力与LLM尚存在明显差距。尽管上下文学习(ICL)等提示方法在一定程度上提升了SLLM的问题处理能力,但基于人工构建的提示往往需要参与者具备特定的专业领域知识,这给LLM的普适推广带来了挑战。针对以上问题,本文提出了一个基于SLLM的问题推理框架,通过在推理路径生成和答案生成两个阶段之间引入基于逐步语义验证器(SSVRP)的推理路径排序选择机制,在无人干预情况下实现SLLM推理能力提升。实验结果表明,SSVRP有效地增强了SLLM的推理性能,在4个推理任务中的平均准确率分别达到了54.3%,90.6%,64.3%和63.7%,并在其中3个推理任务中都取得了最新的SOTA结果。”

pdf bib abs

Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that a large portion of hidden states within the key-value caches of Transformer models can be discarded (also termed evicted) withoutaffecting the perplexity performance in generating long sequences. However, we show that these methods, despite preserving perplexity performance, often drop information that is important for solving downstream tasks, a problem which we call information neglect. To address this issue, we introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states. In addition, we design a method for chunked sequence processing to further improve efficiency. Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget, while preserving language modeling perplexity. The code and data have been released at https://github.com/ybai-nlp/CItruS.

pdf bib abs

How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment
Heyan Huang | Yinghao Li | Huashan Sun | Yu Bai | Yang Gao
Findings of the Association for Computational Linguistics: EMNLP 2024

Recent studies have demonstrated that In-Context Learning (ICL), through the use of specific demonstrations, can align Large Language Models (LLMs) with human preferences known as In-Context Alignment (ICA), indicating that models can comprehend human instructions without requiring parameter adjustments. However, the exploration of the mechanism and applicability of ICA remains limited. In this paper, we begin by dividing the context text used in ICA into three categories: format, system prompt, and example. Through ablation experiments, we investigate the effectiveness of each part in enabling ICA to function effectively. We then examine how variants in these parts impact the model’s alignment performance. Our findings indicate that the example part is crucial for enhancing the model’s alignment capabilities, with changes in examples significantly affecting alignment performance. We also conduct a comprehensive evaluation of ICA’s zero-shot capabilities in various alignment tasks. The results indicate that compared to parameter fine-tuning methods, ICA demonstrates superior performance in knowledge-based tasks and tool-use tasks. However, it still exhibits certain limitations in areas such as multi-turn dialogues and instruction following. Source codes and scripts are available at https://github.com/li-aolong/how-far-can-ica-go.

2023

pdf bib abs

Non-autoregressive machine translation (NAT) models have lower translation quality than autoregressive translation (AT) models because NAT decoders do not depend on previous target tokens in the decoder input. We propose a novel and general Dependency-Aware Decoder (DePA) to enhance target dependency modeling in the decoder of fully NAT models from two perspectives: decoder self-attention and decoder input. First, we propose an autoregressive forward-backward pre-training phase before NAT training, which enables the NAT decoder to gradually learn bidirectional target dependencies for the final NAT training. Second, we transform the decoder input from the source language representation space to the target language representation space through a novel attentive transformation process, which enables the decoder to better capture target dependencies. DePA can be applied to any fully NAT models. Extensive experiments show that DePA consistently improves highly competitive and state-of-the-art fully NAT models on widely used WMT and IWSLT benchmarks by up to 1.88 BLEU gain, while maintaining the inference latency comparable to other fully NAT models.

2022

pdf bib abs

Few-shot abstractive summarization has become a challenging task in natural language generation. To support it, we developed a novel soft prompts architecture coupled with a prompt pre-training plus prompt fine-tuning paradigm, which is effective and tunes only extremely light parameters. To meet the structure of the generation models, the soft prompts comprise continuous input embeddings across an encoder and a decoder. Importantly, a new inner-prompt placed in the text is introduced to capture document-level information. The aim is to devote attention to understanding the document that better prompts the model to generate document-related content. In the training process, the prompt pre-training with self-supervised pseudo-data firstly teaches the model basic summarizing capability. Then, with few-shot examples, only the designed lightweight soft prompts are fine-tuned. Experimental results on the CNN/DailyMail and XSum datasets show that our method, with only 0.1% of the parameters, outperforms full-model tuning where all model parameters are tuned. It also surpasses Prompt Tuning by a large margin and delivers competitive results against Prefix-Tuning with 3% of the parameters.

pdf bib abs

Conformal Predictor for Improving Zero-Shot Text Classification Efficiency
Prafulla Kumar Choubey | Yu Bai | Chien-Sheng Wu | Wenhao Liu | Nazneen Rajani
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models (PLMs) have been shown effective for zero-shot (0shot) text classification. 0shot models based on natural language inference (NLI) and next sentence prediction (NSP) employ cross-encoder architecture and infer by making a forward pass through the model for each label-text pair separately. This increases the computational cost to make inferences linearly in the number of labels. In this work, we improve the efficiency of such cross-encoder-based 0shot models by restricting the number of likely labels using another fast base classifier-based conformal predictor (CP) calibrated on samples labeled by the 0shot model. Since a CP generates prediction sets with coverage guarantees, it reduces the number of target labels without excluding the most probable label based on the 0shot model. We experiment with three intent and two topic classification datasets. With a suitable CP for each dataset, we reduce the average inference time for NLI- and NSP-based models by 25.6% and 22.2% respectively, without dropping performance below the predefined error rate of 1%.

2021

pdf bib abs

Cross-Lingual Abstractive Summarization with Limited Parallel Resources
Yu Bai | Yang Gao | Heyan Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Parallel cross-lingual summarization data is scarce, requiring models to better use the limited available cross-lingual resources. Existing methods to do so often adopt sequence-to-sequence networks with multi-task frameworks. Such approaches apply multiple decoders, each of which is utilized for a specific task. However, these independent decoders share no parameters, hence fail to capture the relationships between the discrete phrases of summaries in different languages, breaking the connections in order to transfer the knowledge of the high-resource languages to low-resource languages. To bridge these connections, we propose a novel Multi-Task framework for Cross-Lingual Abstractive Summarization (MCLAS) in a low-resource setting. Employing one unified decoder to generate the sequential concatenation of monolingual and cross-lingual summaries, MCLAS makes the monolingual summarization task a prerequisite of the CLS task. In this way, the shared decoder learns interactions involving alignments and summary patterns across languages, which encourages attaining knowledge transfer. Experiments on two CLS datasets demonstrate that our model significantly outperforms three baseline models in both low-resource and full-dataset scenarios. Moreover, in-depth analysis on the generated summaries and attention heads verifies that interactions are learned well using MCLAS, which benefits the CLS task under limited parallel resources.