Yang Bai - ACL Anthology

Yang Bai

2026

Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds—crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio), by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

2025

RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering
Yang Bai | Christan Grant | Daisy Zhe Wang
Findings of the Association for Computational Linguistics: NAACL 2025

Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Data and code will be made public once the paper is accepted.

2024

M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval
Yang Bai | Anthony Colas | Christan Grant | Zhe Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal retrieval performance. On the other hand, despite many retrieval datasets supporting various learning objectives beyond contrastive learning, combining them efficiently in multi-task learning scenarios can be challenging. In this paper, we introduce M3, an advanced recursive Multi-hop dense sentence retrieval system built upon a novel Multi-task Mixed-objective approach for dense text representation learning, addressing the aforementioned challenges. Our approach yields state-of-the-art performance on a large-scale open-domain fact verification benchmark dataset, FEVER.

2023

ATFormer: A Learned Performance Model with Transfer Learning Across Devices for Deep Learning Tensor Programs
Yang Bai | Wenqian Zhao | Shuo Yin | Zixiao Wang | Bei Yu
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The training and inference efficiency of ever-larger deep neural networks highly rely on the performance of tensor operators on specific hardware platforms. Therefore, a compilation-based optimization flow with automatic tensor generation and parameter tuning is necessary for efficient model deployment. While compilation-based methods with performance models can provide dynamic and suitable code optimization, they suffer from a large design space exploration with rough measurement accuracy and poor transferability among different hardware platforms. This paper presents ATFormer, a simple yet efficient design with attention-inspired modules to accurately predict the performance of optimized operators by capturing global and long-range dependencies within a complete scheduling space. Compared with state-of-the-arts, ATFormer can predict the optimal implementation of tensor operators to reduce inference time with minimal effort on modern DNN benchmarks. Furthermore, ATFormer with pre-trained parameters can quickly adapt to different workloads and hardware via transfer learning.

An Empirical Study of Frame Selection for Text-to-Video Retrieval
Mengxia Wu | Min Cao | Yang Bai | Ziyin Zeng | Chen Chen | Liqiang Nie | Min Zhang
Findings of the Association for Computational Linguistics: EMNLP 2023

Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of the video challenges the performance and efficiency of TVR. To handle the serialized video contexts, existing methods typically select a subset of frames within a video to represent the video content for TVR. How to select the most representative frames is a crucial issue, whereby the selected frames are required to not only retain the semantic information of the video but also promote retrieval efficiency by excluding temporally redundant frames. In this paper, we make the first empirical study of frame selection for TVR. We systemically classify existing frame selection methods into text-free and text-guided ones, under which we detailedly analyze six different frame selections in terms of effectiveness and efficiency. Among them, two frame selections are first developed in this paper. According to the comprehensive analysis on multiple TVR benchmarks, we empirically conclude that the TVR with proper frame selections can significantly improve the retrieval efficiency without sacrificing the retrieval performance.

2021

HUB@DravidianLangTech-EACL2021: Identify and Classify Offensive Text in Multilingual Code Mixing in Social Media
Bo Huang | Yang Bai
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This paper introduces the system description of the HUB team participating in DravidianLangTech - EACL2021: Offensive Language Identification in Dravidian Languages. The theme of this shared task is the detection of offensive content in social media. Among the known tasks related to offensive speech detection, this is the first task to detect offensive comments posted in social media comments in the Dravidian language. The task organizer team provided us with the code-mixing task data set mainly composed of three different languages: Malayalam, Kannada, and Tamil. The tasks on the code mixed data in these three different languages can be seen as three different comment/post-level classification tasks. The task on the Malayalam data set is a five-category classification task, and the Kannada and Tamil language data sets are two six-category classification tasks. Based on our analysis of the task description and task data set, we chose to use the multilingual BERT model to complete this task. In this paper, we will discuss our fine-tuning methods, models, experiments, and results.

HUB@DravidianLangTech-EACL2021: Meme Classification for Tamil Text-Image Fusion
Bo Huang | Yang Bai
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

This article describes our system for task DravidianLangTech - EACL2021: Meme classification for Tamil. In recent years, we have witnessed the rapid development of the Internet and social media. Compared with traditional TV and radio media platforms, there are not so many restrictions on the use of online social media for individuals and many functions of online social media platforms are free. Based on this feature of social media, it is difficult for people’s posts/comments on social media to be strictly and effectively controlled like TV and radio content. Therefore, the detection of negative information in social media has attracted attention from academic and industrial fields in recent years. The task of classifying memes is also driven by offensive posts/comments prevalent on social media. The data of the meme classification task is the fusion data of text and image information. To identify the content expressed by the meme, we develop a system that combines BiGRU and CNN. It can fuse visual features and text features to achieve the purpose of using multi-modal information from memetic data. In this article, we discuss our methods, models, experiments, and results.

TEAM HUB@LT-EDI-EACL2021: Hope Speech Detection Based On Pre-trained Language Model
Bo Huang | Yang Bai
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

This article introduces the system description of TEAM_HUB team participating in LT-EDI 2021: Hope Speech Detection. This shared task is the first task related to the desired voice detection. The data set in the shared task consists of three different languages (English, Tamil, and Malayalam). The task type is text classification. Based on the analysis and understanding of the task description and data set, we designed a system based on a pre-trained language model to complete this shared task. In this system, we use methods and models that combine the XLM-RoBERTa pre-trained language model and the Tf-Idf algorithm. In the final result ranking announced by the task organizer, our system obtained F1 scores of 0.93, 0.84, 0.59 on the English dataset, Malayalam dataset, and Tamil dataset. Our submission results are ranked 1, 2, and 3 respectively.

hub at SemEval-2021 Task 1: Fusion of Sentence and Word Frequency to Predict Lexical Complexity
Bo Huang | Yang Bai | Xiaobing Zhou
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this paper, we propose a method of fusing sentence information and word frequency information for the SemEval 2021 Task 1-Lexical Complexity Prediction (LCP) shared task. In our system, the sentence information comes from the RoBERTa model, and the word frequency information comes from the Tf-Idf algorithm. Use Inception block as a shared layer to learn sentence and word frequency information We described the implementation of our best system and discussed our methods and experiments in the task. The shared task is divided into two sub-tasks. The goal of the two sub-tasks is to predict the complexity of a predetermined word. The shared task is divided into two subtasks. The goal of the two subtasks is to predict the complexity of a predetermined word. The evaluation index of the task is the Pearson correlation coefficient. Our best performance system has Pearson correlation coefficients of 0.7434 and 0.8000 in the single-token subtask test set and the multi-token subtask test set, respectively.

hub at SemEval-2021 Task 2: Word Meaning Similarity Prediction Model Based on RoBERTa and Word Frequency
Bo Huang | Yang Bai | Xiaobing Zhou
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper introduces the system description of the hub team, which explains the related work and experimental results of our team’s participation in SemEval 2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC). The data of this shared task is mainly some cross-language or multi-language sentence pair corpus. The languages covered in the corpus include English, Chinese, French, Russian, and Arabic. The task goal is to judge whether the same words in these sentence pairs have the same meaning in the sentence. This can be seen as a task of binary classification of sentence pairs. What we need to do is to use our method to determine as accurately as possible the meaning of the words in a sentence pair are the same or different. The model used by our team is mainly composed of RoBERTa and Tf-Idf algorithms. The result evaluation index of task submission is the F1 score. We only participated in the English language task. The final score of the test set prediction results submitted by our team was 84.60.

hub at SemEval-2021 Task 5: Toxic Span Detection Based on Word-Level Classification
Bo Huang | Yang Bai | Xiaobing Zhou
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This article introduces the system description of the hub team, which explains the related work and experimental results of our team’s participation in SemEval 2021 Task 5: Toxic Spans Detection. The data for this shared task comes from some posts on the Internet. The task goal is to identify the toxic content contained in these text data. We need to find the span of the toxic text in the text data as accurately as possible. In the same post, the toxic text may be one paragraph or multiple paragraphs. Our team uses a classification scheme based on word-level to accomplish this task. The system we used to submit the results is ALBERT+BILSTM+CRF. The result evaluation index of the task submission is the F1 score, and the final score of the prediction result of the test set submitted by our team is 0.6640226029.

hub at SemEval-2021 Task 7: Fusion of ALBERT and Word Frequency Information Detecting and Rating Humor and Offense
Bo Huang | Yang Bai
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper introduces the system description of the hub team, which explains the related work and experimental results of our team’s participation in SemEval 2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. We successfully submitted the test set prediction results of the two subtasks in the task. The goal of the task is to perform humor detection, grade evaluation, and offensive evaluation on each English text data in the data set. Tasks can be divided into two types of subtasks. One is a text classification task, and the other is a text regression task. What we need to do is to use our method to detect the humor and offensive information of the sentence as accurately as possible. The methods used in the results submitted by our team are mainly composed of ALBERT, CNN, and Tf-Idf algorithms. The result evaluation indicators submitted by the classification task are F1 score and Accuracy. The result evaluation index of the regression task submission is the RMSE. The final scores of the prediction results of the two subtask test sets submitted by our team are task1a 0.921 (F1), task1a 0.9364 (Accuracy), task1b 0.6288 (RMSE), task1c 0.5333 (F1), task1c 0.0.5591 (Accuracy), and task2 0.5027 (RMSE) respectively.

2020

Bridge the Gap: High-level Semantic Planning for Image Captioning
Chenxi Yuan | Yang Bai | Chun Yuan
Proceedings of the 28th International Conference on Computational Linguistics

Recent image captioning models have made much progress for exploring the multi-modal interaction, such as attention mechanisms. Though these mechanisms can boost the interaction, there are still two gaps between the visual and language domains: (1) the gap between the visual features and textual semantics, (2) the gap between the disordering of visual features and the ordering of texts. To bridge the gaps we propose a high-level semantic planning (HSP) mechanism that incorporates both a semantic reconstruction and an explicit order planning. We integrate the planning mechanism to the attention based caption model and propose the High-level Semantic PLanning based Attention Network (HS-PLAN). First, an attention based reconstruction module is designed to reconstruct the visual features with high-level semantic information. Then we apply a pointer network to serialize the features and obtain the explicit order plan to guide the generation. Experiments conducted on MS COCO show that our model outperforms previous methods and achieves the state-of-the-art performance of 133.4% CIDEr-D score.

DEEPYANG at SemEval-2020 Task 4: Using the Hidden Layer State of BERT Model for Differentiating Common Sense
Yang Bai | Xiaobing Zhou
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Introducing common sense to natural language understanding systems has received increasing research attention. To facilitate the researches on common sense reasoning, the SemEval-2020 Task 4 Commonsense Validation and Explanation(ComVE) is proposed. We participate in sub-task A and try various methods including traditional machine learning methods, deep learning methods, and also recent pre-trained language models. Finally, we concatenate the original output of BERT and the output vector of BERT hidden layer state to obtain more abundant semantic information features, and obtain competitive results. Our model achieves an accuracy of 0.8510 in the final test data and ranks 25th among all the teams.

BYteam at SemEval-2020 Task 5: Detecting Counterfactual Statements with BERT and Ensembles
Yang Bai | Xiaobing Zhou
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We participate in the classification tasks of SemEval-2020 Task: Subtask1: Detecting counterfactual statements of semeval-2020 task5(Detecting Counterfactuals). This paper examines different approaches and models towards detecting counterfactual statements classification. We choose the Bert model. However, the output of Bert is not a good summary of semantic information, so in order to obtain more abundant semantic information features, we modify the upper layer structure of Bert. Finally, our system achieves an accuracy of 88.90% and F1 score of 86.30% by hard voting, which ranks 6th on the final leader board of the in subtask 1 competition.

Automatic Detecting for Health-related Twitter Data with BioBERT
Yang Bai | Xiaobing Zhou
Proceedings of the Fifth Social Media Mining for Health Applications Workshop & Shared Task

Social media used for health applications usually contains a large amount of data posted by users, which brings various challenges to NLP, such as spoken language, spelling errors, novel/creative phrases, etc. In this paper, we describe our system submitted to SMM4H 2020: Social Media Mining for Health Applications Shared Task which consists of five sub-tasks. We participate in subtask 1, subtask 2-English, and subtask 5. Our final submitted approach is an ensemble of various fine-tuned transformer-based models. We illustrate that these approaches perform well in imbalanced datasets (For example, the class ratio is 1:10 in subtask 2), but our model performance is not good in extremely imbalanced datasets (For example, the class ratio is 1:400 in subtask 1). Finally, in subtask 1, our result is lower than the average score, in subtask 2-English, our result is higher than the average score, and in subtask 5, our result achieves the highest score. The code is available online.

Co-authors

Surya Teja Appini 1

Jingxiang Chen 1

Anthony Colas 1

Seong-Gyun Leem 1

Zhaojiang Lin 1

Florian Metze 1

Zhicheng Ouyang 1

Ariya Rastrow 1

Venues