2024
pdf
bib
abs
BLSP-Emo: Towards Empathetic Large Speech-Language Models
Chen Wang
|
Minpeng Liao
|
Zhongqiang Huang
|
Junhong Wu
|
Chengqing Zong
|
Jiajun Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The recent release of GPT-4o showcased the potential of end-to-end multimodal models, not just in terms of low latency but also in their ability to understand and generate expressive speech with rich emotions. While the details are unknown to the open research community, it likely involves significant amounts of curated data and compute, neither of which is readily accessible. In this paper, we present BLSP-Emo (Bootstrapped Language-Speech Pretraining with Emotion support), a novel approach to developing an end-to-end speech-language model capable of understanding both semantics and emotions in speech and generate empathetic responses. BLSP-Emo utilizes existing speech recognition (ASR) and speech emotion recognition (SER) datasets through a two-stage process. The first stage focuses on semantic alignment, following recent work on pretraining speech-language models using ASR data. The second stage performs emotion alignment with the pretrained speech-language model on an emotion-aware continuation task constructed from SER data. Our experiments demonstrate that the BLSP-Emo model excels in comprehending speech and delivering empathetic responses, both in instruction-following tasks and conversations.
pdf
bib
abs
MARIO: MAth Reasoning with code Interpreter Output - A Reproducible Pipeline
Minpeng Liao
|
Chengxi Li
|
Wei Luo
|
Wu Jing
|
Kai Fan
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) have significantly improved in understanding natural language but still lack in mathematical reasoning, a hurdle on the path to true artificial general intelligence. The training of large language models, based on next-token prediction, struggles to capture the precise nature of mathematical reasoning, presenting both practical and theoretical challenges. In this paper, we address this challenge by enriching the data landscape and introducing a reasonable data format, enhanced the text analysis of the LLM with a capability to utilize a Python code interpreter. This dataset is derived from GSM8K and MATH and has been further refined through a combination of GPT annotations, human review, and self-training processes. Additionally, we propose a tentative, easily replicable protocol for the fine-tuning of math-specific LLMs, which has led to a significant improvement in the performance of a 7B-parameter LLM on the GSM8K and MATH datasets. A solution generator and a value estimator are fine-tuned simultaneously in a multi-task fashion, while an outlier-free value model-based inference method is proposed to further boost the performance. We are committed to advancing the field of mathematical reasoning in LLMs and, to that end, we will make the source code and checkpoints publicly available.
pdf
bib
abs
wav2vec-S: Adapting Pre-trained Speech Models for Streaming
Biao Fu
|
Kai Fan
|
Minpeng Liao
|
Yidong Chen
|
Xiaodong Shi
|
Zhongqiang Huang
Findings of the Association for Computational Linguistics: ACL 2024
Pre-trained speech models, such as wav2vec 2.0, have significantly advanced speech-related tasks, including speech recognition and translation. However, their applicability in streaming scenarios is limited because these models are trained on complete utterances, leading to a mismatch with incremental streaming inputs. This paper identifies three critical design aspects within the architecture of wav2vec 2.0 and proposes a novel model, wav2vec-S, which incorporates simple modifications to ensure consistent speech representations during both training and inference phases for streaming speech inputs. Furthermore, we demonstrate that wav2vec-S models can be efficiently adapted from pre-trained wav2vec 2.0 models through continued pre-training and effectively finetuned to meet various latency requirements in downstream applications. Experiments on speech recognition and translation tasks show that wav2vec-S outperforms strong baseline models and achieves a superior balance between quality and latency.
pdf
bib
abs
Step-level Value Preference Optimization for Mathematical Reasoning
Guoxin Chen
|
Minpeng Liao
|
Chengxi Li
|
Kai Fan
Findings of the Association for Computational Linguistics: EMNLP 2024
Direct Preference Optimization (DPO) using an implicit reward model has proven to be an effective alternative to reinforcement learning from human feedback (RLHF) for fine-tuning preference aligned large language models (LLMs). However, the overall preference annotations of responses do not fully capture the fine-grained quality of model outputs in complex multi-step reasoning tasks, such as mathematical reasoning. To address this limitation, we introduce a novel algorithm called Step-level Value Preference Optimization (SVPO). Our approach employs Monte Carlo Tree Search (MCTS) to automatically annotate step-level preferences for multi-step reasoning. Furthermore, from the perspective of learning-to-rank, we train an explicit value model to replicate the behavior of the implicit reward model, complementing standard preference optimization. This value model enables the LLM to generate higher reward responses with minimal cost during inference. Experimental results demonstrate that our method achieves state-of-the-art performance on both in-domain and out-of-domain mathematical reasoning benchmarks.
2023
pdf
bib
abs
Towards Zero-shot Learning for End-to-end Cross-modal Translation Models
Jichen Yang
|
Kai Fan
|
Minpeng Liao
|
Boxing Chen
|
Zhongqiang Huang
Findings of the Association for Computational Linguistics: EMNLP 2023
One of the main problems in speech translation is the mismatches between different modalities. The second problem, scarcity of parallel data covering multiple modalities, means that the end-to-end multi-modal models tend to perform worse than cascade models, although there are exceptions under favorable conditions. To address these problems, we propose an end-to-end zero-shot speech translation model, connecting two pre-trained uni-modality modules via word rotator’s distance. The model retains the ability of zero-shot, which is like cascade models, and also can be trained in an end-to-end style to avoid error propagation. Our comprehensive experiments on the MuST-C benchmarks show that our end-to-end zero-shot approach performs better than or as well as those of the CTC-based cascade models and that our end-to-end model with supervised training also matches the latest baselines.
pdf
bib
abs
Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference
Biao Fu
|
Minpeng Liao
|
Kai Fan
|
Zhongqiang Huang
|
Boxing Chen
|
Yidong Chen
|
Xiaodong Shi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
A popular approach to streaming speech translation is to employ a single offline model with a wait-k policy to support different latency requirements, which is simpler than training multiple online models with different latency constraints. However, there is a mismatch problem in using a model trained with complete utterances for streaming inference with partial input. We demonstrate that speech representations extracted at the end of a streaming input are significantly different from those extracted from a complete utterance. To address this issue, we propose a new approach called Future-Aware Streaming Translation (FAST) that adapts an offline ST model for streaming input. FAST includes a Future-Aware Inference (FAI) strategy that incorporates future context through a trainable masked embedding, and a Future-Aware Distillation (FAD) framework that transfers future context from an approximation of full speech to streaming input. Our experiments on the MuST-C EnDe, EnEs, and EnFr benchmarks show that FAST achieves better trade-offs between translation quality and latency than strong baselines. Extensive analyses suggest that our methods effectively alleviate the aforementioned mismatch problem between offline training and online inference.