Qian Yang


2024

pdf bib
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
Qian Yang | Jin Xu | Wenrui Liu | Yunfei Chu | Ziyue Jiang | Xiaohuan Zhou | Yichong Leng | Yuanjun Lv | Zhou Zhao | Chang Zhou | Jingren Zhou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recently, instruction-following audio-language models have received broad attention for human-audio interaction. However, the absence of benchmarks capable of evaluating audio-centric interaction capabilities has impeded advancements in this field. Previous models primarily focus on assessing different fundamental tasks, such as automatic speech recognition, and lack an assessment of the open-ended generative capabilities centered around audio. Thus, it is challenging to track the progression in the Large Audio-Language Models (LALMs) domain and to provide guidance for future improvement.In this paper, we introduce AIR-Bench (Audio InstRuction Benchmark), the first benchmark designed to evaluate the ability of LALMs to understand various types of audio signals (including human speech, natural sounds, and music), and furthermore, to interact with humans in the textual format. AIR-Bench encompasses two dimensions: foundation and chat benchmarks. The former consists of 19 tasks with approximately 19k single-choice questions, intending to inspect the basic single-task ability of LALMs. The latter one contains 2k instances of open-ended question-and-answer data, directly assessing the comprehension of the model on complex audio and its capacity to follow instructions. Both benchmarks require the model to generate hypotheses directly. We design a unified framework that leverages advanced language models, such as GPT-4, to evaluate the scores of generated hypotheses given the meta-information of the audio. Experimental results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation. By revealing the limitations of existing LALMs through evaluation results, AIR-Bench can provide insights into the direction of future research. Dataset and evaluation code are available at https://github.com/OFA-Sys/AIR-Bench.

pdf bib
Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison
Qian Yang | Weixiang Yan | Aishwarya Agrawal
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM’s internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, DeCC measures the reliability of VLM’s direct answer. Experiments across six vision-language tasks with three VLMs show DeCC’s reliability estimation achieves better correlation with task accuracy compared to the existing methods.

pdf bib
Exploring the Best Practices of Query Expansion with Large Language Models
Le Zhang | Yihong Wu | Qian Yang | Jian-Yun Nie
Findings of the Association for Computational Linguistics: EMNLP 2024

Large Language Models (LLMs) are foundational in language technologies, particularly in information retrieval (IR). In this paper, we thoroughly explore the best practice of leveraging LLMs for query expansion. To this end, we introduce a training-free, straightforward yet effective framework called Multi-Text Generation Integration (MuGI). This approach leverages LLMs to generate multiple pseudo-references, which are then integrated with the original queries to enhance both sparse and dense retrieval methods. Additionally, we introduce a retrieval pipeline based on MuGI, which combines the strengths of sparse and dense retrievers to achieve superior performance without the need for costly pre-indexing. Our empirical findings reveal that: (1) Increasing the number of samples from LLMs benefits IR systems; (2) A balance between the query and pseudo-documents, and an effective integration strategy, is critical for high performance; (3) Contextual information from LLMs is essential, even boost a 23M model to outperform a 7B baseline model; (4) Pseudo relevance feedback can further calibrate queries for improved performance; and (5) Query expansion is widely applicable and versatile, consistently enhancing models ranging from 23M to 7B parameters. Our code and all generated references are made available at https://github.com/lezhang7/Retrieval_MuGI.

2023

pdf bib
FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models
Ziyue Jiang | Qian Yang | Jialong Zuo | Zhenhui Ye | Rongjie Huang | Yi Ren | Zhou Zhao
Findings of the Association for Computational Linguistics: ACL 2023

Stutter removal is an essential scenario in the field of speech editing. However, when the speech recording contains stutters, the existing text-based speech editing approaches still suffer from: 1) the over-smoothing problem in the edited speech; 2) lack of robustness due to the noise introduced by stutter; 3) to remove the stutters, users are required to determine the edited region manually. To tackle the challenges in stutter removal, we propose FluentSpeech, a stutter-oriented automatic speech editing model. Specifically, 1) we propose a context-aware diffusion model that iteratively refines the modified mel-spectrogram with the guidance of context features; 2) we introduce a stutter predictor module to inject the stutter information into the hidden sequence; 3) we also propose a stutter-oriented automatic speech editing (SASE) dataset that contains spontaneous speech recordings with time-aligned stutter labels to train the automatic stutter localization model. Experimental results on VCTK and LibriTTS datasets demonstrate that our model achieves state-of-the-art performance on speech editing. Further experiments on our SASE dataset show that FluentSpeech can effectively improve the fluency of stuttering speech in terms of objective and subjective metrics. Code and audio samples can be found at https://github.com/Zain-Jiang/Speech-Editing-Toolkit.

pdf bib
A Framework for Exploring Player Perceptions of LLM-Generated Dialogue in Commercial Video Games
Nader Akoury | Qian Yang | Mohit Iyyer
Findings of the Association for Computational Linguistics: EMNLP 2023

The growing capabilities of large language models (LLMs) have inspired recent efforts to integrate LLM-generated dialogue into video games. However, evaluation remains a major challenge: how do we assess the player experience in a commercial game augmented with LLM-generated dialogue? To explore this question, we introduce a dynamic evaluation framework for the dialogue management systems that govern the task-oriented dialogue often found in roleplaying video games. We first extract dialogue from the widely-acclaimed role-playing game *Disco Elysium: The Final Cut*, which contains 1.1M words of dialogue spread across a complex graph of utterances where node reachability depends on game state (e.g., whether a certain item is held). Using this dataset, we have GPT-4 perform *dialogue infilling* to generate grounded utterances based on game state represented via code. In a statistically robust study of 28 players recruited from the r/DiscoyElysium subreddit, the LLM outputs are evaluated against the game designers’ writing via both preference judgments and free-form feedback using a web interface that recreates the game’s core conversation functionality. Overall, the game designers’ prose is significantly preferred to GPT-4 generations, with participants citing reasons such as improved logical flow and grounding with the game state. To spur more principled future research in this area, we release our web interface and tools to enable researchers to build upon our work. https://pl.aiwright.dev

2022

pdf bib
Proceedings of the Second Workshop on Bridging Human--Computer Interaction and Natural Language Processing
Su Lin Blodgett | Hal Daumé III | Michael Madaio | Ani Nenkova | Brendan O'Connor | Hanna Wallach | Qian Yang
Proceedings of the Second Workshop on Bridging Human--Computer Interaction and Natural Language Processing

2021

pdf bib
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing
Su Lin Blodgett | Michael Madaio | Brendan O'Connor | Hanna Wallach | Qian Yang
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

2020

pdf bib
Improving Text Generation with Student-Forcing Optimal Transport
Jianqiao Li | Chunyuan Li | Guoyin Wang | Hao Fu | Yuhchen Lin | Liqun Chen | Yizhe Zhang | Chenyang Tao | Ruiyi Zhang | Wenlin Wang | Dinghan Shen | Qian Yang | Lawrence Carin
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Neural language models are often trained with maximum likelihood estimation (MLE), where the next word is generated conditioned on the ground-truth word tokens. During testing, however, the model is instead conditioned on previously generated tokens, resulting in what is termed exposure bias. To reduce this gap between training and testing, we propose using optimal transport (OT) to match the sequences generated in these two modes. We examine the necessity of adding Student-Forcing scheme during training with an imitation learning interpretation. An extension is further proposed to improve the OT learning for long sequences, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.

2019

pdf bib
An End-to-End Generative Architecture for Paraphrase Generation
Qian Yang | Zhouyuan Huo | Dinghan Shen | Yong Cheng | Wenlin Wang | Guoyin Wang | Lawrence Carin
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Generating high-quality paraphrases is a fundamental yet challenging natural language processing task. Despite the effectiveness of previous work based on generative models, there remain problems with exposure bias in recurrent neural networks, and often a failure to generate realistic sentences. To overcome these challenges, we propose the first end-to-end conditional generative architecture for generating paraphrases via adversarial training, which does not depend on extra linguistic information. Extensive experiments on four public datasets demonstrate the proposed method achieves state-of-the-art results, outperforming previous generative architectures on both automatic metrics (BLEU, METEOR, and TER) and human evaluations.

pdf bib
Learning Compressed Sentence Representations for On-Device Text Processing
Dinghan Shen | Pengyu Cheng | Dhanasekar Sundararaman | Xinyuan Zhang | Qian Yang | Meng Tang | Asli Celikyilmaz | Lawrence Carin
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Vector representations of sentences, trained on massive text corpora, are widely used as generic sentence embeddings across a variety of NLP problems. The learned representations are generally assumed to be continuous and real-valued, giving rise to a large memory footprint and slow retrieval speed, which hinders their applicability to low-resource (memory and computation) platforms, such as mobile devices. In this paper, we propose four different strategies to transform continuous and generic sentence embeddings into a binarized form, while preserving their rich semantic information. The introduced methods are evaluated across a wide range of downstream tasks, where the binarized sentence embeddings are demonstrated to degrade performance by only about 2% relative to their continuous counterparts, while reducing the storage requirement by over 98%. Moreover, with the learned binary representations, the semantic relatedness of two sentences can be evaluated by simply calculating their Hamming distance, which is more computational efficient compared with the inner product operation between continuous embeddings. Detailed analysis and case study further validate the effectiveness of proposed methods.

2006

pdf bib
Development of a phoneme-to-phoneme (p2p) converter to improve the grapheme-to-phoneme (g2p) conversion of names
Qian Yang | Jean-Pierre Martens | Nanneke Konings | Henk van den Heuvel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

It is acknowledged that a good phonemic transcription of proper names is imperative for the success of many modern speech-based services such as directory assistance, car navigation, etc. It is also known that state-of-the-art general-purpose grapheme-to-phoneme (g2p) converters perform rather poorly on many name categories. This paper proposes to use a g2p-p2p tandem comprising a state-of-the-art general-purpose g2p converter that produces an initial transcription and a name category specific phoneme-to-phoneme (p2p) converter that aims at correcting the mistakes made by the g2p converter. The main body of the paper describes a novel methodology for the automatic construction of the p2p converter. The methodology is implemented in a software toolbox that will be made publicly available in a form that will permit the user to design a p2p converter for an arbitrary name category. To give a proof of concept, the toolbox was used for the development of three p2p converters for first names, surnames and geographical names respectively. The obtained systems are small (few rules) and effective: significant improvements (up to 50% relative) of the grapheme-to-phoneme conversion are obtained. These encouraging results call for a further development and improvement of the approach.