Qiyu Wu

2025

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
Zhuoyuan Mao | Mengjie Zhao | Qiyu Wu | Hiromi Wakaki | Yuki Mitsufuji
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model’s ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We open-source the codes, models and datasets we constructed: https://github.com/sony/DeepResonance.

pdf bib abs

Prompt Tuning Can Simply Adapt Large Language Models to Text Encoders
Kaiyan Zhao | Qiyu Wu | Zhongtao Miao | Yoshimasa Tsuruoka
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)

Recently, many works have been attempting to adapt Large Language Models (LLMs) for sentence embedding, with most of them fine-tuning LLMs towards the contrastive objective and enabling bi-directional attention for better performance, using LoRA to address the large model scale.In this work, we suggest that this adaptation can also be simply and effectively achieved using causal attention and with even fewer trainable parameters through soft prompt tuning, as an alternative to fine-tuning with LoRA and other methods with extra post-training tasks.Our method only optimizes a few learnable tokens while keeping the rest of the model frozen.Through experiments on a diverse set of evaluation tasks, we find that simply tuning only a few tokens can achieve a competitive performance with that of fine-tuning with LoRA. The percentage of trainable parameters can be reduced to less than 0.001%. Moreover, we also demonstrate that turning causal attention to bi-directional attention with or without extra post-training tasks does not provide additional benefit when soft prompt tuning is applied, suggesting that causal attention can be naturally used in decoder-only LLMs for sentence embedding adaptation.

pdf bib abs

Improving Word Alignment Using Semi-Supervised Learning
Zhongtao Miao | Qiyu Wu | Masaaki Nagata | Yoshimasa Tsuruoka
Findings of the Association for Computational Linguistics: ACL 2025

Word alignment plays a crucial role in various natural language processing tasks, such as serving as cross-lingual signals for sentence embedding, reducing hallucination and omission in machine translation, and facilitating the construction of training data for simultaneous speech translation.Current state-of-the-art approaches usually rely on: (1) supervised data and large-scale weakly supervised data constructed from Wikipedia and (2) multilingual Transformer encoder-based models.However, we find that the current state-of-the-art encoder-based method, BinaryAlign, suffers from the issue of insufficient labeled data, and we further improve it with self-training with a small amount of parallel data. In addition, considering the impressive performance of multilingual large language models on many natural language processing tasks, we also explore the possibility of using these decoder-based large language models as word aligners. We observe that although fine-tuning large language models with labeled data produces acceptable results, augmenting the training with pseudo-labeled data further enhances model performance. Based on the findings, we propose a semi-supervised framework to improve the large language model-based word aligners. Experimental results demonstrate that the proposed method with a small amount of parallel data outperforms the current state-of-the-art method on various word alignment datasets.

pdf bib abs

Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV description directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV description and highlight specific musical attributes that warrant greater focus for improved MV description generation.

2024

pdf bib abs

Leveraging Multi-lingual Positive Instances in Contrastive Learning to Improve Sentence Embedding
Kaiyan Zhao | Qiyu Wu | Xin-Qiang Cai | Yoshimasa Tsuruoka
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning multilingual sentence embeddings is a fundamental task in natural language processing. Recent trends in learning both monolingual and multilingual sentence embeddings are mainly based on contrastive learning (CL) among an anchor, one positive, and multiple negative instances. In this work, we argue that leveraging multiple positives should be considered for multilingual sentence embeddings because (1) positives in a diverse set of languages can benefit cross-lingual learning, and (2) transitive similarity across multiple positives can provide reliable structural information for learning.In order to investigate the impact of multiple positives in CL, we propose a novel approach, named MPCL, to effectively utilize multiple positive instances to improve the learning of multilingual sentence embeddings. Experimental results on various backbone models and downstream tasks demonstrate that MPCL leads to better retrieval, semantic similarity, and classification performance compared to conventional CL. We also observe that in unseen languages, sentence embedding models trained on multiple positives show better cross-lingual transfer performance than models trained on a single positive instance.

pdf bib abs

Word Alignment as Preference for Machine Translation
Qiyu Wu | Masaaki Nagata | Zhongtao Miao | Yoshimasa Tsuruoka
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The problem of hallucination and omission, a long-standing problem in machine translation (MT), is more pronounced when a large language model (LLM) is used in MT because an LLM itself is susceptible to these phenomena. In this work, we mitigate the problem in an LLM-based MT model by guiding it to better word alignment. We first study the correlation between word alignment and the phenomena of hallucination and omission in MT. Then we propose to utilize word alignment as preference to optimize the LLM-based MT model. The preference data are constructed by selecting chosen and rejected translations from multiple MT tools. Subsequently, direct preference optimization is used to optimize the LLM-based model towards the preference signal. Given the absence of evaluators specifically designed for hallucination and omission in MT, we further propose selecting hard instances and utilizing GPT-4 to directly evaluate the performance of the models in mitigating these issues. We verify the rationality of these designed evaluation methods by experiments, followed by extensive results demonstrating the effectiveness of word alignment-based preference optimization to mitigate hallucination and omission. On the other hand, although it shows promise in mitigating hallucination and omission, the overall performance of MT in different language directions remains mixed, with slight increases in BLEU and decreases in COMET.

pdf bib abs

Enhancing Cross-lingual Sentence Embedding for Low-resource Languages with Word Alignment
Zhongtao Miao | Qiyu Wu | Kaiyan Zhao | Zilong Wu | Yoshimasa Tsuruoka
Findings of the Association for Computational Linguistics: NAACL 2024

The field of cross-lingual sentence embeddings has recently experienced significant advancements, but research concerning low-resource languages has lagged due to the scarcity of parallel corpora. This paper shows that cross-lingual word representation in low-resource languages is notably under-aligned with that in high-resource languages in current models. To address this, we introduce a novel framework that explicitly aligns words between English and eight low-resource languages, utilizing off-the-shelf word alignment models. This framework incorporates three primary training objectives: aligned word prediction and word translation ranking, along with the widely used translation ranking. We evaluate our approach through experiments on the bitext retrieval task, which demonstrate substantial improvements on sentence embeddings in low-resource languages. In addition, the competitive performance of the proposed model across a broader range of tasks in high-resource languages underscores its practicality.

2023

pdf bib abs

WSPAlign: Word Alignment Pre-training via Large-Scale Weakly Supervised Span Prediction
Qiyu Wu | Masaaki Nagata | Yoshimasa Tsuruoka
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most existing word alignment methods rely on manual alignment datasets or parallel corpora, which limits their usefulness. Here, to mitigate the dependence on manual data, we broaden the source of supervision by relaxing the requirement for correct, fully-aligned, and parallel sentences. Specifically, we make noisy, partially aligned, and non-parallel paragraphs in this paper. We then use such a large-scale weakly-supervised dataset for word alignment pre-training via span prediction. Extensive experiments with various settings empirically demonstrate that our approach, which is named WSPAlign, is an effective and scalable way to pre-train word aligners without manual data. When fine-tuned on standard benchmarks, WSPAlign has set a new state of the art by improving upon the best supervised baseline by 3.3 6.1 points in F1 and 1.5 6.1 points in AER. Furthermore, WSPAlign also achieves competitive performance compared with the corresponding baselines in few-shot, zero-shot and cross-lingual tests, which demonstrates that WSPAlign is potentially more practical for low-resource languages than existing methods.

2022

pdf bib abs

Learning sentence embeddings in an unsupervised manner is fundamental in natural language processing. Recent common practice is to couple pre-trained language models with unsupervised contrastive learning, whose success relies on augmenting a sentence with a semantically-close positive instance to construct contrastive pairs. Nonetheless, existing approaches usually depend on a mono-augmenting strategy, which causes learning shortcuts towards the augmenting biases and thus corrupts the quality of sentence embeddings. A straightforward solution is resorting to more diverse positives from a multi-augmenting strategy, while an open question remains about how to unsupervisedly learn from the diverse positives but with uneven augmenting qualities in the text field. As one answer, we propose a novel Peer-Contrastive Learning (PCL) with diverse augmentations. PCL constructs diverse contrastive positives and negatives at the group level for unsupervised sentence embeddings. PCL performs peer-positive contrast as well as peer-network cooperation, which offers an inherent anti-bias ability and an effective way to learn from diverse augmentations. Experiments on STS benchmarks verify the effectiveness of PCL against its competitors in unsupervised sentence embeddings.

Co-authors

Can Xu 1

Venues

EACL1

Fix author