Wenzhou Wu


2025

"The end-to-end speech translation task involves directly transforming speech into the text of another language, bypassing the generation of an intermediate transcription. However, existing methods may lose key information during cross-modal length alignment and fail to effectively integrate different representations, resulting in low quality of the fused representation. To address these issues, we propose an efficient method named CRAF for effective cross-modal alignment and fusion for speech translation, which reduces information loss and enhances the integration of cross-modal representations. First, CRAF minimizes information loss by improving the cross-modal length alignment, ensuring the alignment process retains more critical information from the speech modality. Second, CRAF strengthens the integration of cross-modal representations by allowing the model to combine complementary features from diverse modalities, enhancing its capacity to concentrate on the most pertinent and critical information. Finally, we evaluateCRAF by conducting extensive experiments on eight language pairs from the MuST-C dataset.Experiments show that the average BLEU score of CRAF achieves 29.0, outperforming other comparison methods. Our code is available at https://github.com/wu-wen-zhou/first/tree/master."