Jiaqi Sun

Also published as: 佳琦


2025

Knowledge-based complex reasoning remains a significant challenge for large language models (LLMs) with in-context learning. To tackle this issue, previous studies focus on ensuring behavior fidelity, factuality, or reliability in generated reasoning processes that guide LLMs to produce solutions. However, these studies often neglect the simultaneous optimization on all these three aspects for each thought. The main challenges are the lack of comprehensive assessment mechanisms and the difficulty of efficient thought-level optimization. This paper introduces the Evolution of Thoughts (EoT) framework, which enhances the factuality, fidelity, and reliability of each thought in the reasoning process through a few LLM inferences. We propose a thought assessment method that is sensitive to knowledge and LLM behaviors, using three scorers to evaluate each thought by considering domain context, semantic alignment, and behavior impact. Additionally, we establish a self-reflective evolution mechanism to facilitate each reasoning process generation in a single-forward inference. Extensive experiments demonstrate that, for knowledge-based complex tasks, EoT improves the factuality and fidelity of reasoning processes by approximately 16.5% and 48.8%, respectively, while enhancing LLM reasoning capability by about 6.2%, outperforming advanced approaches.

2023

“蒙古语语料库中语音多样性匮乏,虽然花费人力和经费收集数据在一定程度上能够增加语音的数量,但整个过程需要耗费大量的时间。数据增广能够解决这种数据匮乏问题,但数据增广模型的训练数据包含的环境噪声无法控制,导致增广语音中存在背景噪声。本文提出一种TTS和语音增强相结合的语音数据增广方法,以语音的频谱图为基础,从频域和时域两个维度进行语音增强。通过多组实验证明,蒙古语增广语音的合格率达到70%,增广语音的CBAK和COVL分别下降了0.66和0.81,WER和SER下降了2.75%和2.05%。”

2021

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answers are either spliced into the questions or utilized as labels only for classification. On the other hand, trilinear models such as the CTI model efficiently utilize the inter-modality information between answers, questions, and images, while ignoring intra-modality information. Inspired by this observation, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), incorporating the attention mechanisms for capturing inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow where a bilinear model reduces the free-form, open-ended VQA problem into a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pre-train MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and VQA-1.0 Multiple Choice task and outperforms bilinear baselines on the VQA-2.0, TDIUC and GQA datasets.