Yifan Xu


2024

pdf bib
ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline
Yifan Xu | Xiao Liu | Xinghan Liu | Zhenyu Hou | Yueyan Li | Xiaohan Zhang | Zihan Wang | Aohan Zeng | Zhengxiao Du | Zhao Wenyi | Jie Tang | Yuxiao Dong
Findings of the Association for Computational Linguistics: EMNLP 2024

Large language models (LLMs) have shown excellent mastering of human language but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets to enhance LLMs’ mathematics are developed, it remains a challenge to simultaneously maintain and improve both language and mathematical capabilities in deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM’s own generations for data collection. Based on ChatGLM3-32B, we conduct experiments on both academic and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM’s mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger. Related techniques have been deployed to ChatGLM, an online serving LLM. Related evaluation datasets and scripts are released at https://github.com/THUDM/ChatGLM-Math.

pdf bib
Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?
Kai Sun | Yifan Xu | Hanwen Zha | Yue Liu | Xin Luna Dong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Since the recent prosperity of Large Language Models (LLMs), there have been interleaved discussions regarding how to reduce hallucinations from LLM responses, how to increase the factuality of LLMs, and whether Knowledge Graphs (KGs), which store the world knowledge in a symbolic form, will be replaced with LLMs. In this paper, we try to answer these questions from a new angle: How knowledgeable are LLMs?To answer this question, we constructed Head-to-Tail, a benchmark that consists of 18K question-answer (QA) pairs regarding head, torso, and tail facts in terms of popularity. We designed an automated evaluation method and a set of metrics that closely approximate the knowledge an LLM confidently internalizes. Through a comprehensive evaluation of 16 publicly available LLMs, we show that existing LLMs are still far from being perfect in terms of their grasp of factual knowledge, especially for facts of torso-to-tail entities.

pdf bib
Personalized Review Recommendation based on Implicit dimension mining
Bei Xu | Yifan Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Users usually browse product reviews before buying products from e-commerce websites. Lots of e-commerce websites can recommend reviews. However, existing research on review recommendation mainly focuses on the general usefulness of reviews and ignores personalized and implicit requirements. To address the issue, we propose a Large language model driven Personalized Review Recommendation model based on Implicit dimension mining (PRR-LI). The model mines implicit dimensions from reviews and requirements, and encodes them in the form of “text + dimension”. The experiments show that our model significantly outperforms other state-of-the-art textual models on the Amazon-MRHP dataset, with some of the metrics outperforming the state-of-the-art multimodal models. And we prove that encoding “text + dimension” is better than encoding “text” and “dimension” separately in review recommendation.

pdf bib
AlignBench: Benchmarking Chinese Alignment of Large Language Models
Xiao Liu | Xuanyu Lei | Shengyuan Wang | Yue Huang | Andrew Feng | Bosi Wen | Jiale Cheng | Pei Ke | Yifan Xu | Weng Lam Tam | Xiaohan Zhang | Lichao Sun | Xiaotao Gu | Hongning Wang | Jing Zhang | Minlie Huang | Yuxiao Dong | Jie Tang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effective evaluation of alignment for emerging Chinese LLMs is still significantly lacking, calling for real-scenario grounded, open-ended, challenging and automatic evaluations tailored for alignment. To fill in this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese. We tailor a human-in-the-loop data curation pipeline, containing 8 main categories, 683 real-scenario rooted queries and corresponding human verified references.To ensure references’ correctness, each knowledge-intensive query is accompanied with evidences collected from reliable webpages (including the url and quotation) by our annotators.For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (CITATION) with Chain-of-Thought to generate explanations and final ratings as evaluations, ensuring high reliability and interpretability.All evaluation codes and data are publicly available at https://github.com/THUDM/AlignBench

pdf bib
A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue Generation
Jifan Yu | Xiaohan Zhang | Yifan Xu | Xuanyu Lei | Zijun Yao | Jing Zhang | Lei Hou | Juanzi Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Empowered by the large-scale pretrained language models, existing dialogue systems have demonstrated impressive performance conducting fluent and natural-sounding conversations. However, they are still plagued by the <b>hallucination</b> problem, causing unpredictable factual errors in the generated responses. Recently, knowledge-grounded dialogue generation models, that intentionally invoke external knowledge resources to more informative responses, are also proven to be effective in reducing hallucination. Following the idea of getting high-quality knowledge, a few efforts have achieved pretty good performance on this issue. As some inevitable knowledge noises may also lead to hallucinations, it is emergent to investigate the reason and future directions for building noise-tolerant methods in KGD tasks. In this paper, we analyze the causal story behind this problem with counterfactual reasoning methods. Based on the causal effect analysis, we propose a possible solution for alleviating the hallucination in KGD by exploiting the dialogue-knowledge interaction. Experimental results of our example implementation show that this method can reduce hallucination without disrupting other dialogue performance, while keeping adaptive to different generation models. We hope our efforts can support and call for more attention to developing lightweight techniques towards robust and trusty dialogue systems.

2023

pdf bib
PersLEARN: Research Training through the Lens of Perspective Cultivation
Yu-Zhe Shi | Shiqian Li | Xinyi Niu | Qiao Xu | Jiawen Liu | Yifan Xu | Shiyu Gu | Bingru He | Xinyang Li | Xinyu Zhao | Zijian Zhao | Yidong Lyu | Zhen Li | Sijia Liu | Lin Qiu | Jinhao Ji | Lecheng Ruan | Yuxi Ma | Wenjuan Han | Yixin Zhu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Scientific research is inherently shaped by its authors’ perspectives, influenced by various factorssuch as their personality, community, or society. Junior researchers often face challenges in identifying the perspectives reflected in the existing literature and struggle to develop their own viewpoints. In response to this issue, we introduce PersLEARN , a tool designed to facilitate the cultivation of scientific perspectives, starting from a basic seed idea and progressing to a well-articulated framework. By interacting with a prompt-based model, researchers can develop their perspectives explicitly. Our humanstudy reveals that scientific perspectives developed by students using PersLEARN exhibit a superior level of logical coherence and depth compared to those that did not. Furthermore, our pipeline outperforms baseline approaches across multiple domains of literature from various perspectives. These results suggest that PersLEARN could help foster a greater appreciation of diversity in scientific perspectives as an essential component of research training.

2021

pdf bib
Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models
Tyler Chang | Yifan Xu | Weijian Xu | Zhuowen Tu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight convolutions, and we consider multiple new ways of integrating convolutions into Transformer self-attention. Specifically, we propose composite attention, which unites previous relative position encoding methods under a convolutional framework. We conduct experiments by training BERT with composite attention, finding that convolutions consistently improve performance on multiple downstream tasks, replacing absolute position embeddings. To inform future work, we present results comparing lightweight convolutions, dynamic convolutions, and depthwise-separable convolutions in language model pre-training, considering multiple injection points for convolutions in self-attention layers.