Yulong Pei


2024

More than Minorities and Majorities: Understanding Multilateral Bias in Language Generation
Jiaxu Zhao | Zijing Shi | Yitong Li | Yulong Pei | Ling Chen | Meng Fang | Mykola Pechenizkiy
Findings of the Association for Computational Linguistics ACL 2024

Pretrained models learned from real corpora can often capture undesirable features, leading to bias issues against different demographic groups. Most existing studies on bias dataset construction or bias mitigation methods focus on only one demographic group pair to study a certain bias, e.g., black vs. white for racial bias. However, in real-world applications, there are more than two demographic groups that are at risk of the same bias. In this paper, we propose to analyze and reduce biases across multiple demographic groups. We collect and build a multi-demographic bias dataset including five commonly discussed bias dimensions. To mitigate multi-demographic bias, we adopt several novel debiasing methods, including regularisation-based and augmentation-based methods, as well as appropriate evaluation metrics for multi-demographic bias measurement. Experimental results on the proposed multi-demographic dataset show that a fairer model can be achieved using a multi-demographic debiasing approach. Also, the model debiased using the proposed multi-demographic debiasing methods can better transfer to unseen demographics without sacrificing the performance of the pretrained model.

Can GPT models be Financial Analysts? An Evaluation of ChatGPT and GPT-4 on mock CFA Exams
Ethan Callanan | Amarachi Mbakwe | Antony Papadimitriou | Yulong Pei | Mathieu Sibue | Xiaodan Zhu | Zhiqiang Ma | Xiaomo Liu | Sameena Shah
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning

DocLLM: A Layout-Aware Generative Language Model for Multimodal Document Understanding
Dongsheng Wang | Natraj Raman | Mathieu Sibue | Zhiqiang Ma | Petr Babkin | Simerjot Kaur | Yulong Pei | Armineh Nourbakhsh | Xiaomo Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Enterprise documents such as forms, receipts, reports, and other such records often carry rich semantics at the intersection of textual and spatial modalities. The visual cues offered by their complex layouts play a crucial role in comprehending these documents effectively. In this paper, we present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Our model differs from existing multimodal LLMs by avoiding expensive image encoders and focusing exclusively on bounding box information to incorporate the spatial layout structure. Specifically, the cross-alignment between text and spatial modalities is captured by decomposing the attention mechanism in classical transformers into a set of disentangled matrices. Furthermore, we devise a pre-training objective that learns to infill text segments. This approach allows us to address irregular layouts and heterogeneous content frequently encountered in visual documents. The pre-trained model is fine-tuned using a large-scale instruction dataset, covering four core document intelligence tasks. We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.

2023

Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks
Xianzhi Li | Samuel Chan | Xiaodan Zhu | Yulong Pei | Zhiqiang Ma | Xiaomo Liu | Sameena Shah
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track

The most recent large language models (LLMs) such as ChatGPT and GPT-4 have shown exceptional capabilities as generalist models, achieving state-of-the-art performance on a wide range of NLP tasks with little or no adaptation. How effective are such models in the finance domain? Understanding this basic question would have a significant impact on many downstream financial analytical tasks. In this paper, we conduct empirical studies and provide experimental evidence of their performance on a wide variety of financial text analytical problems, using eight benchmark datasets from five categories of tasks. We report both the strengths and limitations of the current models by comparing them to the state-of-the-art fine-tuned approaches and the recently released domain-specific pretrained models. We hope our study can help to understand the capability of the existing models in the financial domain and facilitate further improvements.

2022

TweetFinSent: A Dataset of Stock Sentiments on Twitter
Yulong Pei | Amarachi Mbakwe | Akshat Gupta | Salwa Alamir | Hanxuan Lin | Xiaomo Liu | Sameena Shah
Proceedings of the Fourth Workshop on Financial Technology and Natural Language Processing (FinNLP)

Stock sentiment correlates strongly with the stock market, but the traditional sentiment analysis task classifies sentiment by whether the expressed feelings and emotions are good or bad. This definition of sentiment is not an accurate indicator of public opinion about specific stocks. To bridge this gap, we introduce a new task of stock sentiment analysis and present a new dataset for this task named TweetFinSent. In TweetFinSent, tweets are annotated based on whether one gained or expected to gain a positive or negative return from a stock. Experiments on TweetFinSent with several sentiment analysis models, from lexicon-based to transformer-based, have been conducted. Experimental results show that the TweetFinSent dataset constitutes a challenging problem and that there is ample room for improvement on the stock sentiment analysis task. TweetFinSent is available at https://github.com/jpmcair/tweetfinsent.

2012

A Supervised Aggregation Framework for Multi-Document Summarization
Yulong Pei | Wenpeng Yin | Qifeng Fan | Lian’en Huang
Proceedings of COLING 2012

RelationListwise for Query-Focused Multi-Document Summarization
Wenpeng Yin | Lifu Huang | Yulong Pei | Lian’en Huang
Proceedings of COLING 2012

SentTopic-MultiRank: a Novel Ranking Model for Multi-Document Summarization
Wenpeng Yin | Yulong Pei | Fan Zhang | Lian’en Huang
Proceedings of COLING 2012