2024
SimSCOOD: Systematic Analysis of Out-of-Distribution Generalization in Fine-tuned Source Code Models
Hossein Hajipour, Ning Yu, Cristian-Alexandru Staicu, Mario Fritz
Findings of the Association for Computational Linguistics: NAACL 2024
Large code datasets have become increasingly accessible for pre-training source code models. However, for the fine-tuning phase, obtaining representative training data that fully covers the code distribution for specific downstream tasks remains challenging due to the task-specific nature and limited labeling resources. These constraints lead to out-of-distribution (OOD) generalization issues with unexpected model inference behaviors that have not yet been systematically studied. In this paper, we contribute the first systematic approach that simulates various OOD scenarios along different dimensions of source code data properties and studies the fine-tuned model behaviors in such scenarios. We investigate the behaviors of models under different fine-tuning methodologies, including full fine-tuning and Low-Rank Adaptation (LoRA) fine-tuning methods. Our comprehensive analysis, conducted on four state-of-the-art pretrained models and applied to two code generation tasks, exposes multiple failure modes attributed to OOD generalization issues.
FASTTRACK: Reliable Fact Tracing via Clustering and LLM-Powered Evidence Validation
Si Chen, Feiyang Kang, Ning Yu, Ruoxi Jia
Findings of the Association for Computational Linguistics: EMNLP 2024
Fact tracing seeks to identify specific training examples that serve as the knowledge source for a given query. Existing approaches to fact tracing rely on assessing the similarity between each training sample and the query along a certain dimension, such as lexical similarity, gradient, or embedding space. However, these methods fail to distinguish effectively between samples that are merely relevant and those that actually provide supportive evidence for the information sought by the query, which often results in suboptimal effectiveness. Moreover, these approaches must examine the similarity of every individual training point for each query, imposing significant computational demands and creating a substantial barrier to practical application. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries and, at the same time, clusters the training database to narrow the scope that LLMs must search when tracing facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100% improvement in F1 score over the state-of-the-art methods while being 33x faster than TracIn.
2021
基于义原表示学习的词向量表示方法(Word Representation based on Sememe Representation Learning)
Ning Yu (于宁), Jiangping Wang (王江萍), Yu Shi (石宇), Jianyi Liu (刘建毅)
Proceedings of the 20th Chinese National Conference on Computational Linguistics
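This paper draws on the knowledge in HowNet and transfers the structure and ideas of the Word2vec model to sememe representation learning, proposing a word vector representation method based on sememe representation learning. First, we use OpenHowNet to obtain all sememes in the sememe knowledge base, all Chinese words, and the set of sememes corresponding to each Chinese word, which together serve as the experimental dataset. Then, based on the Skip-gram model, we train a sememe representation learning model and derive word vectors from it. Finally, we evaluate the resulting word vectors through word similarity, word sense disambiguation, and word analogy tasks, and by inspecting nearest-neighbor sememes. Comparison with baseline models shows that the proposed method is both efficient and accurate: it does not depend on large-scale corpora or require complex network structures and numerous parameters, yet it can improve the accuracy of various natural language processing tasks.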
2020
Corpus Development for Studying Online Disinformation Campaign: A Narrative + Stance Approach
Mack Blackburn, Ning Yu, John Berrie, Brian Gordon, David Longfellow, William Tirrell, Mark Williams
Proceedings for the First International Workshop on Social Threats in Online Conversations: Understanding and Management
Disinformation on social media is affecting our personal lives and society. The outbreak of the new coronavirus is the most recent example, in which a wealth of disinformation provoked fear, hate, and even social panic. While there is emerging interest in studying how disinformation campaigns form, spread, and influence target audiences, developing disinformation campaign corpora is challenging given the high volume, fast evolution, and wide variation of messages associated with each campaign. Disinformation cannot always be captured by simple fact-checking, which makes it even more challenging to validate and create ground truth. This paper presents our approach to developing a corpus for studying disinformation campaigns targeting the White Helmets of Syria. We bypass directly classifying a piece of information as disinformation or not. Instead, we label the narrative and stance of tweets and YouTube comments about the White Helmets. A narrative is defined as a recurring statement that is used to express a point of view. Stance is a high-level point of view on a topic. We demonstrate that narrative and stance together can provide a dynamic method for real-world users, e.g., intelligence analysts, to quickly identify and counter disinformation campaigns based on their knowledge at the time.
2014
Feature Selection for Highly Skewed Sentiment Analysis Tasks
Can Liu, Sandra Kübler, Ning Yu
Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP)
“My Curiosity was Satisfied, but not in a Good Way”: Predicting User Ratings for Online Recipes
Can Liu, Chun Guo, Daniel Dakota, Sridhar Rajagopalan, Wen Li, Sandra Kübler, Ning Yu
Proceedings of the Second Workshop on Natural Language Processing for Social Media (SocialNLP)
2011
Filling the Gap: Semi-Supervised Learning for Opinion Detection Across Domains
Ning Yu, Sandra Kübler
Proceedings of the Fifteenth Conference on Computational Natural Language Learning