2024
FASTTRACK: Reliable Fact Tracing via Clustering and LLM-Powered Evidence Validation
Si Chen | Feiyang Kang | Ning Yu | Ruoxi Jia
Findings of the Association for Computational Linguistics: EMNLP 2024
Fact tracing seeks to identify the specific training examples that serve as the knowledge source for a given query. Existing approaches rely on assessing the similarity between each training sample and the query along some dimension, such as lexical overlap, gradients, or embedding space. However, these methods fall short of distinguishing samples that are merely relevant from those that actually provide supportive evidence for the information the query seeks, which often results in suboptimal effectiveness. Moreover, they require examining the similarity of every individual training point for each query, imposing significant computational demands and creating a substantial barrier to practical application. This paper introduces FASTTRACK, a novel approach that harnesses the capabilities of Large Language Models (LLMs) to validate supportive evidence for queries while clustering the training database to shrink the candidate set that the LLMs must examine when tracing facts. Our experiments show that FASTTRACK substantially outperforms existing methods in both accuracy and efficiency, achieving more than 100% improvement in F1 score over state-of-the-art methods while being 33x faster than TracIn.
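The abstract only outlines the cluster-then-validate pipeline; the fragment below is a minimal sketch of that idea, assuming a TF-IDF representation, k-means clustering, and a stand-in heuristic where the LLM validation call would go. None of these choices are claimed to match the paper's actual implementation.

```python
# A rough illustration of cluster-then-validate fact tracing.
# TF-IDF + k-means stand in for whatever representation and
# clustering FASTTRACK actually uses; `llm_validates` marks where
# an LLM prompt would judge evidential support.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def llm_validates(query: str, sample: str) -> bool:
    # Stand-in heuristic: in FASTTRACK an LLM checks that `sample`
    # actually supports `query`, not merely relates to it.
    return len(set(query.lower().split()) & set(sample.lower().split())) >= 3


def trace_facts(query: str, corpus: list[str], n_clusters: int = 8) -> list[str]:
    vec = TfidfVectorizer()
    X = vec.fit_transform(corpus)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    q = vec.transform([query])
    # Route the query to its nearest cluster centroid instead of
    # scoring every training point, which is where the speedup comes from.
    nearest = cosine_similarity(q, km.cluster_centers_).argmax()
    candidates = [s for s, c in zip(corpus, km.labels_) if c == nearest]
    # Only the reduced candidate set is passed to the validator.
    return [s for s in candidates if llm_validates(query, s)]
```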
2023
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
Chu-Ren Huang | Yasunari Harada | Jong-Bok Kim | Si Chen | Yu-Yin Hsu | Emmanuele Chersoni | Pranav A | Winnie Huiheng Zeng | Bo Peng | Yuxi Li | Junlin Li
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
2022
Just Fine-tune Twice: Selective Differential Privacy for Large Language Models
Weiyan Shi | Ryan Shea | Si Chen | Chiyuan Zhang | Ruoxi Jia | Zhou Yu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Protecting large language models from privacy leakage is becoming increasingly crucial with their wide adoption in real-world products. Yet applying *differential privacy* (DP), a canonical notion with provable privacy guarantees for machine learning models, to such models remains challenging due to the trade-off between model utility and privacy loss. Exploiting the fact that sensitive information in language data tends to be sparse, Shi et al. (2021) formalized an extension of DP called *Selective Differential Privacy* (SDP), which protects only the sensitive tokens defined by a policy function. However, their algorithm works only for RNN-based models. In this paper, we develop a novel framework, *Just Fine-tune Twice* (JFT), that achieves SDP for state-of-the-art large transformer-based models. Our method is easy to implement: it first fine-tunes the model on *redacted* in-domain data, and then fine-tunes it again on the *original* in-domain data using a private training mechanism. Furthermore, we study the scenario in which an imperfectly implemented policy function misses sensitive tokens, and develop systematic methods to handle it. Experiments show that our method achieves strong utility compared to previous baselines. We also analyze the SDP privacy guarantee empirically with the canary insertion attack.
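The two-phase recipe is simple enough to sketch. The toy fragment below assumes plain PyTorch on a toy regression model and a hand-rolled, simplified DP-SGD (per-example clipping plus Gaussian noise) rather than the paper's actual training stack, and it privatizes all tokens rather than applying the selective mechanism; a real run would use a library such as Opacus and track the privacy budget.

```python
# Toy sketch of Just Fine-tune Twice: ordinary training on redacted
# data, then private training on the original data. The DP step is a
# simplified DP-SGD; hyperparameters and the loss are illustrative.
import torch


def fit_public(model, redacted_batches, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in redacted_batches:  # phase 1: redacted in-domain data
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()


def fit_private(model, batches, lr=1e-3, clip=1.0, sigma=1.0):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in batches:  # phase 2: original in-domain data
        grads = [torch.zeros_like(p) for p in model.parameters()]
        for xi, yi in zip(x, y):  # per-example gradients
            model.zero_grad()
            torch.nn.functional.mse_loss(model(xi[None]), yi[None]).backward()
            norm = torch.sqrt(
                sum(p.grad.norm() ** 2 for p in model.parameters())
            ).item()
            scale = min(1.0, clip / (norm + 1e-6))
            for g, p in zip(grads, model.parameters()):
                g.add_(p.grad, alpha=scale)  # clip each example's gradient
        for g, p in zip(grads, model.parameters()):
            # Noisy averaged gradient: Gaussian noise scaled to the clip norm.
            p.grad = (g + torch.randn_like(g) * sigma * clip) / len(x)
        opt.step()


model = torch.nn.Linear(4, 1)
# fit_public(model, redacted_batches); fit_private(model, original_batches)
```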
CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems
Yi Huang | Xiaoting Wu | Si Chen | Wei Hu | Qing Zhu | Junlan Feng | Chao Deng | Zhijian Ou | Jiangjiang Zhao
Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD)
Dialogue modeling problems severely limit the real-world deployment of neural conversational models, and building a human-like dialogue agent is an extremely challenging task. Recently, data-driven models, which require huge amounts of conversation data, have become increasingly prevalent. In this paper, we release around 100,000 dialogues drawn from real-world transcripts of conversations between real users and customer-service staff. We call this dataset CMCC (China Mobile Customer Care); it differs significantly from existing dialogue datasets in both size and nature. The dataset reflects several characteristics of human-human conversations, e.g., task-driven and care-oriented interaction and long-term dependencies across the context. It also covers various dialogue types, including task-oriented dialogue, chitchat, and conversational recommendation in real-world scenarios. To our knowledge, CMCC is the largest real human-human spoken dialogue dataset, dozens of times larger than others, which should significantly promote the training and evaluation of dialogue modeling methods. The results of extensive experiments indicate that CMCC is challenging and needs further effort. We hope this resource will allow more effective models to be built for various dialogue sub-problems in the future.
2020
Marking Trustworthiness with Near Synonyms: A Corpus-based Study of “Renwei” and “Yiwei” in Chinese
Bei Li | Chu-Ren Huang | Si Chen
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation
A structure-enhanced graph convolutional network for sentiment analysis
Fanyu Meng | Junlan Feng | Danping Yin | Si Chen | Min Hu
Findings of the Association for Computational Linguistics: EMNLP 2020
Syntactic information is essential for both sentiment analysis (SA) and aspect-based sentiment analysis (ABSA). Previous work has achieved great progress by applying a Graph Convolutional Network (GCN) over the dependency tree of a sentence. However, these models do not fully exploit the syntactic information obtained from dependency parsing, such as the diverse types of dependency relations; the message-passing process of the GCN should be differentiated according to this syntactic information. To tackle this problem, we design a novel weighted graph convolutional network (WGCN) that can exploit rich syntactic information through feature combination. Furthermore, we use BERT instead of a Bi-LSTM to generate contextualized representations as inputs for the GCN, and present an alignment method to keep word-level dependencies consistent with BERT's wordpiece units. With our proposal, we improve the state of the art on four of six ABSA tasks and two of three SA tasks.
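As a rough sketch of relation-aware message passing (not the paper's exact feature-combination scheme), one can weight each dependency edge by a learned parameter for its relation type. The class name and the scalar per-relation weighting below are illustrative assumptions only.

```python
# Illustrative relation-weighted GCN layer: each dependency edge gets
# a learned weight for its relation type (nsubj, amod, ...), so message
# passing can distinguish relation types. The published WGCN combines
# syntactic features differently; this shows only the general idea.
import torch
import torch.nn as nn


class RelationWeightedGCN(nn.Module):
    def __init__(self, dim: int, n_relations: int):
        super().__init__()
        self.rel_weight = nn.Embedding(n_relations, 1)  # one scalar per relation
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, edges, rels):
        # h:     (n_tokens, dim) word-level contextual embeddings, e.g.
        #        BERT wordpiece vectors pooled back to dependency-tree words
        # edges: (n_edges, 2) head/dependent token index pairs
        # rels:  (n_edges,) dependency-relation ids
        n = h.size(0)
        adj = torch.zeros(n, n, device=h.device)
        w = self.rel_weight(rels).squeeze(-1)  # learned per-relation weights
        adj[edges[:, 0], edges[:, 1]] = w
        adj = adj + torch.eye(n, device=h.device)  # self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return torch.relu(self.linear((adj / deg) @ h))  # normalized aggregation
```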
2018
Effects of Stimulus Duration and Vowel Quality in Tone Perception by English Musicians and Non-musicians
Si Chen | Yiqing Zhu | Ratree Wayland | Yike Yang
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation
Perceptual evaluation of Mandarin tone sandhi production by Cantonese speakers before and after perceptual training
Bei Li | Yike Yang | Si Chen
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation
Semantic Transparency of Radicals in Chinese Characters: An Ontological Perspective
Yike Yang | Chu-Ren Huang | Sicong Dong | Si Chen
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation