David K. Evans

Also published as: David Evans, David Kirk Evans


2024

Addressing Both Statistical and Causal Gender Fairness in NLP Models
Hannah Chen | Yangfeng Ji | David Evans
Findings of the Association for Computational Linguistics: NAACL 2024

Statistical fairness stipulates equivalent outcomes for every protected group, whereas causal fairness prescribes that a model makes the same prediction for an individual regardless of their protected characteristics. Counterfactual data augmentation (CDA) is effective for reducing bias in NLP models, yet models trained with CDA are often evaluated only on metrics that are closely tied to the causal fairness notion; similarly, sampling-based methods designed to promote statistical fairness are rarely evaluated for causal fairness. In this work, we evaluate both statistical and causal debiasing methods for gender bias in NLP models, and find that while such methods are effective at reducing bias as measured by the targeted metric, they do not necessarily improve results on other bias metrics. We demonstrate that combinations of statistical and causal debiasing techniques are able to reduce bias measured through both types of metrics.

2022

Balanced Adversarial Training: Balancing Tradeoffs between Fickleness and Obstinacy in NLP Models
Hannah Chen | Yangfeng Ji | David Evans
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Traditional (fickle) adversarial examples involve finding a small perturbation that does not change an input’s true label but confuses the classifier into outputting a different prediction. Conversely, obstinate adversarial examples occur when an adversary finds a small perturbation that preserves the classifier’s prediction but changes the true label of an input. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine learnt models to fickle adversarial examples. We show that standard adversarial training methods focused on reducing vulnerability to fickle adversarial examples may make a model more vulnerable to obstinate adversarial examples, with experiments for both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce Balanced Adversarial Training, which incorporates contrastive learning to increase robustness against both fickle and obstinate adversarial examples.

An Empirical Analysis of Memorization in Fine-tuned Autoregressive Language Models
Fatemehsadat Mireshghallah | Archit Uniyal | Tianhao Wang | David Evans | Taylor Berg-Kirkpatrick
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Large language models are shown to present privacy risks through memorization of training data, and several recent works have studied such risks for the pre-training phase. Little attention, however, has been given to the fine-tuning phase and it is not well understood how different fine-tuning methods (such as fine-tuning the full model, the model head, and adapter) compare in terms of memorization risk. This presents increasing concern as the “pre-train and fine-tune” paradigm proliferates. In this paper, we empirically study memorization of fine-tuning methods using membership inference and extraction attacks, and show that their susceptibility to attacks is very different. We observe that fine-tuning the head of the model has the highest susceptibility to attacks, whereas fine-tuning smaller adapters appears to be less vulnerable to known extraction attacks.

2020

Pointwise Paraphrase Appraisal is Potentially Problematic
Hannah Chen | Yangfeng Ji | David Evans
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

The prevailing approach for training and evaluating paraphrase identification models is constructed as a binary classification problem: the model is given a pair of sentences, and is judged by how accurately it classifies pairs as either paraphrases or non-paraphrases. This pointwise evaluation method does not match the objectives of most real-world applications well, so the goal of our work is to understand how models that perform well under pointwise evaluation may fail in practice and to find better methods for evaluating paraphrase identification models. As a first step towards that goal, we show that although the standard way of fine-tuning BERT for paraphrase identification by pairing two sentences as one sequence results in a model with state-of-the-art performance, that model may perform poorly on simple tasks like identifying pairs with two identical sentences. Moreover, we show that these models may even assign a pair of randomly selected sentences a higher paraphrase score than a pair of identical ones.

Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory
Hannah Chen | Yangfeng Ji | David Evans
Findings of the Association for Computational Linguistics: EMNLP 2020

Most NLP datasets are manually labeled, and so suffer from inconsistent labeling or limited size. We propose methods for automatically improving datasets by viewing them as graphs with expected semantic properties. We construct a paraphrase graph from the provided sentence pair labels, and create an augmented dataset by directly inferring labels from the original sentence pairs using a transitivity property. We use structural balance theory to identify likely mislabelings in the graph, and flip their labels. We evaluate our methods on paraphrase models trained using these datasets starting from a pretrained BERT model, and find that the automatically enhanced training sets result in more accurate models.

2008

A Japanese-English Technical Lexicon for Translation and Language Research
Fredric Gey | David Kirk Evans | Noriko Kando
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present a Japanese-English bilingual lexicon of technical terms. The lexicon was derived from the first and second NTCIR evaluation collections for research into cross-language information retrieval for Asian languages. While it can be utilized for translation between Japanese and English, the lexicon is also suitable for language research and language engineering. Since it is collection-derived, it contains instances of word variants and misspellings, which make it eminently suitable for further research. For a subset of the lexicon, we make available the collection statistics. In addition, we make available a Katakana subset suitable for transliteration research.

2004

Columbia Newsblaster: Multilingual News Summarization on the Web
David Kirk Evans | Judith L. Klavans | Kathleen R. McKeown
Demonstration Papers at HLT-NAACL 2004

2003

Columbia’s Newsblaster: New Features and Future Directions
Kathleen McKeown | Regina Barzilay | John Chen | David Elson | David Evans | Judith Klavans | Ani Nenkova | Barry Schiffman | Sergey Sigelman
Companion Volume of the Proceedings of HLT-NAACL 2003 - Demonstrations

2000

Evaluation of Automatically Identified Index Terms for Browsing Electronic Documents
Nina Wacholder | Judith L. Klavans | David K. Evans
Sixth Applied Natural Language Processing Conference

Evaluation of Computational Linguistic Techniques for Identifying Significant Topics for Browsing Applications
Judith L. Klavans | Nina Wacholder | David K. Evans
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)