Lining Zhang


2023

pdf bib
A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization
Lining Zhang | Simon Mille | Yufang Hou | Daniel Deutsch | Elizabeth Clark | Yixin Liu | Saad Mahamood | Sebastian Gehrmann | Miruna Clinciu | Khyathi Raghavi Chandu | João Sedoc
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

To prevent the costly and inefficient use of resources on low-quality annotations, we want a method for creating a pool of dependable annotators who can effectively complete difficult tasks, such as evaluating automatic summarization. Thus, we investigate the recruitment of high-quality Amazon Mechanical Turk workers via a two-step pipeline. We show that we can successfully filter out subpar workers before they carry out the evaluations and obtain high-agreement annotations with similar constraints on resources. Although our workers demonstrate a strong consensus among themselves and CloudResearch workers, their alignment with expert judgments on a subset of the data is not as expected and needs further training in correctness. This paper still serves as a best practice for the recruitment of qualified annotators in other challenging annotation tasks.

pdf bib
Common Law Annotations: Investigating the Stability of Dialog System Output Annotations
Seunggun Lee | Alexandra DeLucia | Nikita Nangia | Praneeth Ganedi | Ryan Guan | Rubing Li | Britney Ngaw | Aditya Singhal | Shalaka Vaidya | Zijun Yuan | Lining Zhang | João Sedoc
Findings of the Association for Computational Linguistics: ACL 2023

Metrics for Inter-Annotator Agreement (IAA), like Cohen’s Kappa, are crucial for validating annotated datasets. Although high agreement is often used to show the reliability of annotation procedures, it is insufficient to ensure or reproducibility. While researchers are encouraged to increase annotator agreement, this can lead to specific and tailored annotation guidelines. We hypothesize that this may result in diverging annotations from different groups. To study this, we first propose the Lee et al. Protocol (LEAP), a standardized and codified annotation protocol. LEAP strictly enforces transparency in the annotation process, which ensures reproducibility of annotation guidelines. Using LEAP to annotate a dialog dataset, we empirically show that while research groups may create reliable guidelines by raising agreement, this can cause divergent annotations across different research groups, thus questioning the validity of the annotations. Therefore, we caution NLP researchers against using reliability as a proxy for reproducibility and validity.

2022

pdf bib
Probing GPT-3’s Linguistic Knowledge on Semantic Tasks
Lining Zhang | Mengchen Wang | Liben Chen | Wenxin Zhang
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

GPT-3 has attracted much attention from both academia and industry. However, it is still unclear what GPT-3 has understood or learned especially in linguistic knowledge. Some studies have shown linguistic phenomena including negation and tense are hard to be recognized by language models such as BERT. In this study, we conduct probing tasks focusing on semantic information. Specifically, we investigate GPT-3’s linguistic knowledge on semantic tasks to identify tense, the number of subjects, and the number of objects for a given sentence. We also experiment with different prompt designs and temperatures of the decoding method. Our experiment results suggest that GPT-3 has acquired linguistic knowledge to identify certain semantic information in most cases, but still fails when there are some types of disturbance happening in the sentence. We also perform error analysis to summarize some common types of mistakes that GPT-3 has made when dealing with certain semantic information.