2024
CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
Nikita Haduong | Irene Wang | Bo-Ru Lu | Prithviraj Ammanabrolu | Noah A. Smith
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Teams can outperform individuals; could adding AI teammates further bolster the performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI), but the team research literature finds that, for complex tasks, larger teams are more effective. Progress in studying collaboration with more than two agents, through textual records of team interactions, is hindered by a major data challenge: available CPS corpora are predominantly dyadic, and adapting pre-existing CPS tasks to more agents is non-trivial. We address this data challenge by developing a CPS task generator, CPS-TaskForge, that can produce environments for studying CPS under a wide array of conditions, and by releasing a CPS task design checklist grounded in the theoretical PISA 2015 CPS framework to facilitate the development of CPS corpora with more agents. CPS-TaskForge takes the form of a resource management (tower defense) game, and different CPS tasks can be studied by manipulating game design parameters. We conduct a case study with groups of 3–4 humans to validate that a game instance produced by CPS-TaskForge elicits diverse natural language CPS communication. We discuss opportunities for advancing research in CPS (with both human-only and human-AI teams) using different task configurations. We release all data and code.
2023
Risks and NLP Design: A Case Study on Procedural Document QA
Nikita Haduong | Alice Gao | Noah A. Smith
Findings of the Association for Computational Linguistics: ACL 2023
As NLP systems are increasingly deployed at scale, concerns about their potential negative impacts have attracted the attention of the research community, yet discussions of risk have mostly been at an abstract level and focused on generic AI or NLP applications. We argue that clearer assessments of risks and harms to users, and concrete strategies to mitigate them, will be possible when we specialize the analysis to more concrete applications and their plausible users. As an illustration, this paper is grounded in cooking recipe procedural document question answering (ProcDocQA), where there are well-defined risks to users such as injuries or allergic reactions. Our case study shows that an existing language model, applied in “zero-shot” mode, quantitatively answers real-world questions about recipes as well as or better than the humans who have answered the questions on the web. Using a novel questionnaire informed by theoretical work on AI risk, we conduct a risk-oriented error analysis that could then inform the design of a future system to be deployed with a lower risk of harm and better performance.
2021
All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark | Tal August | Sofia Serrano | Nikita Haduong | Suchin Gururangan | Noah A. Smith
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Human evaluations are typically considered the gold standard in natural language generation, but as models’ fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts’ ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at the level of random chance. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators’ accuracy improved to as much as 55%, the improvement was not significant across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.
2016
IUCL at SemEval-2016 Task 6: An Ensemble Model for Stance Detection in Twitter
Can Liu | Wen Li | Bradford Demarest | Yue Chen | Sara Couture | Daniel Dakota | Nikita Haduong | Noah Kaufman | Andrew Lamont | Manan Pancholi | Kenneth Steimel | Sandra Kübler
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)