Workshop on Evaluation and Comparison of NLP Systems (2022)

A Japanese Corpus of Many Specialized Domains for Word Segmentation and Part-of-Speech Tagging
Shohei Higashiyama | Masao Ideuchi | Masao Utiyama | Yoshiaki Oida | Eiichiro Sumita

Assessing Resource-Performance Trade-off of Natural Language Models using Data Envelopment Analysis
Zachary Zhou | Alisha Zachariah | Devin Conathan | Jeffery Kline

From COMET to COMES – Can Summary Evaluation Benefit from Translation Evaluation?
Mateusz Krubiński | Pavel Pecina

Better Smatch = Better Parser? AMR evaluation is not so simple anymore
Juri Opitz | Anette Frank

GLARE: Generative Left-to-right AdversaRial Examples
Ryan Andrew Chi | Nathan Kim | Patrick Liu | Zander Lack | Ethan A Chi

Random Text Perturbations Work, but not Always
Zhengxiang Wang

A Comparative Analysis of Stance Detection Approaches and Datasets
Parush Gera | Tempestt Neal

Why is sentence similarity benchmark not predictive of application-oriented task performance?
Kaori Abe | Sho Yokoi | Tomoyuki Kajiwara | Kentaro Inui

Evaluating the role of non-lexical markers in GPT-2’s language modeling behavior
Roberta Rocca | Alejandro de la Vega

Assessing Neural Referential Form Selectors on a Realistic Multilingual Dataset
Guanyi Chen | Fahime Same | Kees Van Deemter