Programmable Annotation with Diversed Heuristics and Data Denoising

Ernie Chang, Alex Marin, Vera Demberg


Abstract
Neural natural language generation (NLG) and understanding (NLU) models are costly and require massive amounts of annotated data to be competitive. Recent data programming frameworks address this bottleneck by letting annotators provide supervision as a set of labeling functions, which are used to construct generative models that synthesize weak labels at scale. However, these labeling functions are difficult to build from scratch for NLG/NLU models, as they often require complex rule sets to be specified. To this end, we propose a novel data programming framework that jointly constructs labeled data for language generation and understanding tasks by allowing annotators to modify an automatically inferred alignment rule set between sequence labels and text, instead of writing rules from scratch. Further, to mitigate the effect of poor-quality labels, we propose a dually-regularized denoising mechanism for optimizing the NLU and NLG models. On two benchmarks, we show that the framework can generate high-quality data that comes within 1.48 BLEU and 6.42 slot F1 points of the 100% human-labeled data (42k instances) with just 100 labeled data samples, outperforming benchmark annotation frameworks and other semi-supervised approaches.
Anthology ID:
2022.coling-1.237
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Editors:
Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
2681–2691
URL:
https://aclanthology.org/2022.coling-1.237
Cite (ACL):
Ernie Chang, Alex Marin, and Vera Demberg. 2022. Programmable Annotation with Diversed Heuristics and Data Denoising. In Proceedings of the 29th International Conference on Computational Linguistics, pages 2681–2691, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
Programmable Annotation with Diversed Heuristics and Data Denoising (Chang et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.237.pdf