Findings of the Association for Computational Linguistics (2021)


Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli

Explainable Inference Over Grounding-Abstract Chains for Science Questions
Mokanarangan Thayaparan | Marco Valentino | André Freitas

LV-BERT: Exploiting Layer Variety for BERT
Weihao Yu | Zihang Jiang | Fei Chen | Qibin Hou | Jiashi Feng

Few-Shot Event Detection with Prototypical Amortized Conditional Random Field
Xin Cong | Shiyao Cui | Bowen Yu | Tingwen Liu | Wang Yubin | Bin Wang

LUX (Linguistic aspects Under eXamination): Discourse Analysis for Automatic Fake News Classification
Lucas Azevedo | Mathieu d’Aquin | Brian Davis | Manel Zarrouk

Diagnosing Transformers in Task-Oriented Semantic Parsing
Shrey Desai | Ahmed Aly

Semantic Relation-aware Difference Representation Learning for Change Captioning
Yunbin Tu | Tingting Yao | Liang Li | Jiedong Lou | Shengxiang Gao | Zhengtao Yu | Chenggang Yan

The Authors Matter: Understanding and Mitigating Implicit Bias in Deep Text Classification
Haochen Liu | Wei Jin | Hamid Karimi | Zitao Liu | Jiliang Tang

From What to Why: Improving Relation Extraction with Rationale Graph
Zhenyu Zhang | Bowen Yu | Xiaobo Shu | Xue Mengge | Tingwen Liu | Li Guo

More Parameters? No Thanks!
Zeeshan Khan | Kartheek Akella | Vinay Namboodiri | C V Jawahar

SyGNS: A Systematic Generalization Testbed Based on Natural Language Semantics
Hitomi Yanaka | Koji Mineshima | Kentaro Inui

Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade
Jiatao Gu | Xiang Kong

Generate, Prune, Select: A Pipeline for Counterspeech Generation against Online Hate Speech
Wanzheng Zhu | Suma Bhat

REPT: Bridging Language Models and Machine Reading Comprehension via Retrieval-Based Pre-training
Fangkai Jiao | Yangyang Guo | Yilin Niu | Feng Ji | Feng-Lin Li | Liqiang Nie

CasEE: A Joint Learning Framework with Cascade Decoding for Overlapping Event Extraction
Jiawei Sheng | Shu Guo | Bowen Yu | Qian Li | Yiming Hei | Lihong Wang | Tingwen Liu | Hongbo Xu

Discovering Topics in Long-tailed Corpora with Causal Intervention
Xiaobao Wu | Chunping Li | Yishu Miao

More than just Frequency? Demasking Unsupervised Hypernymy Prediction Methods
Thomas Bott | Dominik Schlechtweg | Sabine Schulte im Walde

WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections
Mingda Chen | Sam Wiseman | Kevin Gimpel

CoDesc: A Large Code–Description Parallel Dataset
Masum Hasan | Tanveer Muttaqueen | Abdullah Al Ishtiaq | Kazi Sajeed Mehrab | Md. Mahim Anjum Haque | Tahmid Hasan | Wasi Ahmad | Anindya Iqbal | Rifat Shahriyar

Deep Cognitive Reasoning Network for Multi-hop Question Answering over Knowledge Graphs
Jianyu Cai | Zhanqiu Zhang | Feng Wu | Jie Wang

GoG: Relation-aware Graph-over-Graph Network for Visual Dialog
Feilong Chen | Xiuyi Chen | Fandong Meng | Peng Li | Jie Zhou

Joint Optimization of Tokenization and Downstream Model
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki

How does Attention Affect the Model?
Cheng Zhang | Qiuchi Li | Lingyu Hua | Dawei Song

Contrastive Attention for Automatic Chest X-ray Report Generation
Fenglin Liu | Changchang Yin | Xian Wu | Shen Ge | Ping Zhang | Xu Sun

O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning
Fenglin Liu | Xuancheng Ren | Xian Wu | Bang Yang | Shen Ge | Xu Sun

Better Chinese Sentence Segmentation with Reinforcement Learning
Srivatsan Srinivasan | Chris Dyer

Enhancing Transformers with Gradient Boosted Decision Trees for NLI Fine-Tuning
Benjamin Minixhofer | Milan Gritta | Ignacio Iacobacci

Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling
Marcin Namysl | Sven Behnke | Joachim Köhler

Spatial Dependency Parsing for Semi-Structured Document Information Extraction
Wonseok Hwang | Jinyeong Yim | Seunghyun Park | Sohee Yang | Minjoon Seo

Reader-Guided Passage Reranking for Open-Domain Question Answering
Yuning Mao | Pengcheng He | Xiaodong Liu | Yelong Shen | Jianfeng Gao | Jiawei Han | Weizhu Chen

Entity-Aware Abstractive Multi-Document Summarization
Hao Zhou | Weidong Ren | Gongshen Liu | Bo Su | Wei Lu

LenAtten: An Effective Length Controlling Unit For Text Summarization
Zhongyi Yu | Zhenghao Wu | Hao Zheng | Zhe XuanYuan | Jefferson Fong | Weifeng Su

XeroAlign: Zero-shot cross-lingual transformer alignment
Milan Gritta | Ignacio Iacobacci

Using Word Embeddings to Analyze Teacher Evaluations: An Application to a Filipino Education Non-Profit Organization
Francesca Vera

Relation Classification with Entity Type Restriction
Shengfei Lyu | Huanhuan Chen

Link Prediction on N-ary Relational Facts: A Graph-based Approach
Quan Wang | Haifeng Wang | Yajuan Lyu | Yong Zhu

GLGE: A New General Language Generation Evaluation Benchmark
Dayiheng Liu | Yu Yan | Yeyun Gong | Weizhen Qi | Hang Zhang | Jian Jiao | Weizhu Chen | Jie Fu | Linjun Shou | Ming Gong | Pengcheng Wang | Jiusheng Chen | Daxin Jiang | Jiancheng Lv | Ruofei Zhang | Winnie Wu | Ming Zhou | Nan Duan

AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
Xinsong Zhang | Pengshuai Li | Hang Li

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation
Feilong Chen | Fandong Meng | Xiuyi Chen | Peng Li | Jie Zhou

Retrieve & Memorize: Dialog Policy Learning with Multi-Action Memory
YunHao Li | Yunyi Yang | Xiaojun Quan | Jianxing Yu

Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains
Yunzhi Yao | Shaohan Huang | Wenhui Wang | Li Dong | Furu Wei

Decoupling Adversarial Training for Fair NLP
Xudong Han | Timothy Baldwin | Trevor Cohn

GO FIGURE: A Meta Evaluation of Factuality in Summarization
Saadia Gabriel | Asli Celikyilmaz | Rahul Jha | Yejin Choi | Jianfeng Gao

DNN-driven Gradual Machine Learning for Aspect-term Sentiment Analysis
Murtadha Ahmed | Qun Chen | Yanyan Wang | Youcef Nafa | Zhanhuai Li | Tianyi Duan

Error Detection in Large-Scale Natural Language Understanding Systems Using Transformer Models
Rakesh Chada | Pradeep Natarajan | Darshan Fofadiya | Prathap Ramachandra

OutFlip: Generating Examples for Unknown Intent Detection with Natural Language Attack
DongHyun Choi | Myeong Cheol Shin | EungGyun Kim | Dong Ryeol Shin

GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning
Jiaqi Chen | Jianheng Tang | Jinghui Qin | Xiaodan Liang | Lingbo Liu | Eric Xing | Liang Lin

SIRE: Separate Intra- and Inter-sentential Reasoning for Document-level Relation Extraction
Shuang Zeng | Yuting Wu | Baobao Chang

KGPool: Dynamic Knowledge Graph Context Selection for Relation Extraction
Abhishek Nadgeri | Anson Bastos | Kuldeep Singh | Isaiah Onando Mulang’ | Johannes Hoffart | Saeedeh Shekarpour | Vijay Saraswat

Better Combine Them Together! Integrating Syntactic Constituency and Dependency Representations for Semantic Role Labeling
Hao Fei | Shengqiong Wu | Yafeng Ren | Fei Li | Donghong Ji

Keep the Primary, Rewrite the Secondary: A Two-Stage Approach for Paraphrase Generation
Yixuan Su | David Vandyke | Simon Baker | Yan Wang | Nigel Collier

Contrastive Fine-tuning Improves Robustness for Neural Rankers
Xiaofei Ma | Cicero Nogueira dos Santos | Andrew O. Arnold

Cross-Lingual Transfer in Zero-Shot Cross-Language Entity Linking
Elliot Schumacher | James Mayfield | Mark Dredze

TellMeWhy: A Dataset for Answering Why-Questions in Narratives
Yash Kumar Lal | Nathanael Chambers | Raymond Mooney | Niranjan Balasubramanian

Dialogue in the Wild: Learning from a Deployed Role-Playing Game with Humans and Bots
Kurt Shuster | Jack Urbanek | Emily Dinan | Arthur Szlam | Jason Weston

Deep Learning against COVID-19: Respiratory Insufficiency Detection in Brazilian Portuguese Speech
Edresson Casanova | Lucas Gris | Augusto Camargo | Daniel da Silva | Murilo Gazzola | Ester Sabino | Anna Levin | Arnaldo Candido Jr | Sandra Aluisio | Marcelo Finger

Benchmarking Robustness of Machine Reading Comprehension Models
Chenglei Si | Ziqing Yang | Yiming Cui | Wentao Ma | Ting Liu | Shijin Wang

Improving BERT with Syntax-aware Local Attention
Zhongli Li | Qingyu Zhou | Chao Li | Ke Xu | Yunbo Cao

A Dialogue-based Information Extraction System for Medical Insurance Assessment
Shuang Peng | Mengdi Zhou | Minghui Yang | Haitao Mi | Shaosheng Cao | Zujie Wen | Teng Xu | Hongbin Wang | Lei Liu

Prediction or Comparison: Toward Interpretable Qualitative Reasoning
Mucheng Ren | Heyan Huang | Yang Gao

Boundary Detection with BERT for Span-level Emotion Cause Analysis
Xiangju Li | Wei Gao | Shi Feng | Yifei Zhang | Daling Wang

On Commonsense Cues in BERT for Solving Commonsense Tasks
Leyang Cui | Sijie Cheng | Yu Wu | Yue Zhang

Weakly Supervised Pre-Training for Multi-Hop Retriever
Yeon Seonwoo | Sang-Woo Lee | Ji-Hoon Kim | Jung-Woo Ha | Alice Oh

Meet The Truth: Leverage Objective Facts and Subjective Views for Interpretable Rumor Detection
Jiawen Li | Shiwen Ni | Hung-Yu Kao

Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking
Heng-Da Xu | Zhongli Li | Qingyu Zhou | Chao Li | Zizhen Wang | Yunbo Cao | Heyan Huang | Xian-Ling Mao

TransSum: Translating Aspect and Sentiment Embeddings for Self-Supervised Opinion Summarization
Ke Wang | Xiaojun Wan

Hashing based Efficient Inference for Image-Text Matching
Rong-Cheng Tu | Lei Ji | Huaishao Luo | Botian Shi | Heyan Huang | Nan Duan | Xian-Ling Mao

Can the Transformer Learn Nested Recursion with Symbol Masking?
Jean-Philippe Bernardy | Adam Ek | Vladislav Maraev

Rationalization through Concepts
Diego Antognini | Boi Faltings

Parallel Attention Network with Sequence Matching for Video Grounding
Hao Zhang | Aixin Sun | Wei Jing | Liangli Zhen | Joey Tianyi Zhou | Siow Mong Rick Goh

MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training
Mingliang Zeng | Xu Tan | Rui Wang | Zeqian Ju | Tao Qin | Tie-Yan Liu

Evaluating the Efficacy of Summarization Evaluation across Languages
Fajri Koto | Jey Han Lau | Timothy Baldwin

CoMAE: A Multi-factor Hierarchical Framework for Empathetic Response Generation
Chujie Zheng | Yong Liu | Wei Chen | Yongcai Leng | Minlie Huang

UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction
Huanqin Wu | Wei Liu | Lei Li | Dan Nie | Tao Chen | Feng Zhang | Di Wang

As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages
Wietse de Vries | Malvina Nissim

Can Cognate Prediction Be Modelled as a Low-Resource Machine Translation Task?
Clémentine Fourrier | Rachel Bawden | Benoît Sagot

What if This Modified That? Syntactic Interventions with Counterfactual Embeddings
Mycal Tucker | Peng Qian | Roger Levy

Investigating Text Simplification Evaluation
Laura Vásquez-Rodríguez | Matthew Shardlow | Piotr Przybyła | Sophia Ananiadou

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences
Shikhar Singh | Nuan Wen | Yu Hou | Pegah Alipoormolabashi | Te-lin Wu | Xuezhe Ma | Nanyun Peng

Towards Knowledge-Grounded Counter Narrative Generation for Hate Speech
Yi-Ling Chung | Serra Sinem Tekiroğlu | Marco Guerini

SOLID: A Large-Scale Semi-Supervised Dataset for Offensive Language Identification
Sara Rosenthal | Pepa Atanasova | Georgi Karadzhov | Marcos Zampieri | Preslav Nakov

RealFormer: Transformer Likes Residual Attention
Ruining He | Anirudh Ravula | Bhargav Kanagal | Joshua Ainslie

Promoting Graph Awareness in Linearized Graph-to-Text Generation
Alexander Miserlis Hoyle | Ana Marasović | Noah A. Smith

Predicting cross-linguistic adjective order with information gain
William Dyer | Richard Futrell | Zoey Liu | Greg Scontras

A Survey of Data Augmentation Approaches for NLP
Steven Y. Feng | Varun Gangal | Jason Wei | Sarath Chandar | Soroush Vosoughi | Teruko Mitamura | Eduard Hovy

Why Machine Reading Comprehension Models Learn Shortcuts?
Yuxuan Lai | Chen Zhang | Yansong Feng | Quzhe Huang | Dongyan Zhao

Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation
Peerat Limkonchotiwat | Wannaphong Phatthiyaphaibun | Raheem Sarwar | Ekapol Chuangsuwanich | Sarana Nutanong

Sensei: Self-Supervised Sensor Name Segmentation
Jiaman Wu | Dezhi Hong | Rajesh Gupta | Jingbo Shang

Frustratingly Simple Few-Shot Slot Tagging
Jianqiang Ma | Zeyu Yan | Chang Li | Yang Zhang

Medical Code Assignment with Gated Convolution and Note-Code Interaction
Shaoxiong Ji | Shirui Pan | Pekka Marttinen

Dynamic Semantic Graph Construction and Reasoning for Explainable Multi-hop Science Question Answering
Weiwen Xu | Huihui Zhang | Deng Cai | Wai Lam

Addressing Inquiries about History: An Efficient and Practical Framework for Evaluating Open-domain Chatbot Consistency
Zekang Li | Jinchao Zhang | Zhengcong Fei | Yang Feng | Jie Zhou

Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation
Shun-Po Chuang | Yung-Sung Chuang | Chih-Chiang Chang | Hung-yi Lee

Code Summarization with Structure-induced Transformer
Hongqiu Wu | Hai Zhao | Min Zhang

Scheduled Dialog Policy Learning: An Automatic Curriculum Learning Framework for Task-oriented Dialog System
Sihong Liu | Jinchao Zhang | Keqing He | Weiran Xu | Jie Zhou

Do Explanations Help Users Detect Errors in Open-Domain QA? An Evaluation of Spoken vs. Visual Explanations
Ana Valeria González | Gagan Bansal | Angela Fan | Yashar Mehdad | Robin Jia | Srinivasan Iyer

OntoEA: Ontology-guided Entity Alignment via Joint Knowledge Graph Embedding
Yuejia Xiang | Ziheng Zhang | Jiaoyan Chen | Xi Chen | Zhenxi Lin | Yefeng Zheng

Learning Algebraic Recombination for Compositional Generalization
Chenyao Liu | Shengnan An | Zeqi Lin | Qian Liu | Bei Chen | Jian-Guang Lou | Lijie Wen | Nanning Zheng | Dongmei Zhang

Out of Order: How important is the sequential order of words in a sentence in Natural Language Understanding tasks?
Thang Pham | Trung Bui | Long Mai | Anh Nguyen

RevCore: Review-Augmented Conversational Recommendation
Yu Lu | Junwei Bao | Yan Song | Zichen Ma | Shuguang Cui | Youzheng Wu | Xiaodong He

Awakening Latent Grounding from Pretrained Language Models for Semantic Parsing
Qian Liu | Dejian Yang | Jiahui Zhang | Jiaqi Guo | Bin Zhou | Jian-Guang Lou

Enhancing Label Correlation Feedback in Multi-Label Text Classification via Multi-Task Learning
Ximing Zhang | Qian-Wen Zhang | Zhao Yan | Ruifang Liu | Yunbo Cao

Fusing Context Into Knowledge Graph for Commonsense Question Answering
Yichong Xu | Chenguang Zhu | Ruochen Xu | Yang Liu | Michael Zeng | Xuedong Huang

Unsupervised Energy-based Adversarial Domain Adaptation for Cross-domain Text Classification
Han Zou | Jianfei Yang | Xiaojian Wu

Survival text regression for time-to-event prediction in conversations
Christine De Kock | Andreas Vlachos

Unsupervised Knowledge Selection for Dialogue Generation
Xiuyi Chen | Feilong Chen | Fandong Meng | Peng Li | Jie Zhou

Minimax and Neyman–Pearson Meta-Learning for Outlier Languages
Edoardo Maria Ponti | Rahul Aralikatte | Disha Shrivastava | Siva Reddy | Anders Søgaard

On-the-Fly Attention Modulation for Neural Generation
Yue Dong | Chandra Bhagavatula | Ximing Lu | Jena D. Hwang | Antoine Bosselut | Jackie Chi Kit Cheung | Yejin Choi

Grammar-Constrained Neural Semantic Parsing with LR Parsers
Artur Baranowski | Nico Hochgeschwender

Enhanced Metaphor Detection via Incorporation of External Knowledge Based on Linguistic Theories
Chang Su | Kechun Wu | Yijiang Chen

Controlling Text Edition by Changing Answers of Specific Questions
Lei Sha | Patrick Hohenecker | Thomas Lukasiewicz

Grammar-Based Patches Generation for Automated Program Repair
Yu Tang | Long Zhou | Ambrosio Blanco | Shujie Liu | Furu Wei | Ming Zhou | Muyun Yang

Manual Evaluation Matters: Reviewing Test Protocols of Distantly Supervised Relation Extraction
Tianyu Gao | Xu Han | Yuzhuo Bai | Keyue Qiu | Zhiyu Xie | Yankai Lin | Zhiyuan Liu | Peng Li | Maosong Sun | Jie Zhou

GCRC: A New Challenging MRC Dataset from Gaokao Chinese for Explainable Evaluation
Hongye Tan | Xiaoyue Wang | Yu Ji | Ru Li | Xiaoli Li | Zhiwei Hu | Yunxiao Zhao | Xiaoqi Han

Zero-shot Label-Aware Event Trigger and Argument Classification
Hongming Zhang | Haoyu Wang | Dan Roth

Incorporating Global Information in Local Attention for Knowledge Representation Learning
Yu Zhao | Han Zhou | Ruobing Xie | Fuzhen Zhuang | Qing Li | Ji Liu

Exploiting Position Bias for Robust Aspect Sentiment Classification
Fang Ma | Chen Zhang | Dawei Song

MRN: A Locally and Globally Mention-Based Reasoning Network for Document-Level Relation Extraction
Jingye Li | Kang Xu | Fei Li | Hao Fei | Yafeng Ren | Donghong Ji

Adversary-Aware Rumor Detection
Yun-Zhu Song | Yi-Syuan Chen | Yi-Ting Chang | Shao-Yu Weng | Hong-Han Shuai

LICHEE: Improving Language Model Pre-training with Multi-grained Tokenization
Weidong Guo | Mingjun Zhao | Lusheng Zhang | Di Niu | Jinwen Luo | Zhenhua Liu | Zhenyang Li | Jianbo Tang

Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou | Graham Neubig | Jiatao Gu | Mona Diab | Francisco Guzmán | Luke Zettlemoyer | Marjan Ghazvininejad

K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters
Ruize Wang | Duyu Tang | Nan Duan | Zhongyu Wei | Xuanjing Huang | Jianshu Ji | Guihong Cao | Daxin Jiang | Ming Zhou

Global Attention Decoder for Chinese Spelling Error Correction
Zhao Guo | Yuan Ni | Keqiang Wang | Wei Zhu | Guotong Xie

Jointly Identifying Rhetoric and Implicit Emotions via Multi-Task Learning
Xin Chen | Zhen Hai | Deyu Li | Suge Wang | Dian Wang

Exploring the Role of Context in Utterance-level Emotion, Act and Intent Classification in Conversations: An Empirical Study
Deepanway Ghosal | Navonil Majumder | Rada Mihalcea | Soujanya Poria

Encouraging Neural Machine Translation to Satisfy Terminology Constraints
Melissa Ailem | Jingshu Liu | Raheel Qader

BertGCN: Transductive Text Classification by Combining GNN and BERT
Yuxiao Lin | Yuxian Meng | Xiaofei Sun | Qinghong Han | Kun Kuang | Jiwei Li | Fei Wu

Putting words into the system’s mouth: A targeted attack on neural machine translation using monolingual data poisoning
Jun Wang | Chang Xu | Francisco Guzmán | Ahmed El-Kishky | Yuqing Tang | Benjamin Rubinstein | Trevor Cohn

Semantic and Syntactic Enhanced Aspect Sentiment Triplet Extraction
Zhexue Chen | Hong Huang | Bang Liu | Xuanhua Shi | Hai Jin

UserAdapter: Few-Shot User Learning in Sentiment Analysis
Wanjun Zhong | Duyu Tang | Jiahai Wang | Jian Yin | Nan Duan

PsyQA: A Chinese Dataset for Generating Long Counseling Text for Mental Health Support
Hao Sun | Zhenru Lin | Chujie Zheng | Siyang Liu | Minlie Huang

RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge
Bill Yuchen Lin | Ziyi Wu | Yichi Yang | Dong-Ho Lee | Xiang Ren

Learning to Generate Questions by Learning to Recover Answer-containing Sentences
Seohyun Back | Akhil Kedia | Sai Chetan Chinthakindi | Haejun Lee | Jaegul Choo

Learning Slice-Aware Representations with Mixture of Attentions
Cheng Wang | Sungjin Lee | Sunghyun Park | Han Li | Young-Bum Kim | Ruhi Sarikaya

Making Better Use of Bilingual Information for Cross-Lingual AMR Parsing
Yitao Cai | Zhe Lin | Xiaojun Wan

Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach
Zhe Lin | Xiaojun Wan

Few-shot Knowledge Graph-to-Text Generation with Pretrained Language Models
Junyi Li | Tianyi Tang | Wayne Xin Zhao | Zhicheng Wei | Nicholas Jing Yuan | Ji-Rong Wen

Better Robustness by More Coverage: Adversarial and Mixup Data Augmentation for Robust Finetuning
Chenglei Si | Zhengyan Zhang | Fanchao Qi | Zhiyuan Liu | Yasheng Wang | Qun Liu | Maosong Sun

NAST: A Non-Autoregressive Generator with Word Alignment for Unsupervised Text Style Transfer
Fei Huang | Zikai Chen | Chen Henry Wu | Qihan Guo | Xiaoyan Zhu | Minlie Huang

HyKnow: End-to-End Task-Oriented Dialog Modeling with Hybrid Knowledge Management
Silin Gao | Ryuichi Takanobu | Wei Peng | Qun Liu | Minlie Huang

Target-oriented Fine-tuning for Zero-Resource Named Entity Recognition
Ying Zhang | Fandong Meng | Yufeng Chen | Jinan Xu | Jie Zhou

BERT-Defense: A Probabilistic Model Based on BERT to Combat Cognitively Inspired Orthographic Adversarial Attacks
Yannik Keller | Jan Mackensen | Steffen Eger

Event Detection as Graph Parsing
Jianye Xie | Haotong Sun | Junsheng Zhou | Weiguang Qu | Xinyu Dai

Toward Fully Exploiting Heterogeneous Corpus: A Decoupled Named Entity Recognition Model with Two-stage Training
Yun Hu | Yeshuang Zhu | Jinchao Zhang | Changwen Zheng | Jie Zhou

Discriminative Reasoning for Document-level Relation Extraction
Wang Xu | Kehai Chen | Tiejun Zhao

Meta-Learning Adversarial Domain Adaptation Network for Few-Shot Text Classification
Chengcheng Han | Zeqiu Fan | Dongxiang Zhang | Minghui Qiu | Ming Gao | Aoying Zhou

Documents Representation via Generalized Coupled Tensor Chain with the Rotation Group constraint
Igor Vorona | Anh-Huy Phan | Alexander Panchenko | Andrzej Cichocki

Improving Unsupervised Extractive Summarization with Facet-Aware Modeling
Xinnian Liang | Shuangzhi Wu | Mu Li | Zhoujun Li

Improving Gradient-based Adversarial Training for Text Classification by Contrastive Learning and Auto-Encoder
Yao Qiu | Jinchao Zhang | Jie Zhou

Multi-Granularity Contrasting for Cross-Lingual Pre-Training
Shicheng Li | Pengcheng Yang | Fuli Luo | Jun Xie

A Comparison between Pre-training and Large-scale Back-translation for Neural Machine Translation
Dandan Huang | Kun Wang | Yue Zhang

Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene
Ruikun Luo | Guanhuan Huang | Xiaojun Quan

Fusing Label Embedding into BERT: An Efficient Improvement for Text Classification
Yijin Xiong | Yukun Feng | Hao Wu | Hidetaka Kamigaito | Manabu Okumura

KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization and Completion
Jie Zhou | Shengding Hu | Xin Lv | Cheng Yang | Zhiyuan Liu | Wei Xu | Jie Jiang | Juanzi Li | Maosong Sun

A Query-Driven Topic Model
Zheng Fang | Yulan He | Rob Procter

How Reliable are Model Diagnostics?
Vamsi Aribandi | Yi Tay | Donald Metzler

Gaussian Process based Deep Dyna-Q approach for Dialogue Policy Learning
Guanlin Wu | Wenqi Fang | Ji Wang | Jiang Cao | Weidong Bao | Yang Ping | Xiaomin Zhu | Zheng Wang

CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding
Dustin Wright | Isabelle Augenstein

Cross-Lingual Cross-Domain Nested Named Entity Evaluation on English Web Texts
Barbara Plank

Counter-Argument Generation by Attacking Weak Premises
Milad Alshomary | Shahbaz Syed | Arkajit Dhar | Martin Potthast | Henning Wachsmuth

Alternated Training with Synthetic and Authentic Data for Neural Machine Translation
Rui Jiao | Zonghan Yang | Maosong Sun | Yang Liu

Template-Based Named Entity Recognition Using BART
Leyang Cui | Yu Wu | Jian Liu | Sen Yang | Yue Zhang

“Does it Matter When I Think You Are Lying?” Improving Deception Detection by Integrating Interlocutor’s Judgements in Conversations
Huang-Cheng Chou | Woan-Shiuan Chien | Da-Cheng Juan | Chi-Chun Lee

High-Quality Dialogue Diversification by Intermittent Short Extension Ensembles
Zhiwen Tang | Hrishikesh Kulkarni | Grace Hui Yang

Structured Refinement for Sequential Labeling
Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto | Taro Watanabe

End-to-End Construction of NLP Knowledge Graph
Ishani Mondal | Yufang Hou | Charles Jochim

Deciphering Implicit Hate: Evaluating Automated Detection Algorithms for Multimodal Hate
Austin Botelho | Scott Hale | Bertie Vidgen

Studying the Evolution of Scientific Topics and their Relationships
Ana Sabina Uban | Cornelia Caragea | Liviu P. Dinu

End-to-End Self-Debiasing Framework for Robust NLU Training
Abbas Ghaddar | Phillippe Langlais | Mehdi Rezagholizadeh | Ahmad Rashid

A Mixed-Method Design Approach for Empirically Based Selection of Unbiased Data Annotators
Gautam Thakur | Janna Caspersen | Drahomira Herrmannova | Bryan Eaton | Jordan Burdette

An Evaluation of Disentangled Representation Learning for Texts
Krishnapriya Vishnubhotla | Graeme Hirst | Frank Rudzicz

Injecting Knowledge Base Information into End-to-End Joint Entity and Relation Extraction and Coreference Resolution
Severine Verlinden | Klim Zaporojets | Johannes Deleu | Thomas Demeester | Chris Develder

Knowing More About Questions Can Help: Improving Calibration in Question Answering
Shujian Zhang | Chengyue Gong | Eunsol Choi

Enhancing Metaphor Detection by Gloss-based Interpretations
Hai Wan | Jinxia Lin | Jianfeng Du | Dawei Shen | Manrong Zhang

Evaluating Word Embeddings with Categorical Modularity
Sílvia Casacuberta | Karina Halevy | Damián Blasi

Attention-based Contextual Language Model Adaptation for Speech Recognition
Richard Diehl Martinez | Scott Novotney | Ivan Bulyko | Ariya Rastrow | Andreas Stolcke | Ankur Gandhe

Annotation and Evaluation of Coreference Resolution in Screenplays
Sabyasachee Baruah | Sandeep Nallan Chakravarthula | Shrikanth Narayanan

Exploring Cross-Lingual Transfer Learning with Unsupervised Machine Translation
Chao Wang | Judith Gaspers | Thi Ngoc Quynh Do | Hui Jiang

Pipeline Signed Japanese Translation Focusing on a Post-positional Particle Complement and Conjugation in a Low-resource Setting
Ken Yano | Akira Utsumi

Language-Mediated, Object-Centric Representation Learning
Ruocheng Wang | Jiayuan Mao | Samuel Gershman | Jiajun Wu

Entheos: A Multimodal Dataset for Studying Enthusiasm
Carla Viegas | Malihe Alikhani

Are Rotten Apples Edible? Challenging Commonsense Inference Ability with Exceptions
Nam Do | Ellie Pavlick

GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning
Zilong Zheng | Shuwen Qiu | Lifeng Fan | Yixin Zhu | Song-Chun Zhu

RetroGAN: A Cyclic Post-Specialization System for Improving Out-of-Knowledge and Rare Word Representations
Pedro Colon-Hernandez | Yida Xin | Henry Lieberman | Catherine Havasi | Cynthia Breazeal | Peter Chin

Fusion: Towards Automated ICD Coding via Feature Compression
Junyu Luo | Cao Xiao | Lucas Glass | Jimeng Sun | Fenglong Ma

Automatic Document Sketching: Generating Drafts from Analogous Texts
Zeqiu Wu | Michel Galley | Chris Brockett | Yizhe Zhang | Bill Dolan

Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading
Zhihan Zhou | Liqian Ma | Han Liu

Language-based General Action Template for Reinforcement Learning Agents
Ryosuke Kohita | Akifumi Wachi | Daiki Kimura | Subhajit Chaudhury | Michiaki Tatsubori | Asim Munawar

MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers
Wenhui Wang | Hangbo Bao | Shaohan Huang | Li Dong | Furu Wei

Attending via both Fine-tuning and Compressing
Jie Zhou | Yuanbin Wu | Qin Chen | Xuanjing Huang | Liang He

Improving Event Causality Identification via Self-Supervised Representation Learning on External Causal Statement
Xinyu Zuo | Pengfei Cao | Yubo Chen | Kang Liu | Jun Zhao | Weihua Peng | Yuguang Chen

PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval
Ruiyang Ren | Shangwen Lv | Yingqi Qu | Jing Liu | Wayne Xin Zhao | QiaoQiao She | Hua Wu | Haifeng Wang | Ji-Rong Wen

Is Human Scoring the Best Criteria for Summary Evaluation?
Oleg Vasilyev | John Bohannon

Assessing Dialogue Systems with Distribution Distances
Jiannan Xiang | Yahui Liu | Deng Cai | Huayang Li | Defu Lian | Lemao Liu

Neural Combinatory Constituency Parsing
Zhousi Chen | Longtu Zhang | Aizhan Imankulova | Mamoru Komachi

Learning Shared Semantic Space for Speech-to-Text Translation
Chi Han | Mingxuan Wang | Heng Ji | Lei Li

Empowering Language Understanding with Counterfactual Reasoning
Fuli Feng | Jizhi Zhang | Xiangnan He | Hanwang Zhang | Tat-Seng Chua

Knowledge-Empowered Representation Learning for Chinese Medical Reading Comprehension: Task, Model and Resources
Taolin Zhang | Chengyu Wang | Minghui Qiu | Bite Yang | Zerui Cai | Xiaofeng He | Jun Huang

Correcting Chinese Spelling Errors with Phonetic Pre-training
Ruiqing Zhang | Chao Pang | Chuanqiang Zhang | Shuohuan Wang | Zhongjun He | Yu Sun | Hua Wu | Haifeng Wang

Multi-Lingual Question Generation with Language Agnostic Language Model
Bingning Wang | Ting Yao | Weipeng Chen | Jingfang Xu | Xiaochuan Wang

Structure-Aware Pre-Training for Table-to-Text Generation
Xinyu Xing | Xiaojun Wan

On the Interplay Between Fine-tuning and Composition in Transformers
Lang Yu | Allyson Ettinger

Lifelong Learning of Topics and Domain-Specific Word Embeddings
Xiaorui Qin | Yuyin Lu | Yufu Chen | Yanghui Rao

Leveraging Argumentation Knowledge Graph for Interactive Argument Pair Identification
Jian Yuan | Zhongyu Wei | Donghua Zhao | Qi Zhang | Changjian Jiang

A Multi-Task Learning Framework for Multi-Target Stance Detection
Yingjie Li | Cornelia Caragea

Confidence-Aware Scheduled Sampling for Neural Machine Translation
Yijin Liu | Fandong Meng | Yufeng Chen | Jinan Xu | Jie Zhou

MA-BERT: Learning Representation by Incorporating Multi-Attribute Knowledge in Transformers
You Zhang | Jin Wang | Liang-Chih Yu | Xuejie Zhang

A Closer Look into the Robustness of Neural Dependency Parsers Using Better Adversarial Examples
Yuxuan Wang | Wanxiang Che | Ivan Titov | Shay B. Cohen | Zhilin Lei | Ting Liu

P-Stance: A Large Dataset for Stance Detection in Political Domain
Yingjie Li | Tiberiu Sosea | Aditya Sawant | Ajith Jayaraman Nair | Diana Inkpen | Cornelia Caragea

WIND: Weighting Instances Differentially for Model-Agnostic Domain Adaptation
Xiang Chen | Yue Cao | Xiaojun Wan

DocOIE: A Document-level Context-Aware Dataset for OpenIE
Kuicai Dong | Zhao Yilin | Aixin Sun | Jung-Jae Kim | Xiaoli Li

Event Extraction from Historical Texts: A New Dataset for Black Rebellions
Viet Dac Lai | Minh Van Nguyen | Heidi Kaufman | Thien Huu Nguyen

Zero-shot Medical Entity Retrieval without Annotation: Learning From Rich Knowledge Graph Semantics
Luyang Kong | Christopher Winestock | Parminder Bhatia

pdf bib
CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection
Henry Weld | Guanghao Huang | Jean Lee | Tongshu Zhang | Kunze Wang | Xinghong Guo | Siqu Long | Josiah Poon | Caren Han

pdf bib
Adaptive Knowledge-Enhanced Bayesian Meta-Learning for Few-shot Event Detection
Shirong Shen | Tongtong Wu | Guilin Qi | Yuan-Fang Li | Gholamreza Haffari | Sheng Bi

pdf bib
Stylized Story Generation with Style-Guided Planning
Xiangzhe Kong | Jialiang Huang | Ziquan Tung | Jian Guan | Minlie Huang

pdf bib
Dynamic Connected Networks for Chinese Spelling Check
Baoxin Wang | Wanxiang Che | Dayong Wu | Shijin Wang | Guoping Hu | Ting Liu

pdf bib
A Multi-Level Attention Model for Evidence-Based Fact Checking
Canasai Kruengkrai | Junichi Yamagishi | Xin Wang

pdf bib
RealTranS: End-to-End Simultaneous Speech Translation with Convolutional Weighted-Shrinking Transformer
Xingshan Zeng | Liangyou Li | Qun Liu

pdf bib
Training ELECTRA Augmented with Multi-word Selection
Jiaming Shen | Jialu Liu | Tianqi Liu | Cong Yu | Jiawei Han

pdf bib
REAM: An Enhancement Approach to Reference-based Evaluation Metrics for Open-domain Dialog Generation
Jun Gao | Wei Bi | Ruifeng Xu | Shuming Shi

pdf bib
Relation Extraction with Type-aware Map Memories of Word Dependencies
Guimin Chen | Yuanhe Tian | Yan Song | Xiang Wan

pdf bib
PLATO-2: Towards Building an Open-Domain Chatbot via Curriculum Learning
Siqi Bao | Huang He | Fan Wang | Hua Wu | Haifeng Wang | Wenquan Wu | Zhen Guo | Zhibin Liu | Xinchao Xu

pdf bib
JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs
Pei Ke | Haozhe Ji | Yu Ran | Xin Cui | Liwei Wang | Linfeng Song | Xiaoyan Zhu | Minlie Huang

pdf bib
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation
Wuwei Huang | Dexin Wang | Deyi Xiong

pdf bib
OKGIT: Open Knowledge Graph Link Prediction with Implicit Types
Chandrahas | Partha Talukdar

pdf bib
Multimodal Fusion with Co-Attention Networks for Fake News Detection
Yang Wu | Pengwei Zhan | Yunjian Zhang | Liming Wang | Zhen Xu

pdf bib
Joint Multi-Decoder Framework with Hierarchical Pointer Network for Frame Semantic Parsing
Xudong Chen | Ce Zheng | Baobao Chang

pdf bib
H-FND: Hierarchical False-Negative Denoising for Distant Supervision Relation Extraction
Jhih-wei Chen | Tsu-Jui Fu | Chen-Kang Lee | Wei-Yun Ma

pdf bib
GEM: A General Evaluation Benchmark for Multimodal Tasks
Lin Su | Nan Duan | Edward Cui | Lei Ji | Chenfei Wu | Huaishao Luo | Yongfei Liu | Ming Zhong | Taroon Bharti | Arun Sacheti

pdf bib
Graph Relational Topic Model with Higher-order Graph Attention Auto-encoders
Qianqian Xie | Jimin Huang | Pan Du | Min Peng

pdf bib
Paths to Relation Extraction through Semantic Structure
Jonathan Yellin | Omri Abend

pdf bib
Dynamic and Multi-Channel Graph Convolutional Networks for Aspect-Based Sentiment Analysis
Shiguan Pang | Yun Xue | Zehao Yan | Weihao Huang | Jinhui Feng

pdf bib
Automatic Text Simplification for Social Good: Progress and Challenges
Sanja Stajner

pdf bib
A Neural Edge-Editing Approach for Document-Level Relation Graph Extraction
Kohei Makino | Makoto Miwa | Yutaka Sasaki

pdf bib
Dialogue-oriented Pre-training
Yi Xu | Hai Zhao

pdf bib
GrantRel: Grant Information Extraction via Joint Entity and Relation Extraction
Junyi Bian | Li Huang | Xiaodi Huang | Hong Zhou | Shanfeng Zhu

pdf bib
Enhancing Language Generation with Effective Checkpoints of Pre-trained Language Model
Jeonghyeok Park | Hai Zhao

pdf bib
Making Flexible Use of Subtasks: A Multiplex Interaction Network for Unified Aspect-based Sentiment Analysis
Guoxin Yu | Xiang Ao | Ling Luo | Min Yang | Xiaofei Sun | Jiwei Li | Qing He

pdf bib
Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation
Zihan Liu | Genta Indra Winata | Pascale Fung

pdf bib
Transformer-Exclusive Cross-Modal Representation for Vision and Language
Andrew Shin | Takuya Narihira

pdf bib
Two Parents, One Child: Dual Transfer for Low-Resource Neural Machine Translation
Meng Zhang | Liangyou Li | Qun Liu

pdf bib
Contrastive Aligned Joint Learning for Multilingual Summarization
Danqing Wang | Jiaze Chen | Hao Zhou | Xipeng Qiu | Lei Li

pdf bib
When Time Makes Sense: A Historically-Aware Approach to Targeted Sense Disambiguation
Kaspar Beelen | Federico Nanni | Mariona Coll Ardanuy | Kasra Hosseini | Giorgia Tolfo | Barbara McGillivray

pdf bib
Understanding Feature Focus in Multitask Settings for Lexico-semantic Relation Identification
Houssam Akhmouch | Gaël Dias | Jose G. Moreno

pdf bib
Don’t Miss the Labels: Label-semantic Augmented Meta-Learner for Few-Shot Text Classification
Qiaoyang Luo | Lingqiao Liu | Yuhao Lin | Wei Zhang

pdf bib
Detecting Harmful Memes and Their Targets
Shraman Pramanick | Dimitar Dimitrov | Rituparna Mukherjee | Shivam Sharma | Md. Shad Akhtar | Preslav Nakov | Tanmoy Chakraborty

pdf bib
Progressive Multi-Granularity Training for Non-Autoregressive Translation
Liang Ding | Longyue Wang | Xuebo Liu | Derek F. Wong | Dacheng Tao | Zhaopeng Tu

pdf bib
ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation
Kaushal Kumar Maurya | Maunendra Sankar Desarkar | Yoshinobu Kano | Kumari Deepshikha

pdf bib
HacRED: A Large-Scale Relation Extraction Dataset Toward Hard Cases in Practical Applications
Qiao Cheng | Juntao Liu | Xiaoye Qu | Jin Zhao | Jiaqing Liang | Zhefeng Wang | Baoxing Huai | Nicholas Jing Yuan | Yanghua Xiao

pdf bib
Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads?
Zae Myung Kim | Laurent Besacier | Vassilina Nikoulina | Didier Schwab

pdf bib
Learning Sequential and Structural Information for Source Code Summarization
YunSeok Choi | JinYeong Bak | CheolWon Na | Jee-Hyong Lee

pdf bib
Energy-based Unknown Intent Detection with Data Manipulation
Yawen Ouyang | Jiasheng Ye | Yu Chen | Xinyu Dai | Shujian Huang | Jiajun Chen

pdf bib
Automatic Rephrasing of Transcripts-based Action Items
Amir Cohen | Amir Kantor | Sagi Hilleli | Eyal Kolman

pdf bib
MergeDistill: Merging Language Models using Pre-trained Distillation
Simran Khanuja | Melvin Johnson | Partha Talukdar

pdf bib
On Sparsifying Encoder Outputs in Sequence-to-Sequence Models
Biao Zhang | Ivan Titov | Rico Sennrich

pdf bib
FrameNet-assisted Noun Compound Interpretation
Girishkumar Ponkiya | Diptesh Kanojia | Pushpak Bhattacharyya | Girish Palshikar

pdf bib
Hypernym Discovery via a Recurrent Mapping Model
Yuhang Bai | Richong Zhang | Fanshuang Kong | Junfan Chen | Yongyi Mao

pdf bib
Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT
Won Ik Cho | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang

pdf bib
On the Interaction of Belief Bias and Explanations
Ana Valeria González | Anna Rogers | Anders Søgaard

pdf bib
Combining Static Word Embeddings and Contextual Representations for Bilingual Lexicon Induction
Jinpeng Zhang | Baijun Ji | Nini Xiao | Xiangyu Duan | Min Zhang | Yangbin Shi | Weihua Luo

pdf bib
Exploring Unsupervised Pretraining Objectives for Machine Translation
Christos Baziotis | Ivan Titov | Alexandra Birch | Barry Haddow

pdf bib
Knowledge-Grounded Dialogue Generation with Term-level De-noising
Wen Zheng | Natasa Milic-Frayling | Ke Zhou

pdf bib
Inspecting the concept knowledge graph encoded by modern language models
Carlos Aspillaga | Marcelo Mendoza | Alvaro Soto

pdf bib
Language Tags Matter for Zero-Shot Neural Machine Translation
Liwei Wu | Shanbo Cheng | Mingxuan Wang | Lei Li

pdf bib
Latent Reasoning for Low-Resource Question Generation
Xinting Huang | Jianzhong Qi | Yu Sun | Rui Zhang

pdf bib
Probing Pre-Trained Language Models for Disease Knowledge
Israa Alghanmi | Luis Espinosa Anke | Steven Schockaert

pdf bib
AugVic: Exploiting BiText Vicinity for Low-Resource NMT
Tasnim Mohiuddin | M Saiful Bari | Shafiq Joty

pdf bib
Provably Secure Generative Linguistic Steganography
Siyu Zhang | Zhongliang Yang | Jinshuai Yang | Yongfeng Huang

pdf bib
Retrieval Enhanced Model for Commonsense Generation
Han Wang | Yang Liu | Chenguang Zhu | Linjun Shou | Ming Gong | Yichong Xu | Michael Zeng

pdf bib
Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL
Zhi Chen | Lu Chen | Hanqi Li | Ruisheng Cao | Da Ma | Mengyue Wu | Kai Yu

pdf bib
Adjacency List Oriented Relational Fact Extraction via Adaptive Multi-task Learning
Fubang Zhao | Zhuoren Jiang | Yangyang Kang | Changlong Sun | Xiaozhong Liu

pdf bib
Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference
Dvir Ginzburg | Itzik Malkiel | Oren Barkan | Avi Caciularu | Noam Koenigstein

pdf bib
How Good Is NLP? A Sober Look at NLP Tasks through the Lens of Social Impact
Zhijing Jin | Geeticka Chauhan | Brian Tse | Mrinmaya Sachan | Rada Mihalcea

pdf bib
IgSEG: Image-guided Story Ending Generation
Qingbao Huang | Chuan Huang | Linzhang Mo | Jielong Wei | Yi Cai | Ho-fung Leung | Qing Li

pdf bib
Improve Query Focused Abstractive Summarization by Incorporating Answer Relevance
Dan Su | Tiezheng Yu | Pascale Fung

pdf bib
Learning a Reversible Embedding Mapping using Bi-Directional Manifold Alignment
Ashwinkumar Ganesan | Francis Ferraro | Tim Oates

pdf bib
Probabilistic Graph Reasoning for Natural Proof Generation
Changzhi Sun | Xinbo Zhang | Jiangjie Chen | Chun Gan | Yuanbin Wu | Jiaze Chen | Hao Zhou | Lei Li

pdf bib
Enhancing Zero-shot and Few-shot Stance Detection with Commonsense Knowledge Graph
Rui Liu | Zheng Lin | Yutong Tan | Weiping Wang

pdf bib
Dialogue Graph Modeling for Conversational Machine Reading
Siru Ouyang | Zhuosheng Zhang | Hai Zhao

pdf bib
IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words
Haryo Akbarianto Wibowo | Made Nindyatama Nityasya | Afra Feyza Akyürek | Suci Fitriany | Alham Fikri Aji | Radityo Eko Prasojo | Derry Tanti Wijaya

pdf bib
Manifold Adversarial Augmentation for Neural Machine Translation
Guandan Chen | Kai Fan | Kaibo Zhang | Boxing Chen | Zhongqiang Huang

pdf bib
Learning to Bridge Metric Spaces: Few-shot Joint Learning of Intent Detection and Slot Filling
Yutai Hou | Yongkui Lai | Cheng Chen | Wanxiang Che | Ting Liu

pdf bib
Insertion-based Tree Decoding
Denis Lukovnikov | Asja Fischer

pdf bib
Is the Lottery Fair? Evaluating Winning Tickets Across Demographics
Victor Petrén Bach Hansen | Anders Søgaard

pdf bib
SSMix: Saliency-Based Span Mixup for Text Classification
Soyoung Yoon | Gyuwan Kim | Kyumin Park

pdf bib
Detecting Bot-Generated Text by Characterizing Linguistic Accommodation in Human-Bot Interactions
Paras Bhatt | Anthony Rios

pdf bib
Defending Pre-trained Language Models from Adversarial Word Substitution Without Performance Sacrifice
Rongzhou Bao | Jiayi Wang | Hai Zhao

pdf bib
BERT-Proof Syntactic Structures: Investigating Errors in Discontinuous Constituency Parsing
Maximin Coavoux

pdf bib
DoT: An efficient Double Transformer for NLP tasks with tables
Syrine Krichene | Thomas Müller | Julian Eisenschlos

pdf bib
Grammatical Error Correction as GAN-like Sequence Labeling
Kevin Parnow | Zuchao Li | Hai Zhao

pdf bib
Neural Entity Recognition with Gazetteer based Fusion
Qing Sun | Parminder Bhatia

pdf bib
Hyperbolic Temporal Knowledge Graph Embeddings with Relational and Time Curvatures
Sebastien Montella | Lina M. Rojas Barahona | Johannes Heinecke

pdf bib
Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering
Aditya Gupta | Jiacheng Xu | Shyam Upadhyay | Diyi Yang | Manaal Faruqui

pdf bib
Does Robustness Improve Fairness? Approaching Fairness with Word Substitution Robustness Methods for Text Classification
Yada Pruksachatkun | Satyapriya Krishna | Jwala Dhamala | Rahul Gupta | Kai-Wei Chang

pdf bib
A Joint Model for Structure-based News Genre Classification with Application to Text Summarization
Zeyu Dai | Ruihong Huang

pdf bib
Representing Syntax and Composition with Geometric Transformations
Lorenzo Bertolini | Julie Weeds | David Weir | Qiwei Peng

pdf bib
Figurative Language in Recognizing Textual Entailment
Tuhin Chakrabarty | Debanjan Ghosh | Adam Poliak | Smaranda Muresan

pdf bib
To Point or Not to Point: Understanding How Abstractive Summarizers Paraphrase Text
Matt Wilber | William Timkey | Marten van Schijndel

pdf bib
AgreeSum: Agreement-Oriented Multi-Document Summarization
Richard Yuanzhe Pang | Adam Lelkes | Vinh Tran | Cong Yu

pdf bib
BERT Busters: Outlier Dimensions that Disrupt Transformers
Olga Kovaleva | Saurabh Kulshreshtha | Anna Rogers | Anna Rumshisky

pdf bib
“We will Reduce Taxes” - Identifying Election Pledges with Language Models
Tommaso Fornaciari | Dirk Hovy | Elin Naurin | Julia Runeson | Robert Thomson | Pankaj Adhikari

pdf bib
WeaQA: Weak Supervision via Captions for Visual Question Answering
Pratyay Banerjee | Tejas Gokhale | Yezhou Yang | Chitta Baral

pdf bib
How well do you know your summarization datasets?
Priyam Tejaswin | Dhruv Naik | Pengfei Liu

pdf bib
Multilingual Translation from Denoising Pre-Training
Yuqing Tang | Chau Tran | Xian Li | Peng-Jen Chen | Naman Goyal | Vishrav Chaudhary | Jiatao Gu | Angela Fan

pdf bib
Annotations Matter: Leveraging Multi-task Learning to Parse UD and SUD
Zeeshan Ali Sayyed | Daniel Dakota

pdf bib
Generating Informative Conclusions for Argumentative Texts
Shahbaz Syed | Khalid Al Khatib | Milad Alshomary | Henning Wachsmuth | Martin Potthast

pdf bib
Substructure Substitution: Structured Data Augmentation for NLP
Haoyue Shi | Karen Livescu | Kevin Gimpel

pdf bib
Towards Protecting Vital Healthcare Programs by Extracting Actionable Knowledge from Policy
Vanessa Lopez | Nagesh Yadav | Gabriele Picco | Inge Vejsbjerg | Eoin Carrol | Seamus Brady | Marco Luca Sbodio | Lam Thanh Hoang | Miao Wei | John Segrave

pdf bib
Not Far Away, Not So Close: Sample Efficient Nearest Neighbour Data Augmentation via MiniMax
Ehsan Kamalloo | Mehdi Rezagholizadeh | Peyman Passban | Ali Ghodsi

pdf bib
It’s All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning
Alexey Tikhonov | Max Ryabinin

pdf bib
Biomedical Interpretable Entity Representations
Diego Garcia-Olano | Yasumasa Onoe | Ioana Baldini | Joydeep Ghosh | Byron Wallace | Kush Varshney

pdf bib
Learning Robust Latent Representations for Controllable Speech Synthesis
Shakti Kumar | Jithin Pradeep | Hussain Zaidi

pdf bib
How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation
Marco Gaido | Beatrice Savoldi | Luisa Bentivogli | Matteo Negri | Marco Turchi

pdf bib
On the Ethical Limits of Natural Language Processing on Legal Text
Dimitrios Tsarapatsanis | Nikolaos Aletras

pdf bib
An Exploratory Analysis of the Relation between Offensive Language and Mental Health
Ana-Maria Bucur | Marcos Zampieri | Liviu P. Dinu

pdf bib
Transforming Term Extraction: Transformer-Based Approaches to Multilingual Term Extraction Across Domains
Christian Lang | Lennart Wachowiak | Barbara Heinisch | Dagmar Gromann

pdf bib
ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language
Oyvind Tafjord | Bhavana Dalvi | Peter Clark

pdf bib
Probing Image-Language Transformers for Verb Understanding
Lisa Anne Hendricks | Aida Nematzadeh

pdf bib
Implications of Using Internet Sting Corpora to Approximate Underage Victims
Tatiana Ringenberg | Kathryn Seigfried-Spellar | Julia Rayz

pdf bib
Detecting Domain Polarity-Changes of Words in a Sentiment Lexicon
Shuai Wang | Guangyi Lv | Sahisnu Mazumder | Bing Liu

pdf bib
Analyzing Online Political Advertisements
Danae Sánchez Villegas | Saeid Mokaram | Nikolaos Aletras

pdf bib
Do Language Models Perform Generalizable Commonsense Inference?
Peifeng Wang | Filip Ilievski | Muhao Chen | Xiang Ren

pdf bib
Probing Multi-modal Machine Translation with Pre-trained Language Model
Kong Yawei | Kai Fan

pdf bib
The interplay between language similarity and script on a novel multi-layer Algerian dialect corpus
Samia Touileb | Jeremy Barnes

pdf bib
Few-Shot Upsampling for Protest Size Detection
Andrew Halterman | Benjamin J. Radford

pdf bib
Modeling the Unigram Distribution
Irene Nikkarinen | Tiago Pimentel | Damián Blasi | Ryan Cotterell

pdf bib
On the Lack of Robust Interpretability of Neural Text Classifiers
Muhammad Bilal Zafar | Michele Donini | Dylan Slack | Cedric Archambeau | Sanjiv Das | Krishnaram Kenthapadi

pdf bib
Multimodal Graph-based Transformer Framework for Biomedical Relation Extraction
Sriram Pingali | Shweta Yadav | Pratik Dutta | Sriparna Saha

pdf bib
Summary Grounded Conversation Generation
Chulaka Gunasekara | Guy Feigenblat | Benjamin Sznajder | Sachindra Joshi | David Konopnicki

pdf bib
A Non-Autoregressive Edit-Based Approach to Controllable Text Simplification
Sweta Agrawal | Weijia Xu | Marine Carpuat

pdf bib
Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference
Hai Hu | He Zhou | Zuoyu Tian | Yiwen Zhang | Yina Patterson | Yanting Li | Yixin Nie | Kyle Richardson

pdf bib
Using surprisal and fMRI to map the neural bases of broad and local contextual prediction during natural language comprehension
Shohini Bhattasali | Philip Resnik

pdf bib
Assessing the Syntactic Capabilities of Transformer-based Multilingual Language Models
Laura Pérez-Mayos | Alba Táboas García | Simon Mille | Leo Wanner

pdf bib
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
Ruiqi Zhong | Dhruba Ghosh | Dan Klein | Jacob Steinhardt

pdf bib
Named Entity Recognition through Deep Representation Learning and Weak Supervision
Jerrod Parker | Shi Yu

pdf bib
Explaining NLP Models via Minimal Contrastive Editing (MiCE)
Alexis Ross | Ana Marasović | Matthew Peters

pdf bib
Differential Privacy for Text Analytics via Natural Text Sanitization
Xiang Yue | Minxin Du | Tianhao Wang | Yaliang Li | Huan Sun | Sherman S. M. Chow

pdf bib
Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation
Prakhar Gupta | Yulia Tsvetkov | Jeffrey Bigham

pdf bib
Leveraging Abstract Meaning Representation for Knowledge Base Question Answering
Pavan Kapanipathi | Ibrahim Abdelaziz | Srinivas Ravishankar | Salim Roukos | Alexander Gray | Ramón Fernandez Astudillo | Maria Chang | Cristina Cornelio | Saswati Dana | Achille Fokoue | Dinesh Garg | Alfio Gliozzo | Sairam Gurajada | Hima Karanam | Naweed Khan | Dinesh Khandelwal | Young-Suk Lee | Yunyao Li | Francois Luus | Ndivhuwo Makondo | Nandana Mihindukulasooriya | Tahira Naseem | Sumit Neelam | Lucian Popa | Revanth Gangi Reddy | Ryan Riegel | Gaetano Rossiello | Udit Sharma | G P Shrivatsa Bhargav | Mo Yu

pdf bib
On the Gap between Adoption and Understanding in NLP
Federico Bianchi | Dirk Hovy

pdf bib
Learning Disentangled Latent Topics for Twitter Rumour Veracity Classification
John Dougrez-Lewis | Maria Liakata | Elena Kochkina | Yulan He

pdf bib
Perceptual Models of Machine-Edited Text
Elizabeth Merkhofer | Monica-Ann Mendoza | Rebecca Marvin | John Henderson

pdf bib
Scaling Within Document Coreference to Long Texts
Raghuveer Thirukovalluru | Nicholas Monath | Kumar Shridhar | Manzil Zaheer | Mrinmaya Sachan | Andrew McCallum

pdf bib
LEWIS: Levenshtein Editing for Unsupervised Text Style Transfer
Machel Reid | Victor Zhong

pdf bib
Constructing Flow Graphs from Procedural Cybersecurity Texts
Kuntal Kumar Pal | Kazuaki Kashihara | Pratyay Banerjee | Swaroop Mishra | Ruoyu Wang | Chitta Baral

pdf bib
Cluster-Former: Clustering-based Sparse Transformer for Question Answering
Shuohang Wang | Luowei Zhou | Zhe Gan | Yen-Chun Chen | Yuwei Fang | Siqi Sun | Yu Cheng | Jingjing Liu

pdf bib
Minimally-Supervised Morphological Segmentation using Adaptor Grammars with Linguistic Priors
Ramy Eskander | Cass Lowry | Sujay Khandagale | Francesca Callejas | Judith Klavans | Maria Polinsky | Smaranda Muresan

pdf bib
Multi-Task Learning and Adapted Knowledge Models for Emotion-Cause Extraction
Elsbeth Turcan | Shuai Wang | Rishita Anubhai | Kasturi Bhattacharjee | Yaser Al-Onaizan | Smaranda Muresan

pdf bib
The Utility and Interplay of Gazetteers and Entity Segmentation for Named Entity Recognition in English
Oshin Agarwal | Ani Nenkova

pdf bib
On the Cost-Effectiveness of Stacking of Neural and Non-Neural Methods for Text Classification: Scenarios and Performance Prediction
Christian Gomes | Marcos Goncalves | Leonardo Rocha | Sergio Canuto

pdf bib
Unsupervised Domain Adaptation for Event Detection using Domain-specific Adapters
Nghia Ngo Trung | Duy Phung | Thien Huu Nguyen

pdf bib
Predicting in-hospital mortality by combining clinical notes with time-series data
Iman Deznabi | Mohit Iyyer | Madalina Fiterau

pdf bib
Sequence Models for Computational Etymology of Borrowings
Winston Wu | Kevin Duh | David Yarowsky

pdf bib
Learning Contextualized Knowledge Structures for Commonsense Reasoning
Jun Yan | Mrigank Raman | Aaron Chan | Tianyu Zhang | Ryan Rossi | Handong Zhao | Sungchul Kim | Nedim Lipka | Xiang Ren

pdf bib
Analyzing Stereotypes in Generative Text Inference Tasks
Anna Sotnikova | Yang Trista Cao | Hal Daumé III | Rachel Rudinger

pdf bib
HySPA: Hybrid Span Generation for Scalable Text-to-Graph Extraction
Liliang Ren | Chenkai Sun | Heng Ji | Julia Hockenmaier

pdf bib
Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation
Varun Gangal | Harsh Jhamtani | Eduard Hovy | Taylor Berg-Kirkpatrick

pdf bib
Who Blames or Endorses Whom? Entity-to-Entity Directed Sentiment Extraction in News Text
Kunwoo Park | Zhufeng Pan | Jungseock Joo

pdf bib
New Dataset and Strong Baselines for the Grammatical Error Correction of Russian
Viet Anh Trinh | Alla Rozovskaya

pdf bib
A Formidable Ability: Detecting Adjectival Extremeness with DSMs
Farhan Samir | Barend Beekhuizen | Suzanne Stevenson

pdf bib
Effective Attention Sheds Light On Interpretability
Kaiser Sun | Ana Marasović

pdf bib
Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models
Logan Born | Kathryn Kelley | M. Willis Monroe | Anoop Sarkar

pdf bib
On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
Tianchu Ji | Shraddhan Jain | Michael Ferdman | Peter Milder | H. Andrew Schwartz | Niranjan Balasubramanian

pdf bib
Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?
Jieyu Zhao | Daniel Khashabi | Tushar Khot | Ashish Sabharwal | Kai-Wei Chang

pdf bib
Unsupervised Label Refinement Improves Dataless Text Classification
Zewei Chu | Karl Stratos | Kevin Gimpel

pdf bib
Prompting Contrastive Explanations for Commonsense Reasoning Tasks
Bhargavi Paranjape | Julian Michael | Marjan Ghazvininejad | Hannaneh Hajishirzi | Luke Zettlemoyer

pdf bib
SMS Spam Detection Through Skip-gram Embeddings and Shallow Networks
Gustavo Sousa | Daniel Carlos Guimarães Pedronette | João Paulo Papa | Ivan Rizzo Guilherme

pdf bib
Hierarchical Task Learning from Language Instructions with Unified Transformers and Self-Monitoring
Yichi Zhang | Joyce Chai

pdf bib
Marked Attribute Bias in Natural Language Inference
Hillary Dawkins

pdf bib
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Hu Xu | Gargi Ghosh | Po-Yao Huang | Prahal Arora | Masoumeh Aminzadeh | Christoph Feichtenhofer | Florian Metze | Luke Zettlemoyer

pdf bib
Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence
Andrew Halterman | Katherine Keith | Sheikh Sarwar | Brendan O’Connor

pdf bib
Memory-Efficient Differentiable Transformer Architecture Search
Yuekai Zhao | Li Dong | Yelong Shen | Zhihua Zhang | Furu Wei | Weizhu Chen

pdf bib
On the Copying Behaviors of Pre-Training for Neural Machine Translation
Xuebo Liu | Longyue Wang | Derek F. Wong | Liang Ding | Lidia S. Chao | Shuming Shi | Zhaopeng Tu

pdf bib
Answer Generation for Retrieval-based Question Answering Systems
Chao-Chun Hsu | Eric Lind | Luca Soldaini | Alessandro Moschitti

pdf bib
Grounding ‘Grounding’ in NLP
Khyathi Raghavi Chandu | Yonatan Bisk | Alan W Black

pdf bib
Federated Chinese Word Segmentation with Global Character Associations
Yuanhe Tian | Guimin Chen | Han Qin | Yan Song

pdf bib
PSED: A Dataset for Selecting Emphasis in Presentation Slides
Amirreza Shirani | Giai Tran | Hieu Trinh | Franck Dernoncourt | Nedim Lipka | Jose Echevarria | Thamar Solorio | Paul Asente

pdf bib
MLMLM: Link Prediction with Mean Likelihood Masked Language Model
Louis Clouatre | Philippe Trempe | Amal Zouaq | Sarath Chandar

pdf bib
Modulating Language Models with Emotions
Ruibo Liu | Jason Wei | Chenyan Jia | Soroush Vosoughi

pdf bib
Effective Batching for Recurrent Neural Network Grammars
Hiroshi Noji | Yohei Oseki

pdf bib
Verb Sense Clustering using Contextualized Word Representations for Semantic Frame Induction
Kosuke Yamada | Ryohei Sasano | Koichi Takeda

pdf bib
Benchmarking Neural Topic Models: An Empirical Study
Thanh-Nam Doan | Tuan-Anh Hoang

pdf bib
Enhancing Chinese Word Segmentation via Pseudo Labels for Practicability
Kaiyu Huang | Junpeng Liu | Degen Huang | Deyi Xiong | Zhuang Liu | Jinsong Su

pdf bib
Analysis of Tree-Structured Architectures for Code Generation
Samip Dahal | Adyasha Maharana | Mohit Bansal

pdf bib
How Does Distilled Data Complexity Impact the Quality and Confidence of Non-Autoregressive Machine Translation?
Weijia Xu | Shuming Ma | Dongdong Zhang | Marine Carpuat

pdf bib
Leveraging Topic Relatedness for Argument Persuasion
Xinran Zhao | Esin Durmus | Hongming Zhang | Claire Cardie

pdf bib
One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers
Chuhan Wu | Fangzhao Wu | Yongfeng Huang

pdf bib
Logic-Consistency Text Generation from Semantic Parses
Chang Shu | Yusen Zhang | Xiangyu Dong | Peng Shi | Tao Yu | Rui Zhang

pdf bib
Inducing Semantic Roles Without Syntax
Julian Michael | Luke Zettlemoyer

pdf bib
Plot and Rework: Modeling Storylines for Visual Storytelling
Chi-yang Hsu | Yun-Wei Chu | Ting-Hao Huang | Lun-Wei Ku

pdf bib
Disentangled Code Representation Learning for Multiple Programming Languages
Jingfeng Zhang | Haiwen Hong | Yin Zhang | Yao Wan | Ye Liu | Yulei Sui

pdf bib
Exploring Self-Identified Counseling Expertise in Online Support Forums
Allison Lahnala | Yuntian Zhao | Charles Welch | Jonathan K. Kummerfeld | Lawrence C An | Kenneth Resnicow | Rada Mihalcea | Verónica Pérez-Rosas

pdf bib
An Investigation of Suitability of Pre-Trained Language Models for Dialogue Generation – Avoiding Discrepancies
Yan Zeng | Jian-Yun Nie

pdf bib
Learning to Sample Replacements for ELECTRA Pre-Training
Yaru Hao | Li Dong | Hangbo Bao | Ke Xu | Furu Wei

pdf bib
Reordering Examples Helps during Priming-based Few-Shot Learning
Sawan Kumar | Partha Talukdar

pdf bib
Constrained Labeled Data Generation for Low-Resource Named Entity Recognition
Ruohao Guo | Dan Roth

pdf bib
He is very intelligent, she is very beautiful? On Mitigating Social Biases in Language Modelling and Generation
Aparna Garimella | Akhash Amarnath | Kiran Kumar | Akash Pramod Yalla | Anandhavelu N | Niyati Chhaya | Balaji Vasan Srinivasan

pdf bib
Task-adaptive Pre-training of Language Models with Word Embedding Regularization
Kosuke Nishida | Kyosuke Nishida | Sen Yoshida

pdf bib
Do Grammatical Error Correction Models Realize Grammatical Generalization?
Masato Mita | Hitomi Yanaka

pdf bib
Domain-Aware Dependency Parsing for Questions
Aparna Garimella | Laura Chiticariu | Yunyao Li

pdf bib
Using Social and Linguistic Information to Adapt Pretrained Representations for Political Perspective Identification
Chang Li | Dan Goldwasser

pdf bib
Enhancing Dialogue-based Relation Extraction by Speaker and Trigger Words Prediction
Tianyang Zhao | Zhao Yan | Yunbo Cao | Zhoujun Li

pdf bib
Modeling Event-Pair Relations in External Knowledge Graphs for Script Reasoning
Yucheng Zhou | Xiubo Geng | Tao Shen | Jian Pei | Wenqiang Zhang | Daxin Jiang

pdf bib
PROST: Physical Reasoning about Objects through Space and Time
Stéphane Aroca-Ouellette | Cory Paik | Alessandro Roncone | Katharina Kann

pdf bib
Revisiting the Evaluation of End-to-end Event Extraction
Shun Zheng | Wei Cao | Wei Xu | Jiang Bian

pdf bib
Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR
Junkun Chen | Mingbo Ma | Renjie Zheng | Liang Huang

pdf bib
HIT - A Hierarchically Fused Deep Attention Network for Robust Code-mixed Language Representation
Ayan Sengupta | Sourabh Kumar Bhattacharjee | Tanmoy Chakraborty | Md. Shad Akhtar

pdf bib
Semi-Supervised Data Programming with Subset Selection
Ayush Maheshwari | Oishik Chatterjee | Krishnateja Killamsetty | Ganesh Ramakrishnan | Rishabh Iyer

pdf bib
Fingerprinting Fine-tuned Language Models in the Wild
Nirav Diwan | Tanmoy Chakraborty | Zubair Shafiq

pdf bib
Analyzing Code Embeddings for Coding Clinical Narratives
Wei Shi | Jiewen Wu | Xiwen Yang | Nancy Chen | Ivan Ho Mien | Jung-Jae Kim | Pavitra Krishnaswamy

pdf bib
Automatic Construction of Sememe Knowledge Bases via Dictionaries
Fanchao Qi | Yangyi Chen | Fengyu Wang | Zhiyuan Liu | Xiao Chen | Maosong Sun

pdf bib
Rule-Aware Reinforcement Learning for Knowledge Graph Reasoning
Zhongni Hou | Xiaolong Jin | Zixuan Li | Long Bai

pdf bib
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Tahmid Hasan | Abhik Bhattacharjee | Md. Saiful Islam | Kazi Mubasshir | Yuan-Fang Li | Yong-Bin Kang | M. Sohel Rahman | Rifat Shahriyar

pdf bib
Use of Formal Ethical Reviews in NLP Literature: Historical Trends and Current Practices
Sebastin Santy | Anku Rani | Monojit Choudhury

pdf bib
As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation
Jun Wang | Chang Xu | Francisco Guzmán | Ahmed El-Kishky | Benjamin Rubinstein | Trevor Cohn

pdf bib
Investigating Memorization of Conspiracy Theories in Text Generation
Sharon Levy | Michael Saxon | William Yang Wang

pdf bib
A Text-Centered Shared-Private Framework via Cross-Modal Prediction for Multimodal Sentiment Analysis
Yang Wu | Zijie Lin | Yanyan Zhao | Bing Qin | Li-Nan Zhu

pdf bib
What Would a Teacher Do? Predicting Future Talk Moves
Ananya Ganesh | Martha Palmer | Katharina Kann

pdf bib
BioGen: Generating Biography Summary under Table Guidance on Wikipedia
Shen Gao | Xiuying Chen | Chang Liu | Dongyan Zhao | Rui Yan

pdf bib
Multilingual Simultaneous Neural Machine Translation
Philip Arthur | Dongwon Ryu | Gholamreza Haffari

pdf bib
Cross-Domain Review Generation for Aspect-Based Sentiment Analysis
Jianfei Yu | Chenggong Gong | Rui Xia

pdf bib
On the Language Coverage Bias for Neural Machine Translation
Shuo Wang | Zhaopeng Tu | Zhixing Tan | Shuming Shi | Maosong Sun | Yang Liu

pdf bib
Named Entity Recognition via Noise Aware Training Mechanism with Data Filter
Xiusheng Huang | Yubo Chen | Shun Wu | Jun Zhao | Yuantao Xie | Weijian Sun

pdf bib
A Multi-Task Approach for Improving Biomedical Named Entity Recognition by Incorporating Multi-Granularity information
Yiqi Tong | Yidong Chen | Xiaodong Shi

pdf bib
EBERT: Efficient BERT Inference with Dynamic Structured Pruning
Zejian Liu | Fanrong Li | Gang Li | Jian Cheng

pdf bib
Strong and Light Baseline Models for Fact-Checking Joint Inference
Kateryna Tymoshenko | Alessandro Moschitti

pdf bib
Sketch and Refine: Towards Faithful and Informative Table-to-Text Generation
Peng Wang | Junyang Lin | An Yang | Chang Zhou | Yichang Zhang | Jingren Zhou | Hongxia Yang

pdf bib
TILGAN: Transformer-based Implicit Latent GAN for Diverse and Coherent Text Generation
Shizhe Diao | Xinwei Shen | Kashun Shum | Yan Song | Tong Zhang

pdf bib
John praised Mary because _he_? Implicit Causality Bias and Its Interaction with Explicit Cues in LMs
Yova Kementchedjhieva | Mark Anderson | Anders Søgaard

pdf bib
Do It Once: An Embarrassingly Simple Joint Matching Approach to Response Selection
Linhao Zhang | Dehong Ma | Sujian Li | Houfeng Wang

pdf bib
Climbing the Tower of Treebanks: Improving Low-Resource Dependency Parsing via Hierarchical Source Selection
Goran Glavaš | Ivan Vulić

pdf bib
Enhancing the Open-Domain Dialogue Evaluation in Latent Space
Zhangming Chan | Lemao Liu | Juntao Li | Haisong Zhang | Dongyan Zhao | Shuming Shi | Rui Yan

pdf bib
Adapting Monolingual Models: Data can be Scarce when Language Similarity is High
Wietse de Vries | Martijn Bartelds | Malvina Nissim | Martijn Wieling

pdf bib
BatchMixup: Improving Training by Interpolating Hidden States of the Entire Mini-batch
Wenpeng Yin | Huan Wang | Jin Qu | Caiming Xiong

pdf bib
DocNLI: A Large-scale Dataset for Document-level Natural Language Inference
Wenpeng Yin | Dragomir Radev | Caiming Xiong

pdf bib
Rule Augmented Unsupervised Constituency Parsing
Atul Sahay | Anshul Nasery | Ayush Maheshwari | Ganesh Ramakrishnan | Rishabh Iyer

pdf bib
Are Multilingual Models the Best Choice for Moderately Under-resourced Languages? A Comprehensive Assessment for Catalan
Jordi Armengol-Estapé | Casimiro Pio Carrino | Carlos Rodriguez-Penagos | Ona de Gibert Bonet | Carme Armentano-Oller | Aitor Gonzalez-Agirre | Maite Melero | Marta Villegas

pdf bib
How transfer learning impacts linguistic knowledge in deep NLP models?
Nadir Durrani | Hassan Sajjad | Fahim Dalvi

pdf bib
Language Models Use Monotonicity to Assess NPI Licensing
Jaap Jumelet | Milica Denic | Jakub Szymanik | Dieuwke Hupkes | Shane Steinert-Threlkeld

pdf bib
Slot Transferability for Cross-domain Slot Filling
Hengtong Lu | Zhuoxin Han | Caixia Yuan | Xiaojie Wang | Shuyu Lei | Huixing Jiang | Wei Wu

pdf bib
Word Graph Guided Summarization for Radiology Findings
Jinpeng Hu | Jianling Li | Zhihong Chen | Yaling Shen | Yan Song | Xiang Wan | Tsung-Hui Chang

pdf bib
Generalized Supervised Attention for Text Generation
Yixian Liu | Liwen Zhang | Xinyu Zhang | Yong Jiang | Yue Zhang | Kewei Tu

pdf bib
Uncertainty Aware Review Hallucination for Science Article Classification
Korbinian Friedl | Georgios Rizos | Lukas Stappen | Madina Hasan | Lucia Specia | Thomas Hain | Björn Schuller

pdf bib
Automatically Select Emotion for Response via Personality-affected Emotion Transition
Zhiyuan Wen | Jiannong Cao | Ruosong Yang | Shuaiqi Liu | Jiaxing Shen

pdf bib
Highlight-Transformer: Leveraging Key Phrase Aware Attention to Improve Abstractive Multi-Document Summarization
Shuaiqi Liu | Jiannong Cao | Ruosong Yang | Zhiyuan Wen

pdf bib
Phrase-Level Action Reinforcement Learning for Neural Dialog Response Generation
Takato Yamazaki | Akiko Aizawa

pdf bib
Automatic Speech Recognition in Sanskrit: A New Speech Corpus and Modelling Insights
Devaraja Adiga | Rishabh Kumar | Amrith Krishna | Preethi Jyothi | Ganesh Ramakrishnan | Pawan Goyal

pdf bib
Constraint based Knowledge Base Distillation in End-to-End Task Oriented Dialogs
Dinesh Raghu | Atishya Jain | Mausam | Sachindra Joshi

pdf bib
DialogSum: A Real-Life Scenario Dialogue Summarization Dataset
Yulong Chen | Yang Liu | Liang Chen | Yue Zhang

pdf bib
What Did You Refer to? Evaluating Co-References in Dialogue
Wei-Nan Zhang | Yue Zhang | Hanlin Tang | Zhengyu Zhao | Caihai Zhu | Ting Liu

pdf bib
Beyond Metadata: What Paper Authors Say About Corpora They Use
Nikolay Kolyada | Martin Potthast | Benno Stein

pdf bib
Knowledge Distillation for Quality Estimation
Amit Gajbhiye | Marina Fomicheva | Fernando Alva-Manchego | Frédéric Blain | Abiola Obamuyide | Nikolaos Aletras | Lucia Specia

pdf bib
Cross-document Coreference Resolution over Predicted Mentions
Arie Cattan | Alon Eirew | Gabriel Stanovsky | Mandar Joshi | Ido Dagan

pdf bib
Controllable Abstractive Dialogue Summarization with Sketch Supervision
Chien-Sheng Wu | Linqing Liu | Wenhao Liu | Pontus Stenetorp | Caiming Xiong

pdf bib
Elaborative Simplification: Content Addition and Explanation Generation in Text Simplification
Neha Srikanth | Junyi Jessy Li

pdf bib
Could you give me a hint ? Generating inference graphs for defeasible reasoning
Aman Madaan | Dheeraj Rajagopal | Niket Tandon | Yiming Yang | Eduard Hovy

pdf bib
Characterizing Social Spambots by their Human Traits
Salvatore Giorgi | Lyle Ungar | H. Andrew Schwartz


Findings of the Association for Computational Linguistics: EMNLP 2021

pdf bib
Findings of the Association for Computational Linguistics: EMNLP 2021
Marie-Francine Moens | Xuanjing Huang | Lucia Specia | Scott Wen-tau Yih

pdf bib
K-PLUG: Knowledge-injected Pre-trained Language Model for Natural Language Understanding and Generation in E-Commerce
Song Xu | Haoran Li | Peng Yuan | Yujia Wang | Youzheng Wu | Xiaodong He | Ying Liu | Bowen Zhou

Existing pre-trained language models (PLMs) have demonstrated the effectiveness of self-supervised learning for a broad range of natural language processing (NLP) tasks. However, most of them are not explicitly aware of domain-specific knowledge, which is essential for downstream tasks in many domains, such as tasks in e-commerce scenarios. In this paper, we propose K-PLUG, a knowledge-injected pre-trained language model based on the encoder-decoder transformer that can be transferred to both natural language understanding and generation tasks. Specifically, we propose five knowledge-aware self-supervised pre-training objectives to formulate the learning of domain-specific knowledge, including e-commerce domain-specific knowledge-bases, aspects of product entities, categories of product entities, and unique selling propositions of product entities. We verify our method in a diverse range of e-commerce scenarios that require domain-specific knowledge, including product knowledge base completion, abstractive product summarization, and multi-turn dialogue. K-PLUG significantly outperforms baselines across the board, which demonstrates that the proposed method effectively learns a diverse set of domain-specific knowledge for both language understanding and generation tasks. Our code is available.

pdf bib
Extracting Topics with Simultaneous Word Co-occurrence and Semantic Correlation Graphs: Neural Topic Modeling for Short Texts
Yiming Wang | Ximing Li | Xiaotang Zhou | Jihong Ouyang

Short text, e.g., Twitter posts, news titles, and product reviews, has become an increasingly common form of text data. Extracting semantic topics from short texts plays a significant role in a wide spectrum of NLP applications, and neural topic modeling is now a major tool to achieve it. To learn more coherent and semantically meaningful topics, in this paper we develop a novel neural topic model named Dual Word Graph Topic Model (DWGTM), which extracts topics from simultaneous word co-occurrence and semantic correlation graphs. To be specific, we learn word features from the global word co-occurrence graph, so as to ingest rich word co-occurrence information; we then generate text features with word features, and feed them into an encoder network to get per-text topic proportions; finally, we reconstruct texts and the word co-occurrence graph with topical distributions and word features, respectively. In addition, to capture the semantics of words, we also apply word features to reconstruct a word semantic correlation graph computed from pre-trained word embeddings. Building on these ideas, we formulate DWGTM in an auto-encoding paradigm and train it efficiently in the spirit of neural variational inference. Empirical results validate that DWGTM can generate more semantically coherent topics than baseline topic models.

pdf bib
Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
Chenyu You | Nuo Chen | Yuexian Zou

Spoken question answering (SQA) requires fine-grained understanding of both spoken documents and questions for the optimal answer prediction. In this paper, we propose novel training schemes for spoken question answering with a self-supervised training stage and a contrastive representation learning stage. In the self-supervised stage, we propose three auxiliary self-supervised tasks, including utterance restoration, utterance insertion, and question discrimination, and jointly train the model to capture consistency and coherence among speech documents without any additional data or annotations. We then propose to learn noise-invariant utterance representations in a contrastive objective by adopting multiple augmentation strategies, including span deletion and span substitution. Besides, we design a Temporal-Alignment attention to semantically align the speech-text clues in the learned common space and benefit the SQA tasks. By this means, the training schemes can more effectively guide the generation model to predict more proper answers. Experimental results show that our model achieves state-of-the-art results on three SQA benchmarks. Our code will be publicly available after publication.

pdf bib
Language Clustering for Multilingual Named Entity Recognition
Kyle Shaffer

Recent work in multilingual natural language processing has shown progress in various tasks such as natural language inference and joint multilingual translation. Despite success in learning across many languages, challenges arise where multilingual training regimes often boost performance on some languages at the expense of others. For multilingual named entity recognition (NER) we propose a simple technique that groups similar languages together by using embeddings from a pre-trained masked language model, and automatically discovering language clusters in this embedding space. Specifically, we fine-tune an XLM-Roberta model on a language identification task, and use embeddings from this model for clustering. We conduct experiments on 15 diverse languages in the WikiAnn dataset and show our technique largely outperforms three baselines: (1) training a multilingual model jointly on all available languages, (2) training one monolingual model per language, and (3) grouping languages by linguistic family. We also conduct analyses showing meaningful multilingual transfer for low-resource languages (Swahili and Yoruba), despite being automatically grouped with other seemingly disparate languages.

pdf bib
Neural News Recommendation with Collaborative News Encoding and Structural User Encoding
Zhiming Mao | Xingshan Zeng | Kam-Fai Wong

Automatic news recommendation has gained much attention from the academic community and industry. Recent studies reveal that the key to this task lies within the effective representation learning of both news and users. Existing works typically encode news title and content separately while neglecting their semantic interaction, which is inadequate for news text comprehension. Besides, previous models encode user browsing history without leveraging the structural correlation of user browsed news to reflect user interests explicitly. In this work, we propose a news recommendation framework consisting of collaborative news encoding (CNE) and structural user encoding (SUE) to enhance news and user representation learning. CNE equipped with bidirectional LSTMs encodes news title and content collaboratively with cross-selection and cross-attention modules to learn semantic-interactive news representations. SUE utilizes graph convolutional networks to extract cluster-structural features of user history, followed by intra-cluster and inter-cluster attention modules to learn hierarchical user interest representations. Experiment results on the MIND dataset validate the effectiveness of our model to improve the performance of news recommendation.

pdf bib
Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data
Dian Yu | Kai Sun | Dong Yu | Claire Cardie

Despite considerable progress, most machine reading comprehension (MRC) tasks still lack sufficient training data to fully exploit powerful deep neural network models with millions of parameters, and it is laborious, expensive, and time-consuming to create large-scale, high-quality MRC data through crowdsourcing. This paper focuses on generating more training data for MRC tasks by leveraging existing question-answering (QA) data. We first collect a large-scale multi-subject multiple-choice QA dataset for Chinese, ExamQA. We next use incomplete, yet relevant snippets returned by a web search engine as the context for each QA instance to convert it into a weakly-labeled MRC instance. To better use the weakly-labeled data to improve a target MRC task, we evaluate and compare several methods and further propose a self-teaching paradigm. Experimental results show that, upon state-of-the-art MRC baselines, we can obtain +5.1% in accuracy on a multiple-choice Chinese MRC dataset, C^3, and +3.8% in exact match on an extractive Chinese MRC dataset, CMRC 2018, demonstrating the usefulness of the generated QA-based weakly-labeled data for different types of MRC tasks as well as the effectiveness of self-teaching. ExamQA will be available at https://dataset.org/examqa/.

pdf bib
A Web Scale Entity Extraction System
Xuanting Cai | Quanbin Ma | Jianyu Liu | Pan Li | Qi Zeng | Zhengkan Yang | Pushkar Tripathi

Understanding the semantic meaning of content on the web through the lens of entities and concepts has many practical advantages. However, when building large-scale entity extraction systems, practitioners face unique challenges in finding the best ways to leverage the scale and variety of data available on internet platforms. We present lessons learned from our efforts in building an entity extraction system for multiple document types at large scale using multi-modal Transformers. We empirically demonstrate the effectiveness of multi-lingual, multi-task and cross-document type learning. We also discuss the label collection schemes that help to minimize the amount of noise in the collected data.

pdf bib
Joint Multimedia Event Extraction from Video and Article
Brian Chen | Xudong Lin | Christopher Thomas | Manling Li | Shoya Yoshida | Lovish Chum | Heng Ji | Shih-Fu Chang

Visual and textual modalities contribute complementary information about events described in multimedia documents. Videos contain rich dynamics and detailed unfoldings of events, while text describes more high-level and abstract concepts. However, existing event extraction methods either do not handle video or solely target video while ignoring other modalities. In contrast, we propose the first approach to jointly extract events from both video and text articles. We introduce the new task of Video MultiMedia Event Extraction and propose two novel components to build the first system towards this task. First, we propose the first self-supervised cross-modal event coreference model that can determine coreference between video events and text events without any manually annotated pairs. Second, we introduce the first cross-modal transformer architecture, which extracts structured event information from both videos and text documents. We also construct and will publicly release a new benchmark of 860 video-article pairs with extensive annotations for evaluating methods on this task. Our experimental results demonstrate the effectiveness of our proposed method on our new benchmark dataset. We achieve 6.0% and 5.8% absolute F-score gains on multimodal event coreference resolution and multimedia event extraction, respectively.

pdf bib
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding
Yuechen Wang | Wengang Zhou | Houqiang Li

Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the expensive cost of manual annotations for temporal boundary labels, we focus on the weakly supervised setting, where only video-level descriptions are provided for training. Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. However, the temporal structure of the video as well as the complicated semantics in the sentence are lost during the learning. In this work, we propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of viewing the sentence and candidate moments as a whole, FSAN learns token-by-clip cross-modal semantic alignment by an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of the map. Extensive experiments are conducted on two widely-used benchmarks, ActivityNet-Captions and DiDeMo, where our FSAN achieves state-of-the-art performance.

pdf bib
Factual Consistency Evaluation for Text Summarization via Counterfactual Estimation
Yuexiang Xie | Fei Sun | Yang Deng | Yaliang Li | Bolin Ding

Although significant progress has been achieved in text summarization, factual inconsistency in generated summaries still severely limits its practical applications. Among the key factors to ensure factual consistency, a reliable automatic evaluation metric is the first and the most crucial one. However, existing metrics either neglect the intrinsic cause of the factual inconsistency or rely on auxiliary tasks, leading to an unsatisfactory correlation with human judgments or making them inconvenient to use in practice. In light of these challenges, we propose a novel metric to evaluate the factual consistency in text summarization via counterfactual estimation, which formulates the causal relationship among the source document, the generated summary, and the language prior. We remove the effect of language prior, which can cause factual inconsistency, from the total causal effect on the generated summary, and provide a simple yet effective way to evaluate consistency without relying on other auxiliary tasks. We conduct a series of experiments on three public abstractive text summarization datasets, and demonstrate the advantages of the proposed metric in both improving the correlation with human judgments and the convenience of usage. The source code is available at https://github.com/xieyxclack/factual_coco.

pdf bib
Cross-Modal Retrieval Augmentation for Multi-Modal Classification
Shir Gur | Natalia Neverova | Chris Stauffer | Ser-Nam Lim | Douwe Kiela | Austin Reiter

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement in performance on image-caption retrieval w.r.t. similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.

pdf bib
HiTRANS: A Hierarchical Transformer Network for Nested Named Entity Recognition
Zhiwei Yang | Jing Ma | Hechang Chen | Yunke Zhang | Yi Chang

Nested Named Entity Recognition (NNER) has been extensively studied, aiming to identify all nested entities from potential spans (i.e., one or more continuous tokens). However, recent studies for NNER either focus on tedious tagging schemas or utilize complex structures, which fail to learn effective span representations from the input sentence with highly nested entities. Intuitively, explicit span representations will contribute to NNER due to the rich context information they contain. In this study, we propose a Hierarchical Transformer (HiTRANS) network for the NNER task, which decomposes the input sentence into multi-grained spans and enhances the representation learning in a hierarchical manner. Specifically, we first utilize a two-phase module to generate span representations by aggregating context information based on a bottom-up and top-down transformer network. Then a label prediction layer is designed to recognize nested entities hierarchically, which naturally explores semantic dependencies among different spans. Experiments on GENIA, ACE-2004, ACE-2005 and NNE datasets demonstrate that our proposed method achieves much better performance than the state-of-the-art approaches.

pdf bib
Improving Embedding-based Large-scale Retrieval via Label Enhancement
Peiyang Liu | Xi Wang | Sen Wang | Wei Ye | Xiangyu Xi | Shikun Zhang

Current embedding-based large-scale retrieval models are trained with 0-1 hard labels that indicate whether a query is relevant to a document, ignoring rich information about the degree of relevance. This paper proposes to improve embedding-based retrieval from the perspective of better characterizing the query-document relevance degree by introducing label enhancement (LE) for the first time. To generate label distributions in the retrieval scenario, we design a novel and effective supervised LE method that incorporates prior knowledge from dynamic term weighting methods into contextual embeddings. By training models with the generated label distribution as auxiliary supervision information, our method significantly outperforms four competitive existing retrieval models and their counterparts equipped with two alternative LE techniques. The superiority can be easily observed on English and Chinese large-scale retrieval tasks under both standard and cold-start settings.

pdf bib
Improving Privacy Guarantee and Efficiency of Latent Dirichlet Allocation Model Training Under Differential Privacy
Tao Huang | Hong Chen

Latent Dirichlet allocation (LDA), a widely used topic model, is often employed as a fundamental tool for text analysis in various applications. However, the training process of the LDA model typically requires massive text corpus data. On one hand, such massive data may expose private information in the training data, thereby incurring significant privacy concerns. On the other hand, the efficiency of the LDA model training may be impacted, since LDA training often needs to handle these massive text corpus data. To address the privacy issues in LDA model training, some recent works have combined LDA training algorithms that are based on collapsed Gibbs sampling (CGS) with differential privacy. Nevertheless, these works usually have a high accumulative privacy budget due to vast iterations in CGS. Moreover, these works always have low efficiency due to handling massive text corpus data. To improve the privacy guarantee and efficiency, we combine a subsampling method with CGS and propose a novel LDA training algorithm with differential privacy, SUB-LDA. We find that subsampling in CGS naturally improves efficiency while amplifying privacy. We propose a novel metric, the efficiency–privacy function, to evaluate improvements of the privacy guarantee and efficiency. Based on a conventional subsampling method, we propose an adaptive subsampling method to improve the model’s utility produced by SUB-LDA when the subsampling ratio is small. We provide a comprehensive analysis of SUB-LDA, and the experiment results validate its efficiency and privacy guarantee improvements.

pdf bib
Generating Mammography Reports from Multi-view Mammograms with BERT
Alexander Yalunin | Elena Sokolova | Ilya Burenko | Alexander Ponomarchuk | Olga Puchkova | Dmitriy Umerenkov

Writing mammography reports can be error-prone and time-consuming for radiologists. In this paper we propose a method to generate mammography reports given four images, corresponding to the four views used in screening mammography. To the best of our knowledge our work represents the first attempt to generate the mammography report using deep-learning. We propose an encoder-decoder model that includes an EfficientNet-based encoder and a Transformer-based decoder. We demonstrate that the Transformer-based attention mechanism can combine visual and semantic information to localize salient regions on the input mammograms and generate a visually interpretable report. The conducted experiments, including an evaluation by a certified radiologist, show the effectiveness of the proposed method.

pdf bib
Euphemistic Phrase Detection by Masked Language Model
Wanzheng Zhu | Suma Bhat

It is a well-known approach for fringe groups and organizations to use euphemisms—ordinary-sounding and innocent-looking words with a secret meaning—to conceal what they are discussing. For instance, drug dealers often use “pot” for marijuana and “avocado” for heroin. From a social media content moderation perspective, though recent advances in NLP have enabled the automatic detection of such single-word euphemisms, no existing work is capable of automatically detecting multi-word euphemisms, such as “blue dream” (marijuana) and “black tar” (heroin). Our paper tackles the problem of euphemistic phrase detection without human effort for the first time, as far as we are aware. We first perform phrase mining on a raw text corpus (e.g., social media posts) to extract quality phrases. Then, we utilize word embedding similarities to select a set of euphemistic phrase candidates. Finally, we rank those candidates by a masked language model—SpanBERT. Compared to strong baselines, we report 20-50% higher detection accuracies using our algorithm for detecting euphemistic phrases.

pdf bib
Decomposing Complex Questions Makes Multi-Hop QA Easier and More Interpretable
Ruiliu Fu | Han Wang | Xuejun Zhang | Jun Zhou | Yonghong Yan

Multi-hop QA requires the machine to answer complex questions by finding multiple clues and reasoning over them, and to provide explanatory evidence that demonstrates its reasoning process. We propose Relation Extractor-Reader and Comparator (RERC), a three-stage framework based on complex question decomposition. The Relation Extractor decomposes the complex question, the Reader then answers the sub-questions in turn, and finally the Comparator performs numerical comparison and summarizes all results to get the final answer; the entire process itself constitutes a complete reasoning evidence path. On the 2WikiMultiHopQA dataset, our RERC model achieves state-of-the-art performance, with a winning joint F1 score of 53.58 on the leaderboard. All indicators of our RERC are close to human performance, with the supporting-fact F1 score only 1.95 behind the human level. At the same time, the evidence path provided by our RERC framework has excellent readability and faithfulness.

pdf bib
Segmenting Natural Language Sentences via Lexical Unit Analysis
Yangming Li | Lemao Liu | Shuming Shi

Span-based models enjoy great popularity in recent work on sequence segmentation. However, each of these methods suffers from its own defects, such as invalid predictions. In this work, we introduce a unified span-based model, lexical unit analysis (LUA), that addresses all these matters. Segmenting a lexical unit sequence involves two steps. First, we embed every span by using the representations from a pretrained language model. Second, we define a score for every segmentation candidate and apply dynamic programming (DP) to extract the candidate with the maximum score. We have conducted extensive experiments on 3 tasks (e.g., syntactic chunking) across 7 datasets. LUA has established new state-of-the-art performances on 6 of them. We have achieved even better results through incorporating label correlations.
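
The dynamic-programming step described in this abstract can be illustrated with a minimal sketch. It assumes that the score of a segmentation is the sum of its span scores, which the abstract does not state; `best_segmentation`, `toy_score`, and the toy score table are hypothetical names and values chosen for illustration, and labels on spans are left out.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def best_segmentation(
    n: int, score: Callable[[int, int], float], max_len: int
) -> Tuple[float, List[Span]]:
    """Split tokens 0..n-1 into contiguous spans with maximum total score.

    Assumption: a segmentation's score is the sum of its span scores.
    """
    best = [float("-inf")] * (n + 1)  # best[e] = best score for the prefix [0, e)
    back = [0] * (n + 1)              # back[e] = start of the last span in that prefix
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + score(start, end)
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    # Follow the back-pointers from the end of the sentence to recover the spans.
    spans: List[Span] = []
    end = n
    while end > 0:
        spans.append((back[end], end))
        end = back[end]
    spans.reverse()
    return best[n], spans


# Hypothetical span scores; in a real model they would come from span embeddings.
tokens = ["New", "York", "is", "big"]
toy_scores = {(0, 2): 3.0, (2, 3): 1.0, (3, 4): 1.0}


def toy_score(start: int, end: int) -> float:
    # Unlisted single tokens score 0.0, unlisted longer spans are penalised.
    return toy_scores.get((start, end), 0.0 if end - start == 1 else -1.0)


total, spans = best_segmentation(len(tokens), toy_score, max_len=3)
print(total, [tokens[s:e] for s, e in spans])
```

Because every prefix is covered by exactly one chain of back-pointers, the result is always a valid, non-overlapping segmentation, which is the kind of invalid-prediction problem the abstract says a unified span-based formulation avoids.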

pdf bib
Dense Hierarchical Retrieval for Open-domain Question Answering
Ye Liu | Kazuma Hashimoto | Yingbo Zhou | Semih Yavuz | Caiming Xiong | Philip Yu

Dense neural text retrieval has achieved promising results on open-domain Question Answering (QA), where latent representations of questions and passages are exploited for maximum inner product search in the retrieval process. However, current dense retrievers require splitting documents into short passages that usually contain local, partial and sometimes biased context, and highly depend on the splitting process. As a consequence, it may yield inaccurate and misleading hidden representations, thus deteriorating the final retrieval result. In this work, we propose Dense Hierarchical Retrieval (DHR), a hierarchical framework which can generate accurate dense representations of passages by utilizing both macroscopic semantics in the document and microscopic semantics specific to each passage. Specifically, a document-level retriever first identifies relevant documents, among which relevant passages are then retrieved by a passage-level retriever. The ranking of the retrieved passages will be further calibrated by examining the document-level relevance. In addition, hierarchical title structure and two negative sampling strategies (i.e., In-Doc and In-Sec negatives) are investigated. We apply DHR to large-scale open-domain QA datasets. DHR significantly outperforms the original dense passage retriever, and helps an end-to-end QA system outperform the strong baselines on multiple open-domain QA benchmarks.
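
The document-then-passage flow described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the additive calibration rule and its `doc_weight` parameter are not given in the abstract, and `hierarchical_retrieve` and the toy vectors are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, the similarity used in maximum inner product search."""
    return sum(x * y for x, y in zip(a, b))


def hierarchical_retrieve(
    query: Vector,
    doc_vecs: Dict[str, Vector],
    passage_vecs: Dict[str, Dict[str, Vector]],
    top_docs: int,
    top_passages: int,
    doc_weight: float = 1.0,
) -> List[Tuple[str, str, float]]:
    """Retrieve documents first, then passages from the kept documents only."""
    # Stage 1: document-level retrieval.
    doc_scores = {d: dot(query, v) for d, v in doc_vecs.items()}
    kept = sorted(doc_scores, key=lambda d: doc_scores[d], reverse=True)[:top_docs]

    # Stage 2: passage-level retrieval restricted to the kept documents.
    scored = []
    for d in kept:
        for p, v in passage_vecs[d].items():
            # Assumed calibration: add the weighted document score to the passage score.
            scored.append((dot(query, v) + doc_weight * doc_scores[d], d, p))
    scored.sort(reverse=True)
    return [(d, p, s) for s, d, p in scored[:top_passages]]


# Toy, hand-written vectors standing in for encoder outputs.
query = [1.0, 0.0]
doc_vecs = {"A": [0.9, 0.1], "B": [0.2, 0.8], "C": [0.7, 0.3]}
passage_vecs = {
    "A": {"A1": [0.5, 0.5], "A2": [0.8, 0.0]},
    "B": {"B1": [1.0, 0.0]},
    "C": {"C1": [0.6, 0.1]},
}
results = hierarchical_retrieve(query, doc_vecs, passage_vecs, top_docs=2, top_passages=2)
print([(d, p) for d, p, _ in results])
```

In this toy setup passage `B1` has the highest passage-level similarity but is never scored, because its document is filtered out in the first stage; that is the intended effect of letting document-level relevance constrain passage ranking.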

pdf bib
Visually Grounded Concept Composition
Bowen Zhang | Hexiang Hu | Linlu Qiu | Peter Shaw | Fei Sha

We investigate ways to compose complex concepts in texts from primitive ones while grounding them in images. We propose Concept and Relation Graph (CRG), which builds on top of constituency analysis and consists of recursively combined concepts with predicate functions. Meanwhile, we propose a concept composition neural network called Composer to leverage the CRG for visually grounded concept learning. Specifically, we learn the grounding of both primitive and all composed concepts by aligning them to images and show that learning to compose leads to more robust grounding results, measured in text-to-image matching accuracy. Notably, our model can model grounded concepts forming at both the finer-grained sentence level and the coarser-grained intermediate level (or word-level). Composer leads to pronounced improvement in matching accuracy when the evaluation data has significant compound divergence from the training data.

pdf bib
Compositional Networks Enable Systematic Generalization for Grounded Language Understanding
Yen-Ling Kuo | Boris Katz | Andrei Barbu

Humans are remarkably flexible when understanding new sentences that include combinations of concepts they have never encountered before. Recent work has shown that while deep networks can mimic some human language abilities when presented with novel sentences, systematic variation uncovers the limitations in the language-understanding abilities of networks. We demonstrate that these limitations can be overcome by addressing the generalization challenges in the gSCAN dataset, which explicitly measures how well an agent is able to interpret novel linguistic commands grounded in vision, e.g., novel pairings of adjectives and nouns. The key principle we employ is compositionality: that the compositional structure of networks should reflect the compositional structure of the problem domain they address, while allowing other parameters to be learned end-to-end. We build a general-purpose mechanism that enables agents to generalize their language understanding to compositional domains. Crucially, our network has the same state-of-the-art performance as prior work while generalizing its knowledge when prior work does not. Our network also provides a level of interpretability that enables users to inspect what each part of networks learns. Robust grounded language understanding without dramatic failures and without corner cases is critical to building safe and fair robots; we demonstrate the significant role that compositionality can play in achieving that goal.

pdf bib
An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages
Xinyu Lu | Jipeng Qiang | Yun Li | Yunhao Yuan | Yi Zhu

Parallel sentence simplification (SS) corpora are scarce for neural SS modeling. We propose an unsupervised method to build SS corpora from large-scale bilingual translation corpora, alleviating the need for supervised SS corpora. Our method is motivated by the following two findings: neural machine translation models usually tend to generate more high-frequency tokens, and a difference in text complexity levels exists between the source and target languages of a translation corpus. By pairing the source sentences of a translation corpus with the translations of their references in a bridge language, we can construct large-scale pseudo-parallel SS data. Then, we keep the sentence pairs with a higher complexity difference as SS sentence pairs. Building SS corpora with an unsupervised approach can satisfy the expectations that the aligned sentences preserve the same meaning and differ in text complexity level. Experimental results show that SS methods trained on our corpora achieve state-of-the-art results and significantly outperform the results on the English benchmark WikiLarge.

pdf bib
WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach
Junjie Huang | Duyu Tang | Wanjun Zhong | Shuai Lu | Linjun Shou | Ming Gong | Daxin Jiang | Nan Duan

Producing the embedding of a sentence in an unsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained model based unsupervised sentence embeddings. We study four pretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have three main findings. First, averaging all tokens is better than using only the [CLS] vector. Second, combining both top and bottom layers is better than using only top layers. Lastly, an easy whitening-based vector normalization strategy with less than 10 lines of code consistently boosts the performance. The whole project including codes and data is publicly available at https://github.com/Jun-jie-Huang/WhiteningBERT.
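The whitening step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' released code: the function name `whiten` and the random stand-in embeddings are assumptions for illustration, and the SVD-based transform is simply one standard way to give a set of vectors zero mean and identity covariance.

```python
import numpy as np

def whiten(embeddings: np.ndarray) -> np.ndarray:
    """Map row-vector embeddings to zero mean and identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings - mu, rowvar=False)
    # cov is symmetric positive semi-definite, so its SVD is cov = U diag(s) U^T
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s))
    return (embeddings - mu) @ w

rng = np.random.default_rng(0)
# Hypothetical stand-in for sentence embeddings: 200 correlated 4-dimensional vectors
sentence_vectors = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
whitened = whiten(sentence_vectors)
```

After the transform, the sample covariance of `whitened` is the identity matrix up to floating-point error, so no single direction dominates cosine or dot-product comparisons.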

pdf bib
TWEETSUMM - A Dialog Summarization Dataset for Customer Service
Guy Feigenblat | Chulaka Gunasekara | Benjamin Sznajder | Sachindra Joshi | David Konopnicki | Ranit Aharonov

In a typical customer service chat scenario, customers contact a support center to ask for help or raise complaints, and human agents try to solve the issues. In most cases, at the end of the conversation, agents are asked to write a short summary emphasizing the problem and the proposed solution, usually for the benefit of other agents that may have to deal with the same customer or issue. The goal of the present article is advancing the automation of this task. We introduce the first large scale, high quality, customer care dialog summarization dataset with close to 6500 human annotated summaries. The data is based on real-world customer support dialogs and includes both extractive and abstractive summaries. We also introduce a new unsupervised, extractive summarization method specific to dialogs.

pdf bib
Discourse-Based Sentence Splitting
Liam Cripwell | Joël Legrand | Claire Gardent

Sentence splitting involves the segmentation of a sentence into two or more shorter sentences. It is a key component of sentence simplification, has been shown to help human comprehension and is a useful preprocessing step for NLP tasks such as summarisation and relation extraction. While several methods and datasets have been proposed for developing sentence splitting models, little attention has been paid to how sentence splitting interacts with discourse structure. In this work, we focus on cases where the input text contains a discourse connective, which we refer to as discourse-based sentence splitting. We create synthetic and organic datasets for discourse-based splitting and explore different ways of combining these datasets using different model architectures. We show that pipeline models which use discourse structure to mediate sentence splitting outperform end-to-end models in learning the various ways of expressing a discourse relation but generate text that is less grammatical; that large scale synthetic data provides a better basis for learning than smaller scale organic data; and that training on discourse-focused, rather than on general sentence splitting data provides a better basis for discourse splitting.

pdf bib
Multi-Task Dense Retrieval via Model Uncertainty Fusion for Open-Domain Question Answering
Minghan Li | Ming Li | Kun Xiong | Jimmy Lin

Multi-task dense retrieval models can be used to retrieve documents from a common corpus (e.g., Wikipedia) for different open-domain question-answering (QA) tasks. However, Karpukhin et al. (2020) show that jointly learning different QA tasks with one dense model is not always beneficial due to corpus inconsistency. For example, SQuAD only focuses on a small set of Wikipedia articles while datasets like NQ and Trivia cover more entries, and joint training on their union can cause performance degradation. To solve this problem, we propose to train individual dense passage retrievers (DPR) for different tasks and aggregate their predictions during test time, where we use uncertainty estimation as weights to indicate how likely it is that a specific query belongs to each expert’s expertise. Our method reaches state-of-the-art performance on 5 benchmark QA datasets, with up to 10% improvement in top-100 accuracy compared to a joint-training multi-task DPR on SQuAD. We also show that our method handles corpus inconsistency better than the joint-training DPR on a mixed subset of different QA datasets. Code and data are available at https://github.com/alexlimh/DPR_MUF.

pdf bib
Mining the Cause of Political Decision-Making from Social Media: A Case Study of COVID-19 Policies across the US States
Zhijing Jin | Zeyu Peng | Tejas Vaidhya | Bernhard Schoelkopf | Rada Mihalcea

Mining the causes of political decision-making is an active research area in the field of political science. In the past, most studies have focused on long-term policies that are collected over several decades of time, and have primarily relied on surveys as the main source of predictors. However, the recent COVID-19 pandemic has given rise to a new political phenomenon, where political decision-making consists of frequent short-term decisions, all on the same controlled topic—the pandemic. In this paper, we focus on the question of how public opinion influences policy decisions, while controlling for confounders such as COVID-19 case increases or unemployment rates. Using a dataset consisting of Twitter data from the 50 US states, we classify the sentiments toward governors of each state, and conduct controlled studies and comparisons. Based on the compiled samples of sentiments, policies, and confounders, we conduct causal inference to discover trends in political decision-making across different states.

pdf bib
Self-Attention Graph Residual Convolutional Networks for Event Detection with dependency relations
Anan Liu | Ning Xu | Haozhe Liu

The event detection (ED) task aims to classify events by identifying key event trigger words embedded in a piece of text. Previous research has proved the validity of fusing syntactic dependency relations into Graph Convolutional Networks (GCN). While existing GCN-based methods explore latent node-to-node dependency relations according to a stationary adjacency tensor, an attention-based dynamic tensor, which can pay much attention to the key node like event trigger or its neighboring nodes, has not been developed. Simultaneously, suffering from the phenomenon of graph information vanishing caused by the symmetric adjacency tensor, existing GCN models cannot achieve higher overall performance. In this paper, we propose a novel model Self-Attention Graph Residual Convolution Networks (SA-GRCN) to mine node-to-node latent dependency relations via self-attention mechanism and introduce Graph Residual Network (GResNet) to solve graph information vanishing problem. Specifically, a self-attention module is constructed to generate an attention tensor, representing the dependency attention scores of all words in the sentence. Furthermore, a graph residual term is added to the baseline SA-GCN to construct a GResNet. Considering the syntactic connections of the network input, we initialize the raw adjacency tensor, not processed by the self-attention module, as the residual term. We conduct experiments on the ACE2005 dataset and the results show significant improvement over competitive baseline methods.

pdf bib
Mixup Decoding for Diverse Machine Translation
Jicheng Li | Pengzhi Gao | Xuanfu Wu | Yang Feng | Zhongjun He | Hua Wu | Haifeng Wang

Diverse machine translation aims at generating various target language translations for a given source language sentence. To leverage the linear relationship in the sentence latent space introduced by the mixup training, we propose a novel method, MixDiversity, to generate different translations for the input sentence by linearly interpolating it with different sentence pairs sampled from the training corpus during decoding. To further improve the faithfulness and diversity of the translations, we propose two simple but effective approaches to select diverse sentence pairs in the training corpus and adjust the interpolation weight for each pair correspondingly. Moreover, by controlling the interpolation weight, our method can achieve the trade-off between faithfulness and diversity without any additional training, which is required in most of the previous methods. Experiments on WMT’16 en-ro, WMT’14 en-de, and WMT’17 zh-en are conducted to show that our method substantially outperforms all previous diverse machine translation methods.

pdf bib
An Alignment-Agnostic Model for Chinese Text Error Correction
Liying Zheng | Yue Deng | Weishun Song | Liang Xu | Jing Xiao

This paper investigates how to correct Chinese text errors with types of mistaken, missing and redundant characters, which are common for Chinese native speakers. Most existing models based on detect-correct framework can correct mistaken characters, but cannot handle missing or redundant characters due to inconsistency between model inputs and outputs. Although Seq2Seq-based or sequence tagging methods provide solutions to the three error types and achieved relatively good results in English context, they do not perform well in Chinese context according to our experiments. In our work, we propose a novel alignment-agnostic detect-correct framework that can handle both text aligned and non-aligned situations and can serve as a cold start model when no annotation data are provided. Experimental results on three datasets demonstrate that our method is effective and achieves a better performance than most recent published models.

pdf bib
Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer
Gi-Cheon Kang | Junseok Park | Hwaran Lee | Byoung-Tak Zhang | Jin-Hwa Kim

Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context. In this paper, we study how to address two fundamental challenges for this task: (1) reasoning over underlying semantic structures among dialog rounds and (2) identifying several appropriate answers to the given question. To address these challenges, we propose a Sparse Graph Learning (SGL) method to formulate visual dialog as a graph structure learning task. SGL infers inherently sparse dialog structures by incorporating binary and score edges and leveraging a new structural loss function. Next, we introduce a Knowledge Transfer (KT) method that extracts the answer predictions from the teacher model and uses them as pseudo labels. We propose KT to remedy the shortcomings of single ground-truth labels, which severely limit the ability of a model to obtain multiple reasonable answers. As a result, our proposed model significantly improves reasoning capability compared to baseline methods and outperforms the state-of-the-art approaches on the VisDial v1.0 dataset. The source code is available at https://github.com/gicheonkang/SGLKT-VisDial.

pdf bib
Exploring Sentence Community for Document-Level Event Extraction
Yusheng Huang | Weijia Jia

Document-level event extraction is critical to various natural language processing tasks for providing structured information. Existing approaches based on sequential modeling neglect the complex logic structures of long texts. In this paper, we leverage the entity interactions and sentence interactions within long documents and transform each document into an undirected unweighted graph by exploiting the relationship between sentences. We introduce the Sentence Community to represent each event as a subgraph. Furthermore, our framework SCDEE maintains the ability to extract multiple events by sentence community detection using graph attention networks and alleviates the role overlapping issue by predicting arguments in terms of roles. Experiments demonstrate that our framework achieves competitive results over state-of-the-art methods on the large-scale document-level event extraction dataset.

pdf bib
A Model of Cross-Lingual Knowledge-Grounded Response Generation for Open-Domain Dialogue Systems
San Kim | Jin Yea Jang | Minyoung Jung | Saim Shin

Research on open-domain dialogue systems that allow free topics is challenging in the field of natural language processing (NLP). The performance of the dialogue system has been improved recently by the method utilizing dialogue-related knowledge; however, non-English dialogue systems struggle to reproduce the performance of English dialogue systems because securing knowledge in the same language as the dialogue system is relatively difficult. Through experiments with a Korean dialogue system, this paper proves that the performance of a non-English dialogue system can be improved by utilizing English knowledge, highlighting that the system uses cross-lingual knowledge. For the experiments, we 1) constructed a Korean version of the Wizard of Wikipedia dataset, 2) built Korean-English T5 (KE-T5), a language model pre-trained with Korean and English corpus, and 3) developed a knowledge-grounded Korean dialogue model based on KE-T5. We observed the performance improvement in the open-domain Korean dialogue model even when only English knowledge was given. The experimental results showed that the knowledge inherent in cross-lingual language models can be helpful for generating responses in open dialogue systems.

pdf bib
WHOSe Heritage: Classification of UNESCO World Heritage Statements of “Outstanding Universal Value” with Soft Labels
Nan Bai | Renqian Luo | Pirouz Nourian | Ana Pereira Roders

The UNESCO World Heritage List (WHL) includes the exceptionally valuable cultural and natural heritage to be preserved for mankind. Evaluating and justifying the Outstanding Universal Value (OUV) is essential for each site inscribed in the WHL, and yet a complex task, even for experts, since the selection criteria of OUV are not mutually exclusive. Furthermore, manual annotation of heritage values and attributes from multi-source textual data, which is currently dominant in heritage studies, is knowledge-demanding and time-consuming, impeding systematic analysis of such authoritative documents in terms of their implications on heritage management. This study applies state-of-the-art NLP models to build a classifier on a new dataset containing Statements of OUV, seeking an explainable and scalable automation tool to facilitate the nomination, evaluation, research, and monitoring processes of World Heritage sites. Label smoothing is innovatively adapted to improve the model performance by adding prior inter-class relationship knowledge to generate soft labels. The study shows that the best models fine-tuned from BERT and ULMFiT can reach 94.3% top-3 accuracy. A human study with expert evaluation on the model prediction shows that the models are sufficiently generalizable. The study is promising to be further developed and applied in heritage research and practice.
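One way to picture soft labels that carry inter-class relationship knowledge is the sketch below. It is an illustration under stated assumptions rather than the study's actual procedure: the `similarity` matrix, the smoothing weight `alpha`, and the function name `soft_labels` are all hypothetical, and the smoothing mass is simply distributed over the other classes in proportion to their assumed similarity to the true class.

```python
def soft_labels(true_class, similarity, alpha=0.2):
    """Blend a one-hot target with a prior over the other classes.

    Assumes at least two classes and non-negative similarity values.
    """
    n = len(similarity)
    # Prior mass for each non-target class, proportional to its similarity to the target
    prior = [similarity[true_class][j] if j != true_class else 0.0 for j in range(n)]
    total = sum(prior)
    if total == 0:
        # No relationship information: fall back to uniform label smoothing
        prior = [1.0 if j != true_class else 0.0 for j in range(n)]
        total = float(n - 1)
    return [
        (1.0 - alpha) * (1.0 if j == true_class else 0.0) + alpha * prior[j] / total
        for j in range(n)
    ]

# Hypothetical similarity between three selection criteria (not taken from the paper)
similarity = [
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]
target = soft_labels(0, similarity)
```

With these toy values, the true class keeps most of the probability mass, and the class assumed to be more closely related receives more of the smoothing mass than the less related one.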

pdf bib
P-INT: A Path-based Interaction Model for Few-shot Knowledge Graph Completion
Jingwen Xu | Jing Zhang | Xirui Ke | Yuxiao Dong | Hong Chen | Cuiping Li | Yongbin Liu

Few-shot knowledge graph completion is to infer the unknown facts (i.e., query head-tail entity pairs) of a given relation with only a few observed reference entity pairs. Its general process is to first encode the implicit relation of an entity pair and then match the relation of a query entity pair with the relations of the reference entity pairs. Most existing methods have thus far encoded an entity pair and matched entity pairs by using the direct neighbors of concerned entities. In this paper, we propose the P-INT model for effective few-shot knowledge graph completion. First, P-INT infers and leverages the paths that can expressively encode the relation of two entities. Second, to capture the fine-grained matches, P-INT calculates the interactions of paths instead of mixing them for each entity pair. Extensive experimental results demonstrate that P-INT outperforms the state-of-the-art baselines by 11.2–14.2% in terms of Hits@1. Our codes and datasets are online now.

pdf bib
Cartography Active Learning
Mike Zhang | Barbara Plank

We propose Cartography Active Learning (CAL), a novel Active Learning (AL) algorithm that exploits the behavior of the model on individual instances during training as a proxy to find the most informative instances for labeling. CAL is inspired by data maps, which were recently proposed to derive insights into dataset quality (Swayamdipta et al., 2020). We compare our method on popular text classification tasks to commonly used AL strategies, which instead rely on post-training behavior. We demonstrate that CAL is competitive to other common AL methods, showing that training dynamics derived from small seed data can be successfully used for AL. We provide insights into our new AL method by analyzing batch-level statistics utilizing the data maps. Our results further show that CAL results in a more data-efficient learning strategy, achieving comparable or better results with considerably less training data.

pdf bib
Beyond Reptile: Meta-Learned Dot-Product Maximization between Gradients for Improved Single-Task Regularization
Akhil Kedia | Sai Chetan Chinthakindi | Wonho Ryu

Meta-learning algorithms such as MAML, Reptile, and FOMAML have led to improved performance of several neural models. The primary difference between standard gradient descent and these meta-learning approaches is that they contain as a small component the gradient for maximizing dot-product between gradients of batches, leading to improved generalization. Previous work has shown that aligned gradients are related to generalization, and has also used the Reptile algorithm in a single-task setting to improve generalization. Inspired by these approaches for a single task setting, this paper proposes to use the finite differences first-order algorithm to calculate this gradient from dot-product of gradients, allowing explicit control on the weightage of this component relative to standard gradients. We use this gradient as a regularization technique, leading to more aligned gradients between different batches. By using the finite differences approximation, our approach does not suffer from O(n^2) memory usage of naively calculating the Hessian and can be easily applied to large models with large batch sizes. Our approach achieves state-of-the-art performance on the Gigaword dataset, and shows performance improvements on several datasets such as SQuAD-v2.0, Quasar-T, NewsQA and all the SuperGLUE datasets, with a range of models such as BERT, RoBERTa and ELECTRA. Our method also outperforms previous approaches of Reptile and FOMAML when used as a regularization technique, in both single and multi-task settings. Our method is model agnostic, and introduces no extra trainable weights.
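The general idea of obtaining the gradient of a dot product of gradients without forming a Hessian can be sketched as follows. This is an illustration of the standard finite-difference Hessian-vector product, not the paper's exact algorithm: the toy quadratic losses, the step size `eps`, and the function names are assumptions. For two batch gradients g1(θ) and g2(θ), the gradient of g1·g2 is H1 g2 + H2 g1, and each Hessian-vector product is approximated by a difference of two gradient evaluations.

```python
import numpy as np

def grad_quadratic(a, theta):
    """Gradient of the toy loss 0.5 * theta^T A theta for a symmetric matrix A."""
    return a @ theta

def dot_product_gradient_fd(grad1, grad2, theta, eps=1e-3):
    """Approximate d/dtheta [grad1(theta) . grad2(theta)] = H1 g2 + H2 g1."""
    g1, g2 = grad1(theta), grad2(theta)
    # Hessian-vector products via finite differences of gradients; no Hessian is stored
    h1_g2 = (grad1(theta + eps * g2) - g1) / eps
    h2_g1 = (grad2(theta + eps * g1) - g2) / eps
    return h1_g2 + h2_g1

# Two hypothetical "batches", each with its own quadratic loss
a1 = np.array([[2.0, 0.5], [0.5, 1.0]])
a2 = np.array([[1.0, -0.3], [-0.3, 3.0]])
theta = np.array([0.7, -1.2])

approx = dot_product_gradient_fd(
    lambda t: grad_quadratic(a1, t),
    lambda t: grad_quadratic(a2, t),
    theta,
)
# For quadratic losses the Hessians are a1 and a2, so the exact value is available
exact = a1 @ (a2 @ theta) + a2 @ (a1 @ theta)
```

Only gradient evaluations are needed, so memory stays linear in the number of parameters; for the quadratic toy losses the approximation matches the exact value up to rounding error.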

pdf bib
GooAQ: Open Question Answering with Diverse Answer Types
Daniel Khashabi | Amos Ng | Tushar Khot | Ashish Sabharwal | Hannaneh Hajishirzi | Chris Callison-Burch

While day-to-day questions come with a variety of answer types, the current question-answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GooAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google’s responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GooAQ and observe that: (a) in line with recent work, LMs’ strong performance on GooAQ’s short-answer questions heavily benefits from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as ‘how’ and ‘why’ questions) is less reliant on observing annotated data and mainly supported by their pre-training. We release GooAQ to facilitate further research on improving QA with diverse response types.

pdf bib
Attention Weights in Transformer NMT Fail Aligning Words Between Sequences but Largely Explain Model Predictions
Javier Ferrando | Marta R. Costa-jussà

This work proposes an extensive analysis of the Transformer architecture in the Neural Machine Translation (NMT) setting. Focusing on the encoder-decoder attention mechanism, we prove that attention weights systematically make alignment errors by relying mainly on uninformative tokens from the source sequence. However, we observe that NMT models assign attention to these tokens to regulate the contribution in the prediction of the two contexts, the source and the prefix of the target sequence. We provide evidence about the influence of wrong alignments on the model behavior, demonstrating that the encoder-decoder attention mechanism is well suited as an interpretability method for NMT. Finally, based on our analysis, we propose methods that largely reduce the word alignment error rate compared to standard induced alignments from attention weights.

pdf bib
BFClass: A Backdoor-free Text Classification Framework
Zichao Li | Dheeraj Mekala | Chengyu Dong | Jingbo Shang

Backdoor attacks introduce artificial vulnerabilities into the model by poisoning a subset of the training data via injecting triggers and modifying labels. Various trigger design strategies have been explored to attack text classifiers; however, defending against such attacks remains an open problem. In this work, we propose BFClass, a novel efficient backdoor-free training framework for text classification. The backbone of BFClass is a pre-trained discriminator that predicts whether each token in the corrupted input was replaced by a masked language model. To identify triggers, we utilize this discriminator to locate the most suspicious token from each training sample and then distill a concise set by considering their association strengths with particular labels. To recognize the poisoned subset, we examine the training samples with these identified triggers as the most suspicious token, and check if removing the trigger will change the poisoned model’s prediction. Extensive experiments demonstrate that BFClass can identify all the triggers, remove 95% poisoned training samples with very limited false alarms, and achieve almost the same performance as the models trained on the benign training data.

pdf bib
Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models
Taeuk Kim | Bowen Li | Sang-goo Lee

As it has been unveiled that pre-trained language models (PLMs) are to some extent capable of recognizing syntactic concepts in natural language, much effort has been made to develop a method for extracting complete (binary) parses from PLMs without training separate parsers. We improve upon this paradigm by proposing a novel chart-based method and an effective top-K ensemble technique. Moreover, we demonstrate that we can broaden the scope of application of the approach into multilingual settings. Specifically, we show that by applying our method on multilingual PLMs, it becomes possible to induce non-trivial parses for sentences from nine languages in an integrated and language-agnostic manner, attaining performance superior or comparable to that of unsupervised PCFGs. We also verify that our approach is robust to cross-lingual transfer. Finally, we provide analyses on the inner workings of our method. For instance, we discover universal attention heads which are consistently sensitive to syntactic information irrespective of the input language.

pdf bib
Hyperbolic Geometry is Not Necessary: Lightweight Euclidean-Based Models for Low-Dimensional Knowledge Graph Embeddings
Kai Wang | Yu Liu | Dan Lin | Michael Sheng

Recent knowledge graph embedding (KGE) models based on hyperbolic geometry have shown great potential in a low-dimensional embedding space. However, the necessity of hyperbolic space in KGE is still questionable, because the calculation based on hyperbolic geometry is much more complicated than Euclidean operations. In this paper, based on the state-of-the-art hyperbolic-based model RotH, we develop two lightweight Euclidean-based models, called RotL and Rot2L. The RotL model simplifies the hyperbolic operations while keeping the flexible normalization effect. Utilizing a novel two-layer stacked transformation and based on RotL, the Rot2L model obtains an improved representation capability, yet costs fewer parameters and calculations than RotH. The experiments on link prediction show that Rot2L achieves the state-of-the-art performance on two widely-used datasets in low-dimensional knowledge graph embeddings. Furthermore, RotL achieves similar performance as RotH but only requires half of the training time.

pdf bib
CascadeBERT: Accelerating Inference of Pre-trained Language Models via Calibrated Complete Models Cascade
Lei Li | Yankai Lin | Deli Chen | Shuhuai Ren | Peng Li | Jie Zhou | Xu Sun

Dynamic early exiting aims to accelerate the inference of pre-trained language models (PLMs) by emitting predictions in internal layers without passing through the entire model. In this paper, we empirically analyze the working mechanism of dynamic early exiting and find that it faces a performance bottleneck under high speed-up ratios. On one hand, the PLMs’ representations in shallow layers lack high-level semantic information and thus are not sufficient for accurate predictions. On the other hand, the exiting decisions made by internal classifiers are unreliable, leading to wrongly emitted early predictions. We instead propose a new framework for accelerating the inference of PLMs, CascadeBERT, which dynamically selects proper-sized and complete models in a cascading manner, providing comprehensive representations for predictions. We further devise a difficulty-aware objective, encouraging the model to output the class probability that reflects the real difficulty of each instance for a more reliable cascading mechanism. Experimental results show that CascadeBERT can achieve an overall 15% improvement under 4x speed-up compared with existing dynamic early exiting methods on six classification tasks, yielding more calibrated and accurate predictions.

pdf bib
Semi-supervised Relation Extraction via Incremental Meta Self-Training
Xuming Hu | Chenwei Zhang | Fukun Ma | Chenyao Liu | Lijie Wen | Philip S. Yu

To alleviate human efforts from obtaining large-scale annotations, Semi-Supervised Relation Extraction methods aim to leverage unlabeled data in addition to learning from limited samples. Existing self-training methods suffer from the gradual drift problem, where noisy pseudo labels on unlabeled data are incorporated during training. To alleviate the noise in pseudo labels, we propose a method called MetaSRE, where a Relation Label Generation Network generates accurate quality assessment on pseudo labels by (meta) learning from the successful and failed attempts on Relation Classification Network as an additional meta-objective. To reduce the influence of noisy pseudo labels, MetaSRE adopts a pseudo label selection and exploitation scheme which assesses pseudo label quality on unlabeled samples and only exploits high-quality pseudo labels in a self-training fashion to incrementally augment labeled samples for both robustness and accuracy. Experimental results on two public datasets demonstrate the effectiveness of the proposed approach.

pdf bib
Keyphrase Generation with Fine-Grained Evaluation-Guided Reinforcement Learning
Yichao Luo | Yige Xu | Jiacheng Ye | Xipeng Qiu | Qi Zhang

Aiming to generate a set of keyphrases, Keyphrase Generation (KG) is a classical task for capturing the central idea from a given document. Based on Seq2Seq models, the previous reinforcement learning framework on KG tasks utilizes the evaluation metrics to further improve the well-trained neural models. However, these KG evaluation metrics such as F1@5 and F1@M are only aware of the exact correctness of predictions on phrase-level and ignore the semantic similarities between similar predictions and targets, which inhibits the model from learning deep linguistic patterns. In response to this problem, we propose a new fine-grained evaluation metric to improve the RL framework, which considers different granularities: token-level F1 score, edit distance, duplication, and prediction quantities. On the whole, the new framework includes two reward functions: the fine-grained evaluation score and the vanilla F1 score. This framework helps the model identify some partial match phrases which can be further optimized as the exact match ones. Experiments on KG benchmarks show that our proposed training framework outperforms the previous RL training frameworks among all evaluation scores. In addition, our method can effectively ease the synonym problem and generate a higher quality prediction. The source code is available at https://github.com/xuyige/FGRL4KG.

pdf bib
Improving Knowledge Graph Embedding Using Affine Transformations of Entities Corresponding to Each Relation
Jinfa Yang | Yongjie Shi | Xin Tong | Robin Wang | Taiyan Chen | Xianghua Ying

To find a suitable embedding for a knowledge graph remains a big challenge nowadays. By using previous knowledge graph embedding methods, every entity in a knowledge graph is usually represented as a k-dimensional vector. As we know, an affine transformation can be expressed in the form of a matrix multiplication followed by a translation vector. In this paper, we firstly utilize a set of affine transformations related to each relation to operate on entity vectors, and then these transformed vectors are used for performing embedding with previous methods. The main advantage of using affine transformations is their good geometry properties with interpretability. Our experimental results demonstrate that the proposed intuitive design with affine transformations provides a statistically significant increase in performance with adding a few extra processing steps or adding a limited number of additional variables. Taking TransE as an example, we employ the scale transformation (the special case of an affine transformation), and only introduce k additional variables for each relation. Surprisingly, it even outperforms RotatE to some extent on various data sets. We also introduce affine transformations into RotatE, Distmult and ComplEx, respectively, and each one outperforms its original method.
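The scale transformation described for TransE can be pictured with the sketch below. The abstract does not spell out the scoring function, so this is an assumption-laden illustration rather than the authors' model: it assumes a single k-dimensional scale vector per relation (`scale_r`, a hypothetical name) applied element-wise to both head and tail before the usual TransE distance, with an L1 norm chosen for simplicity.

```python
import numpy as np

def transe_score(h, r, t):
    """Plain TransE plausibility: higher (closer to 0) means a better fit."""
    return -np.linalg.norm(h + r - t, ord=1)

def scaled_transe_score(h, r, t, scale_r):
    """TransE after a relation-specific element-wise scaling of both entities.

    scale_r holds k extra variables for this relation (assumed form).
    """
    return transe_score(scale_r * h, r, scale_r * t)

# Toy 2-dimensional embeddings (hypothetical values)
h = np.array([1.0, 2.0])
t = np.array([3.0, 1.0])
scale_r = np.array([0.5, 2.0])
r = scale_r * t - scale_r * h  # relation vector that fits the scaled entities exactly

plain = transe_score(h, r, t)
scaled = scaled_transe_score(h, r, t, scale_r)
```

In this toy setup the scaled score is 0 (a perfect fit) while the plain TransE score is -2, and setting `scale_r` to all ones recovers plain TransE, which shows how the extra k variables per relation add flexibility without changing the base model.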

pdf bib
Using Question Answering Rewards to Improve Abstractive Summarization
Chulaka Gunasekara | Guy Feigenblat | Benjamin Sznajder | Ranit Aharonov | Sachindra Joshi

Neural abstractive summarization models have drastically improved in recent years. However, the summaries generated by these models generally suffer from issues such as not capturing the critical facts in source documents and containing facts that are inconsistent with the source documents. In this work, we present a general framework to train abstractive summarization models to alleviate such issues. We first train a sequence-to-sequence model to summarize documents, and then further train this model in a Reinforcement Learning setting with question-answering based rewards. We evaluate the summaries generated by this framework using multiple automatic measures and human judgements. The experimental results show that the question-answering rewards can be used as a general framework to improve neural abstractive summarization. In particular, the results from human evaluations show that the summaries generated by our approach are preferred more than 30% of the time over the summaries generated by general abstractive summarization models.

pdf bib
Effect Generation Based on Causal Reasoning
Feiteng Mu | Wenjie Li | Zhipeng Xie

Causal reasoning aims to predict the future scenarios that may be caused by the observed actions. However, existing causal reasoning methods deal with causalities on the word level. In this paper, we propose a novel event-level causal reasoning method and demonstrate its use in the task of effect generation. In particular, we structuralize the observed cause-effect event pairs into an event causality network, which describes causality dependencies. Given an input cause sentence, a causal subgraph is retrieved from the event causality network and is encoded with the graph attention mechanism, in order to support better reasoning of the potential effects. The most probable effect event is then selected from the causal subgraph and is used as guidance to generate an effect sentence. Experiments show that our method generates more reasonable effect sentences than various well-designed competitors.

pdf bib
Distilling Word Meaning in Context from Pre-trained Language Models
Yuki Arase | Tomoyuki Kajiwara

In this study, we propose a self-supervised learning method that distils representations of word meaning in context from a pre-trained masked language model. Word representations are the basis for context-aware lexical semantics and unsupervised semantic textual similarity (STS) estimation. A previous study transforms contextualised representations employing static word embeddings to weaken excessive effects of contextual information. In contrast, the proposed method derives representations of word meaning in context while preserving useful context information intact. Specifically, our method learns to combine outputs of different hidden layers using self-attention through self-supervised learning with an automatically generated training corpus. To evaluate the performance of the proposed approach, we performed comparative experiments using a range of benchmark tasks. The results confirm that our representations exhibited a competitive performance compared to that of the state-of-the-art method transforming contextualised representations for the context-aware lexical semantic tasks and outperformed it for STS estimation.

pdf bib
Unseen Entity Handling in Complex Question Answering over Knowledge Base via Language Generation
Xin Huang | Jung-Jae Kim | Bowei Zou

Complex question answering over knowledge bases remains a challenging task because it involves reasoning over multiple pieces of information, including intermediate entities/relations and other constraints. Previous methods simplify the SPARQL query of a question into such forms as a list or a graph, missing such constraints as “filter” and “order_by”, and present models specialized for generating those simplified forms from a given question. We instead introduce a novel approach that directly generates an executable SPARQL query without simplification, addressing the issue of generating unseen entities. We adapt large-scale pre-trained encoder-decoder models and show that our method significantly outperforms the previous methods and also has higher interpretability and computational efficiency.

pdf bib
Bidirectional Hierarchical Attention Networks based on Document-level Context for Emotion Cause Extraction
Guimin Hu | Guangming Lu | Yi Zhao

Emotion cause extraction (ECE) aims to extract the causes behind a certain emotion in text. Several works on the ECE task have been published and have attracted considerable attention in recent years. However, these methods neglect two major issues: 1) they pay little attention to the effect of document-level context information on ECE, and 2) they do not sufficiently explore how to effectively use the annotated emotion clause. For the first issue, we propose a bidirectional hierarchical attention network (BHA) corresponding to the specified candidate cause clause to capture the document-level context in a structured and dynamic manner. For the second issue, we design an emotional filtering module (EF) for each layer of the graph attention network, which calculates a gate score based on the emotion clause to filter out irrelevant information. Combining the BHA and EF, the EF-BHA can dynamically aggregate contextual information from two directions and filter out irrelevant information. The experimental results demonstrate that EF-BHA achieves competitive performance on two public datasets in different languages (Chinese and English). Moreover, we quantify the effect of context on emotion cause extraction and provide a visualization of the interactions between candidate cause clauses and contexts.

pdf bib
Distantly Supervised Relation Extraction in Federated Settings
Dianbo Sui | Yubo Chen | Kang Liu | Jun Zhao

In relation extraction, distant supervision is widely used to automatically label a large-scale training dataset by aligning a knowledge base with unstructured text. Most existing studies in this field have assumed there is a great deal of centralized unstructured text. However, in practice, texts are usually distributed on different platforms and cannot be centralized due to privacy restrictions. Therefore, it is worthwhile to investigate distant supervision in the federated learning paradigm, which decouples the training of the model from the need for direct access to raw texts. However, overcoming the label noise of distant supervision becomes more difficult in federated settings, because texts containing the same entity pair are scattered across different platforms. In this paper, we propose a federated denoising framework to suppress label noise in federated settings. The key to this framework is a multiple instance learning based denoising method that is able to select reliable sentences via cross-platform collaboration. Various experiments on the New York Times dataset and the miRNA gene regulation relation dataset demonstrate the effectiveness of the proposed method.

pdf bib
Casting the Same Sentiment Classification Problem
Erik Körner | Ahmad Dawar Hakimi | Gerhard Heyer | Martin Potthast

We introduce and study a problem variant of sentiment analysis, namely the “same sentiment classification problem”, where, given a pair of texts, the task is to determine if they have the same sentiment, disregarding the actual sentiment polarity. Among other things, our goal is to enable a more topic-agnostic sentiment classification. We study the problem using the Yelp business review dataset, demonstrating how sentiment data needs to be prepared for this task, and then carry out sequence pair classification using the BERT language model. In a series of experiments, we achieve an accuracy above 83% for category subsets across topics, and 89% on average.

pdf bib
Detecting Compositionally Out-of-Distribution Examples in Semantic Parsing
Denis Lukovnikov | Sina Daubener | Asja Fischer

While neural networks are ubiquitous in state-of-the-art semantic parsers, it has been shown that most standard models suffer from dramatic performance losses when faced with compositionally out-of-distribution (OOD) data. Recently several methods have been proposed to improve compositional generalization in semantic parsing. In this work we instead focus on the problem of detecting compositionally OOD examples with neural semantic parsers, which, to the best of our knowledge, has not been investigated before. We investigate several strong yet simple methods for OOD detection based on predictive uncertainty. The experimental results demonstrate that these techniques perform well on the standard SCAN and CFQ datasets. Moreover, we show that OOD detection can be further improved by using a heterogeneous ensemble.

pdf bib
Saliency-based Multi-View Mixed Language Training for Zero-shot Cross-lingual Classification
Siyu Lai | Hui Huang | Dong Jing | Yufeng Chen | Jinan Xu | Jian Liu

Recent multilingual pre-trained models, like XLM-RoBERTa (XLM-R), have been shown to be effective in many cross-lingual tasks. However, there are still gaps between the contextualized representations of similar words in different languages. To solve this problem, we propose a novel framework named Multi-View Mixed Language Training (MVMLT), which leverages code-switched data with multi-view learning to fine-tune XLM-R. MVMLT uses gradient-based saliency to extract the keywords that are most relevant to downstream tasks and dynamically replaces them with the corresponding words in the target language. Furthermore, MVMLT utilizes multi-view learning to encourage contextualized embeddings to align into a more refined language-invariant space. Extensive experiments with four languages show that our model achieves state-of-the-art results on zero-shot cross-lingual sentiment classification and dialogue state tracking tasks, demonstrating the effectiveness of our proposed model.

pdf bib
Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society
Firoj Alam | Shaden Shaar | Fahim Dalvi | Hassan Sajjad | Alex Nikolov | Hamdy Mubarak | Giovanni Da San Martino | Ahmed Abdelali | Nadir Durrani | Kareem Darwish | Abdulaziz Al-Homaid | Wajdi Zaghouani | Tommaso Caselli | Gijs Danoe | Friso Stolk | Britt Bruntink | Preslav Nakov

With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that (i) focuses on COVID-19, (ii) combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and (iii) covers Arabic, Bulgarian, Dutch, and English. Finally, we show strong evaluation results using pretrained Transformers, thus confirming the practical utility of the dataset in monolingual vs. multilingual, and single task vs. multitask settings.

pdf bib
FANATIC: FAst Noise-Aware TopIc Clustering
Ari Silburt | Anja Subasic | Evan Thompson | Carmeline Dsilva | Tarec Fares

Extracting salient topics from a collection of documents can be a challenging task when a) the amount of data is large, b) the number of topics is not known a priori, and/or c) “topic noise” is present. We define “topic noise” as the collection of documents that are irrelevant to any coherent topic and should be filtered out. By design, most clustering algorithms (e.g. k-means, hierarchical clustering) assign all input documents to one of the available clusters, guaranteeing any topic noise to propagate into the result. To address these challenges, we present a novel algorithm, FANATIC, that efficiently distinguishes documents from genuine topics and those that are topic noise. We also introduce a new Reddit dataset to showcase FANATIC as it contains short, noisy data that is difficult to cluster using most clustering algorithms. We find that FANATIC clusters 500k Reddit titles (of which 20% are topic noise) in 2 minutes and achieves an AMI score of 0.59, in contrast with hdbscan (McInnes et al., 2017), a popular algorithm suited for this type of task, which requires over 7 hours and achieves an AMI of 0.03. Finally, we test FANATIC against a Twitter dataset and find again that it outperforms the other algorithms with an AMI score of 0.60. We make our code and data publicly available.

pdf bib
Stream-level Latency Evaluation for Simultaneous Machine Translation
Javier Iranzo-Sánchez | Jorge Civera Saiz | Alfons Juan

Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and multiple latency measures have been proposed for this purpose. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation, resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream-level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, which is successfully evaluated under streaming conditions for a reference IWSLT task.

pdf bib
TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
Kexin Wang | Nils Reimers | Iryna Gurevych

Learning sentence embeddings often requires a large amount of labeled data. However, for most tasks and domains, labeled data is seldom available and creating it is expensive. In this work, we present a new state-of-the-art unsupervised method based on pre-trained Transformers and Sequential Denoising Auto-Encoder (TSDAE) which outperforms previous approaches by up to 6.4 points. It can achieve up to 93.1% of the performance of in-domain supervised approaches. Further, we show that TSDAE is a strong domain adaptation and pre-training method for sentence embeddings, significantly outperforming other approaches like Masked Language Model. A crucial shortcoming of previous studies is the narrow evaluation: Most work mainly evaluates on the single task of Semantic Textual Similarity (STS), which does not require any domain knowledge. It is unclear if these proposed methods generalize to other domains and tasks. We fill this gap and evaluate TSDAE and other recent approaches on four different datasets from heterogeneous domains.

pdf bib
How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?
Chantal Amrhein | Rico Sennrich

Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained on subword- and character-level can translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the risk of adopting a strategy that inadvertently disadvantages certain languages.

pdf bib
Rethinking Why Intermediate-Task Fine-Tuning Works
Ting-Yun Chang | Chi-Jen Lu

Supplementary Training on Intermediate Labeled-data Tasks (STILT) is a widely applied technique, which first fine-tunes pretrained language models on an intermediate task before fine-tuning them on the target task of interest. While STILT is able to further improve the performance of pretrained language models, it is still unclear why and when it works. Previous research shows that intermediate tasks involving complex inference, such as commonsense reasoning, work especially well for RoBERTa-large. In this paper, we discover that the improvement from an intermediate task could be orthogonal to whether it contains reasoning or other complex skills: a simple real-fake discrimination task synthesized by GPT2 can benefit diverse target tasks. We conduct extensive experiments to study the impact of different factors on STILT. These findings suggest rethinking the role of intermediate fine-tuning in the STILT pipeline.

pdf bib
Learn Continually, Generalize Rapidly: Lifelong Knowledge Accumulation for Few-shot Learning
Xisen Jin | Bill Yuchen Lin | Mohammad Rostami | Xiang Ren

The ability to continuously expand knowledge over time and utilize it to rapidly generalize to new tasks is a key feature of human linguistic intelligence. Existing models that pursue rapid generalization to new tasks (e.g., few-shot learning methods), however, are mostly trained in a single shot on fixed datasets, unable to dynamically expand their knowledge; while continual learning algorithms are not specifically designed for rapid generalization. We present a new learning setup, Continual Learning of Few-Shot Learners (CLIF), to address challenges of both learning settings in a unified setup. CLIF assumes a model learns from a sequence of diverse NLP tasks arriving sequentially, accumulating knowledge for improved generalization to new tasks, while also retaining performance on the tasks learned earlier. We examine how the generalization ability is affected in the continual learning setup, evaluate a number of continual learning algorithms, and propose a novel regularized adapter generation approach. We find that catastrophic forgetting affects generalization ability to a lesser degree than performance on seen tasks; while continual learning algorithms can still bring considerable benefit to the generalization ability.

pdf bib
Efficient Test Time Adapter Ensembling for Low-resource Language Varieties
Xinyi Wang | Yulia Tsvetkov | Sebastian Ruder | Graham Neubig

Adapters are light-weight modules that allow parameter-efficient fine-tuning of pretrained models. Specialized language and task adapters have recently been proposed to facilitate cross-lingual transfer of multilingual pretrained models (Pfeiffer et al., 2020b). However, this approach requires training a separate language adapter for every language one wishes to support, which can be impractical for languages with limited data. An intuitive solution is to use a related language adapter for the new language variety, but we observe that this solution can lead to sub-optimal performance. In this paper, we aim to improve the robustness of language adapters to uncovered languages without training new adapters. We find that ensembling multiple existing language adapters makes the fine-tuned model significantly more robust to other language varieties not included in these adapters. Building upon this observation, we propose Entropy Minimized Ensemble of Adapters (EMEA), a method that optimizes the ensemble weights of the pretrained language adapters for each test sentence by minimizing the entropy of its predictions. Experiments on three diverse groups of language varieties show that our method leads to significant improvements on both named entity recognition and part-of-speech tagging across all languages.
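The idea of choosing ensemble weights per test input by minimizing prediction entropy can be illustrated with a toy example. The sketch below is an assumption-laden simplification rather than the EMEA procedure itself: it mixes the output distributions of two hypothetical adapters (instead of combining adapter outputs inside a network), parameterizes the weights through a softmax, and uses a finite-difference gradient. All function names and hyperparameters are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def mix(weights, dists):
    """Weighted mixture of several distributions over the same label set."""
    n_labels = len(dists[0])
    return [sum(w * d[j] for w, d in zip(weights, dists)) for j in range(n_labels)]


def minimize_entropy(dists, steps=10, lr=1.0, eps=1e-4):
    """Gradient descent on softmax-parameterized weights to lower mixture entropy."""
    theta = [0.0] * len(dists)  # start from uniform ensemble weights

    def objective(th):
        return entropy(mix(softmax(th), dists))

    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            up, down = theta[:], theta[:]
            up[i] += eps
            down[i] -= eps
            # Central finite difference approximates the partial derivative.
            grad.append((objective(up) - objective(down)) / (2 * eps))
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


if __name__ == "__main__":
    # Hypothetical per-adapter predictions for one test input.
    adapter_predictions = [[0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
    weights = minimize_entropy(adapter_predictions)
    print(weights)
    print(entropy(mix([0.5, 0.5], adapter_predictions)))
    print(entropy(mix(weights, adapter_predictions)))
```

In this toy setting the weights drift toward the adapter that yields the more confident prediction, so the mixture entropy drops relative to the uniform ensemble.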

pdf bib
An Analysis of Euclidean vs. Graph-Based Framing for Bilingual Lexicon Induction from Word Embedding Spaces
Kelly Marchisio | Youngser Park | Ali Saad-Eldin | Anton Alyakin | Kevin Duh | Carey Priebe | Philipp Koehn

Much recent work in bilingual lexicon induction (BLI) views word embeddings as vectors in Euclidean space. As such, BLI is typically solved by finding a linear transformation that maps embeddings to a common space. Alternatively, word embeddings may be understood as nodes in a weighted graph. This framing allows us to examine a node’s graph neighborhood without assuming a linear transform, and exploits new techniques from the graph matching optimization literature. These contrasting approaches have not been compared in BLI so far. In this work, we study the behavior of Euclidean versus graph-based approaches to BLI under differing data conditions and show that they complement each other when combined. We release our code at https://github.com/kellymarchisio/euc-v-graph-bli.

pdf bib
How to Select One Among All? An Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding
Tianda Li | Ahmad Rashid | Aref Jafari | Pranav Sharma | Ali Ghodsi | Mehdi Rezagholizadeh

Knowledge Distillation (KD) is a model compression algorithm that helps transfer the knowledge in a large neural network into a smaller one. Even though KD has shown promise on a wide range of Natural Language Processing (NLP) applications, little is understood about how one KD algorithm compares to another and whether these approaches can be complementary to each other. In this work, we evaluate various KD algorithms on in-domain, out-of-domain and adversarial testing. We propose a framework to assess adversarial robustness of multiple KD algorithms. Moreover, we introduce a new KD algorithm, Combined-KD, which takes advantage of two promising approaches (better training scheme and more efficient data augmentation). Our extensive experimental results show that Combined-KD achieves state-of-the-art results on the GLUE benchmark, out-of-domain generalization, and adversarial robustness compared to competitive methods.

pdf bib
Recommend for a Reason: Unlocking the Power of Unsupervised Aspect-Sentiment Co-Extraction
Zeyu Li | Wei Cheng | Reema Kshetramade | John Houser | Haifeng Chen | Wei Wang

Compliments and concerns in reviews are valuable for understanding users’ shopping interests and their opinions with respect to specific aspects of certain items. Existing review-based recommenders favor large and complex language encoders that can only learn latent and uninterpretable text representations. They lack explicit user-attention and item-property modeling, which could provide valuable information beyond the ability to recommend items. Therefore, we propose a tightly coupled two-stage approach, including an Aspect-Sentiment Pair Extractor (ASPE) and an Attention-Property-aware Rating Estimator (APRE). Unsupervised ASPE mines Aspect-Sentiment pairs (AS-pairs) and APRE predicts ratings using AS-pairs as concrete aspect-level evidence. Extensive experiments on seven real-world Amazon Review Datasets demonstrate that ASPE can effectively extract AS-pairs which enable APRE to deliver superior accuracy over the leading baselines.

pdf bib
Learning Hard Retrieval Decoder Attention for Transformers
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong

The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by jointly attending to information from different representation subspaces at different positions. In this paper, we present an approach to learning a hard retrieval attention where an attention head only attends to one token in the sentence rather than all tokens. The matrix multiplication between attention probabilities and the value sequence in the standard scaled dot-product attention can thus be replaced by a simple and efficient retrieval operation. We show that our hard retrieval attention mechanism is 1.43 times faster in decoding, while preserving translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
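The contrast between standard attention and a hard retrieval step can be sketched for a single head and a single query. This is only an illustration of the inference-time operation described above, not the authors' code, and it does not cover how such a selection is learned during training. The function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_attention(query, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Weighted sum over all value vectors (the matrix multiplication step).
    return [sum(p * v[j] for p, v in zip(probs, values)) for j in range(len(values[0]))]


def hard_retrieval_attention(query, keys, values):
    """Attend to exactly one token: pick the best-scoring key and index its value."""
    scores = [dot(query, k) for k in keys]
    best = max(range(len(scores)), key=scores.__getitem__)
    # A single lookup replaces the weighted sum over the whole value sequence.
    return values[best]


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
    values = [[1.0, 1.0], [5.0, 6.0], [9.0, 9.0]]
    print(soft_attention(query, keys, values))
    print(hard_retrieval_attention(query, keys, values))
```

The hard variant returns one value vector by index, which is why the multiplication between attention probabilities and the value sequence is no longer needed at decoding time.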

pdf bib
Recall and Learn: A Memory-augmented Solver for Math Word Problems
Shifeng Huang | Jiawei Wang | Jiao Xu | Da Cao | Ming Yang

In this article, we tackle the math word problem, namely, automatically answering a mathematical problem according to its textual description. Although recent methods have demonstrated promising results, most of them rely on a template-based generation scheme, which results in limited generalization capability. To this end, we propose a novel human-like analogical learning method in a recall and learn manner. Our proposed framework is composed of memory, representation, analogy, and reasoning modules, which are designed to make a new exercise by referring to the exercises learned in the past. Specifically, given a math word problem, the model first retrieves similar questions using a memory module and then encodes the unsolved problem and each retrieved question using a representation module. Moreover, to solve the problem by analogy, an analogy module and a reasoning module with a copy mechanism are proposed to model the interrelationship between the problem and each retrieved question. Extensive experiments on two well-known datasets show the superiority of our proposed algorithm over other state-of-the-art competitors in both overall performance comparison and micro-scope studies.

pdf bib
An Uncertainty-Aware Encoder for Aspect Detection
Thi-Nhung Nguyen | Kiem-Hieu Nguyen | Young-In Song | Tuan-Dung Cao

Aspect detection is a fundamental task in opinion mining. Previous works use seed words either as priors of topic models, as anchors to guide the learning of aspects, or as features of aspect classifiers. This paper presents a novel weakly-supervised method to exploit seed words for aspect detection based on an encoder architecture. The encoder maps segments and aspects into a low-dimensional embedding space. The goal is to make the similarity between segments and aspects in the embedding space approximate their ground-truth similarity generated from seed words. An objective function is proposed to capture the uncertainty of ground-truth similarity. Our method outperforms previous works on several benchmarks in various domains.

pdf bib
Improving Empathetic Response Generation by Recognizing Emotion Cause in Conversations
Jun Gao | Yuhan Liu | Haolin Deng | Wei Wang | Yu Cao | Jiachen Du | Ruifeng Xu

Current approaches to empathetic response generation focus on learning a model to predict an emotion label and generate a response based on this label and have achieved promising results. However, the emotion cause, an essential factor for empathetic responding, is ignored. The emotion cause is a stimulus for human emotions. Recognizing the emotion cause is helpful to better understand human emotions so as to generate more empathetic responses. To this end, we propose a novel framework that improves empathetic response generation by recognizing emotion cause in conversations. Specifically, an emotion reasoner is designed to predict a context emotion label and a sequence of emotion cause-oriented labels, which indicate whether the word is related to the emotion cause. Then we devise both hard and soft gated attention mechanisms to incorporate the emotion cause into response generation. Experiments show that incorporating emotion cause information improves the performance of the model on both emotion recognition and response generation.

pdf bib
Probing Across Time: What Does RoBERTa Know and When?
Zeyu Liu | Yizhong Wang | Jungo Kasai | Hannaneh Hajishirzi | Noah A. Smith

Models of language trained on very large corpora have been demonstrated useful for natural language processing. As fixed artifacts, they have become the object of intense study, with many researchers “probing” the extent to which they acquire and readily demonstrate linguistic abstractions, factual and commonsense knowledge, and reasoning abilities. Recent work applied several probes to intermediate training stages to observe the developmental process of a large-scale model (Chiang et al., 2020). Following this effort, we systematically answer a question: for various types of knowledge a language model learns, when during (pre)training are they acquired? Using RoBERTa as a case study, we find: linguistic knowledge is acquired fast, stably, and robustly across domains. Facts and commonsense are slower and more domain-sensitive. Reasoning abilities are, in general, not stably acquired. As new datasets, pretraining protocols, and probes emerge, we believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.

pdf bib
Knowledge-Guided Paraphrase Identification
Haoyu Wang | Fenglong Ma | Yaqing Wang | Jing Gao

Paraphrase identification (PI), a fundamental task in natural language processing, is to identify whether two sentences express the same or similar meaning, which is a binary classification problem. Recently, BERT-like pre-trained language models have been a popular choice for the frameworks of various PI models, but almost all existing methods consider general domain text. When these approaches are applied to a specific domain, existing models cannot make accurate predictions due to the lack of professional knowledge. In light of this challenge, we propose a novel framework, namely , which can leverage the external unstructured Wikipedia knowledge to accurately identify paraphrases. We propose to mine outline knowledge of concepts related to given sentences from Wikipedia via BM25 model. After retrieving related outline knowledge, makes predictions based on both the semantic information of two sentences and the outline knowledge. Besides, we propose a gating mechanism to aggregate the semantic information-based prediction and the knowledge-based prediction. Extensive experiments are conducted on two public datasets: PARADE (a computer science domain dataset) and clinicalSTS2019 (a biomedical domain dataset). The results show that the proposed outperforms state-of-the-art methods.

pdf bib
R2-D2: A Modular Baseline for Open-Domain Question Answering
Martin Fajcik | Martin Docekal | Karel Ondrej | Pavel Smrz

This work presents a novel four-stage open-domain QA pipeline R2-D2 (Rank twice, reaD twice). The pipeline is composed of a retriever, a passage reranker, an extractive reader, a generative reader, and a mechanism that aggregates the final prediction from all of the system’s components. We demonstrate its strength across three open-domain QA datasets: NaturalQuestions, TriviaQA and EfficientQA, surpassing state-of-the-art on the first two. Our analysis demonstrates that: (i) combining the extractive and generative readers yields absolute improvements of up to 5 exact match points and is at least twice as effective as the posterior averaging ensemble of the same models with different parameters, and (ii) the extractive reader with fewer parameters can match the performance of the generative reader on extractive QA datasets.

pdf bib
What Does Your Smile Mean? Jointly Detecting Multi-Modal Sarcasm and Sentiment Using Quantum Probability
Yaochen Liu | Yazhou Zhang | Qiuchi Li | Benyou Wang | Dawei Song

Sarcasm and sentiment embody intrinsic uncertainty of human cognition, making joint detection of multi-modal sarcasm and sentiment a challenging task. In view of the advantages of quantum probability (QP) in modeling such uncertainty, this paper explores the potential of QP as a mathematical framework and proposes a QP driven multi-task (QPM) learning framework. The QPM framework involves a complex-valued multi-modal representation encoder, a quantum-like fusion subnetwork and a quantum measurement mechanism. Each multi-modal (e.g., textual, visual) utterance is first encoded as a quantum superposition of a set of basis terms using a complex-valued representation. Then, the quantum-like fusion subnetwork leverages quantum state composition and quantum interference to model the contextual interaction between adjacent utterances and the correlations across modalities respectively. Finally, quantum incompatible measurements are performed on the multi-modal representation of each utterance to yield the probabilistic outcomes of sarcasm and sentiment recognition. The experimental results show that our model achieves a state-of-the-art performance.

pdf bib
Discovering Representation Sprachbund For Multilingual Pre-Training
Yimin Fan | Yaobo Liang | Alexandre Muzio | Hany Hassan | Houqiang Li | Ming Zhou | Nan Duan

Multilingual pre-trained models have demonstrated their effectiveness in many multilingual NLP tasks and enabled zero-shot or few-shot transfer from high-resource languages to low-resource ones. However, due to significant typological differences and contradictions between some languages, such models usually perform poorly on many languages and cross-lingual settings, which shows the difficulty of learning a single model that handles a large number of diverse languages well at the same time. To alleviate this issue, we present a new multilingual pre-training pipeline. We propose to generate language representations from a multilingual pre-trained model and conduct linguistic analysis to show that language representation similarity reflects linguistic similarity from multiple perspectives, including language family, geographical sprachbund, lexicostatistics, and syntax. Then we cluster all the target languages into multiple groups and name each group a representation sprachbund. Thus, languages in the same representation sprachbund are supposed to boost each other in both pre-training and fine-tuning as they share rich linguistic similarity. We pre-train one multilingual model for each representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
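
The grouping step described in this abstract can be pictured with a small sketch. The abstract does not say which clustering algorithm is used, so the greedy threshold rule below is only an assumption. The function names, the 0.9 threshold, and the three-dimensional language vectors are likewise invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def group_languages(reps, threshold=0.9):
    """Greedy single pass: join the first group whose seed language is similar enough."""
    groups = []  # each group is a list of language codes; its first member is the seed
    for lang, vec in reps.items():
        for group in groups:
            if cosine(vec, reps[group[0]]) >= threshold:
                group.append(lang)
                break
        else:
            groups.append([lang])  # no similar group found, start a new one
    return groups


# Hypothetical language representations (real ones would come from a pre-trained model).
reps = {
    "es": [0.90, 0.10, 0.00],
    "pt": [0.88, 0.12, 0.02],
    "ja": [0.10, 0.90, 0.20],
    "ko": [0.12, 0.85, 0.25],
}
groups = group_languages(reps)
```

With these toy vectors the two Romance languages fall into one group and Japanese and Korean into another. Each resulting group would then get its own pre-trained model.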

pdf bib
Plan-then-Generate: Controlled Data-to-Text Generation via Planning
Yixuan Su | David Vandyke | Sihui Wang | Yimai Fang | Nigel Collier

Recent developments in neural networks have led to advances in data-to-text generation. However, the inability of neural models to control the structure of the generated output can be limiting in certain real-world applications. In this study, we propose a novel Plan-then-Generate (PlanGen) framework to improve the controllability of neural data-to-text models. Extensive experiments and analyses are conducted on two benchmark datasets, ToTTo and WebNLG. The results show that our model is able to control both the intra-sentence and inter-sentence structure of the generated output. Furthermore, empirical comparisons against previous state-of-the-art methods show that our model improves the generation quality as well as the output diversity as judged by human and automatic evaluations.

pdf bib
Few-Shot Table-to-Text Generation with Prototype Memory
Yixuan Su | Zaiqiao Meng | Simon Baker | Nigel Collier

Neural table-to-text generation models have achieved remarkable progress on an array of tasks. However, due to the data-hungry nature of neural models, their performance strongly relies on large-scale training examples, limiting their applicability in real-world applications. To address this, we propose a new framework: Prototype-to-Generate (P2G), for table-to-text generation under the few-shot scenario. The proposed framework utilizes the retrieved prototypes, which are jointly selected by an IR system and a novel prototype selector, to help the model bridge the structural gap between tables and texts. Experimental results on three benchmark datasets with three state-of-the-art models demonstrate that the proposed framework significantly improves the model performance across various evaluation metrics.

pdf bib
Leveraging Word-Formation Knowledge for Chinese Word Sense Disambiguation
Hua Zheng | Lei Li | Damai Dai | Deli Chen | Tianyu Liu | Xu Sun | Yang Liu

In parataxis languages like Chinese, word meanings are constructed using specific word-formations, which can help to disambiguate word senses. However, such knowledge is rarely explored in previous word sense disambiguation (WSD) methods. In this paper, we propose to leverage word-formation knowledge to enhance Chinese WSD. We first construct a large-scale Chinese lexical sample WSD dataset with word-formations. Then, we propose a model FormBERT to explicitly incorporate word-formations into sense disambiguation. To further enhance generalizability, we design a word-formation predictor module in case word-formation annotations are unavailable. Experimental results show that our method brings substantial performance improvement over strong baselines.

pdf bib
Exploiting Curriculum Learning in Unsupervised Neural Machine Translation
Jinliang Lu | Jiajun Zhang

Back-translation (BT) has become one of the de facto components in unsupervised neural machine translation (UNMT), and it explicitly gives UNMT its translation ability. However, all the pseudo bi-texts generated by BT are treated equally as clean data during optimization without considering their varying quality, leading to slow convergence and limited translation performance. To address this problem, we propose a curriculum learning method to gradually utilize pseudo bi-texts based on their quality at multiple granularities. Specifically, we first apply crosslingual word embeddings to calculate the potential translation difficulty (quality) of the monolingual sentences. Then, the sentences are fed into UNMT from easy to hard, batch by batch. Furthermore, considering that the quality of sentences/tokens in a particular batch is also diverse, we adopt the model itself to calculate fine-grained quality scores, which serve as learning factors to balance the contributions of different parts when computing the loss and encourage the UNMT model to focus on pseudo data with higher quality. Experimental results on WMT 14 En-Fr, WMT 14 En-De, WMT 16 En-Ro, and LDC En-Zh translation tasks demonstrate that the proposed method achieves consistent improvements with faster convergence speed.
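
The two ideas in this abstract, easy-to-hard batching and quality-weighted loss, can be sketched as follows. This is an illustrative sketch only. The function names are hypothetical, sentence length stands in for the embedding-based difficulty measure, and the quality weights are invented numbers rather than model outputs.

```python
def curriculum_batches(sentences, difficulty, batch_size):
    """Yield batches ordered from the lowest to the highest difficulty score."""
    ordered = sorted(sentences, key=difficulty)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


def weighted_loss(token_losses, quality):
    """Quality-weighted mean of per-token losses (weights are normalized by their sum)."""
    total = sum(quality)
    return sum(loss * q for loss, q in zip(token_losses, quality)) / total


# Assumption: word count as a crude difficulty proxy for the toy data below.
corpus = ["a b c d", "a", "a b", "a b c"]
batches = list(curriculum_batches(corpus, lambda s: len(s.split()), batch_size=2))

# Assumption: two tokens with losses 1.0 and 3.0 and quality scores 0.75 and 0.25.
loss = weighted_loss([1.0, 3.0], [0.75, 0.25])
```

The higher-quality token dominates the weighted loss, so the noisy token with loss 3.0 contributes less than it would under a plain average.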

pdf bib
Robust Fragment-Based Framework for Cross-lingual Sentence Retrieval
Nattapol Trijakwanich | Peerat Limkonchotiwat | Raheem Sarwar | Wannaphong Phatthiyaphaibun | Ekapol Chuangsuwanich | Sarana Nutanong

Cross-lingual Sentence Retrieval (CLSR) aims at retrieving parallel sentence pairs that are translations of each other from a multilingual set of comparable documents. The retrieved parallel sentence pairs can be used in other downstream NLP tasks such as machine translation and cross-lingual word sense disambiguation. We propose a CLSR framework called Robust Fragment-level Representation (RFR) to address Out-of-Domain (OOD) CLSR problems. In particular, we improve the sentence retrieval robustness by representing each sentence as a collection of fragments. In this way, we change the retrieval granularity from the sentence to the fragment level. We performed CLSR experiments based on three OOD datasets, four language pairs, and three well-known base sentence encoders: m-USE, LASER, and LaBSE. Experimental results show that RFR significantly improves the base encoders’ performance in more than 85% of the cases.

pdf bib
Towards Improving Adversarial Training of NLP Models
Jin Yong Yoo | Yanjun Qi

Adversarial training, a method for learning robust deep neural networks, constructs adversarial examples during training. However, recent methods for generating NLP adversarial examples involve combinatorial search and expensive sentence encoders for constraining the generated instances. As a result, it remains challenging to use vanilla adversarial training to improve NLP models’ performance, and the benefits are mainly uninvestigated. This paper proposes a simple and improved vanilla adversarial training process for NLP models, which we name Attacking to Training (A2T). The core part of A2T is a new and cheaper word substitution attack optimized for vanilla adversarial training. We use A2T to train BERT and RoBERTa models on IMDB, Rotten Tomatoes, Yelp, and SNLI datasets. Our results empirically show that it is possible to train robust NLP models using a much cheaper adversary. We demonstrate that vanilla adversarial training with A2T can improve an NLP model’s robustness to the attack it was originally trained with and also defend the model against other types of word substitution attacks. Furthermore, we show that A2T can improve NLP models’ standard accuracy, cross-domain generalization, and interpretability.

pdf bib
To Protect and To Serve? Analyzing Entity-Centric Framing of Police Violence
Caleb Ziems | Diyi Yang

Framing has significant but subtle effects on public opinion and policy. We propose an NLP framework to measure entity-centric frames. We use it to understand media coverage on police violence in the United States in a new Police Violence Frames Corpus of 82k news articles spanning 7k police killings. Our work uncovers more than a dozen framing devices and reveals significant differences in the way liberal and conservative news sources frame both the issue of police violence and the entities involved. Conservative sources emphasize when the victim is armed or attacking an officer and are more likely to mention the victim’s criminal record. Liberal sources focus more on the underlying systemic injustice, highlighting the victim’s race and that they were unarmed. We discover temporary spikes in these injustice frames near high-profile shooting events, and finally, we show protest volume correlates with and precedes media framing decisions.

pdf bib
Calibrate your listeners! Robust communication-based training for pragmatic speakers
Rose Wang | Julia White | Jesse Mu | Noah Goodman

To be good conversational partners, natural language processing (NLP) systems should be trained to produce contextually useful utterances. Prior work has investigated training NLP systems with communication-based objectives, where a neural listener stands in as a communication partner. However, these systems commonly suffer from semantic drift where the learned language diverges radically from natural language. We propose a method that uses a population of neural listeners to regularize speaker training. We first show that language drift originates from the poor uncertainty calibration of a neural listener, which makes high-certainty predictions on novel sentences. We explore ensemble- and dropout-based populations of listeners and find that the former results in better uncertainty quantification. We evaluate both population-based objectives on reference games, and show that the ensemble method with better calibration enables the speaker to generate pragmatic utterances while scaling to a large vocabulary and generalizing to new games and listeners.
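
A small sketch can show why a population of listeners may give better uncertainty estimates than a single overconfident one. It is illustrative only. The helper names and the three listener distributions are assumptions, and real listeners would be neural models rather than fixed probability lists.

```python
import math


def ensemble_predict(listener_probs):
    """Average the referent distributions predicted by each listener."""
    n = len(listener_probs)
    k = len(listener_probs[0])
    return [sum(p[i] for p in listener_probs) / n for i in range(k)]


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


# Three hypothetical listeners, each confident on a novel utterance, but disagreeing.
listeners = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.01, 0.01, 0.98],
]
avg = ensemble_predict(listeners)
single_max_entropy = max(entropy(p) for p in listeners)
ensemble_entropy = entropy(avg)
```

Each listener alone reports near-certainty, while the averaged distribution is close to uniform. The averaged entropy is therefore much higher, which gives a speaker trained against the ensemble less reward for drifting toward utterances that only one listener happens to accept.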

pdf bib
When Retriever-Reader Meets Scenario-Based Multiple-Choice Questions
ZiXian Huang | Ao Wu | Yulin Shen | Gong Cheng | Yuzhong Qu

Scenario-based question answering (SQA) requires retrieving and reading paragraphs from a large corpus to answer a question which is contextualized by a long scenario description. Since a scenario contains both keyphrases for retrieval and much noise, retrieval for SQA is extremely difficult. Moreover, it can hardly be supervised due to the lack of relevance labels of paragraphs for SQA. To meet the challenge, in this paper we propose a joint retriever-reader model called JEEVES where the retriever is implicitly supervised only using QA labels via a novel word weighting mechanism. JEEVES significantly outperforms a variety of strong baselines on multiple-choice questions in three SQA datasets.

pdf bib
Structured abbreviation expansion in context
Kyle Gorman | Christo Kirov | Brian Roark | Richard Sproat

Ad hoc abbreviations are commonly found in informal communication channels that favor shorter messages. We consider the task of reversing these abbreviations in context to recover normalized, expanded versions of abbreviated messages. The problem is related to, but distinct from, spelling correction, as ad hoc abbreviations are intentional and can involve more substantial differences from the original words. Ad hoc abbreviations are also productively generated on-the-fly, so they cannot be resolved solely by dictionary lookup. We generate a large, open-source data set of ad hoc abbreviations. This data is used to study abbreviation strategies and to develop two strong baselines for abbreviation expansion.

pdf bib
Task-adaptive Pre-training and Self-training are Complementary for Natural Language Understanding
Shiyang Li | Semih Yavuz | Wenhu Chen | Xifeng Yan

Task-adaptive pre-training (TAPT) and Self-training (ST) have emerged as the major semi-supervised approaches to improve natural language understanding (NLU) tasks with massive amounts of unlabeled data. However, it is unclear whether they learn similar representations or whether they can be effectively combined. In this paper, we show that TAPT and ST can be complementary under a simple protocol that follows the TAPT -> Finetuning -> Self-training (TFS) process. Experimental results show that the TFS protocol can effectively utilize unlabeled data to achieve strong combined gains consistently across six datasets covering sentiment classification, paraphrase identification, natural language inference, named entity recognition and dialogue slot classification. We investigate various semi-supervised settings and consistently show that gains from TAPT and ST can be strongly additive by following the TFS procedure. We hope that TFS could serve as an important semi-supervised baseline for future NLP studies.
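
The last two stages of the TAPT -> Finetuning -> Self-training sequence can be illustrated with a toy sketch. It makes strong simplifying assumptions: a one-dimensional nearest-centroid classifier stands in for the language model, the TAPT stage (continued pre-training on unlabeled task text) is omitted, and every name and threshold is hypothetical.

```python
def fit_centroids(points, labels):
    """Toy 'model': one mean value per class."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return (label, margin); margin is the distance gap between the two nearest classes."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    margin = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
    return ranked[0], margin


def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0):
    """Fine-tune a teacher, pseudo-label confident points, then fit a student on both."""
    teacher = fit_centroids(labeled_x, labeled_y)   # stands in for fine-tuning
    xs, ys = list(labeled_x), list(labeled_y)
    for x in unlabeled_x:
        label, margin = predict(teacher, x)
        if margin >= min_margin:                    # keep confident pseudo-labels only
            xs.append(x)
            ys.append(label)
    return fit_centroids(xs, ys)                    # student sees gold and pseudo labels


student = self_train([0.0, 10.0], ["neg", "pos"], [1.0, 9.0, 5.0])
```

The ambiguous point at 5.0 is equidistant from both classes and is left out, while the two confident points shift the student's centroids.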

pdf bib
CNNBiF: CNN-based Bigram Features for Named Entity Recognition
Chul Sung | Vaibhava Goel | Etienne Marcheret | Steven Rennie | David Nahamoo

Transformer models fine-tuned with a sequence labeling objective have become the dominant choice for named entity recognition tasks. However, a self-attention mechanism with unconstrained length can fail to fully capture local dependencies, particularly when training data is limited. In this paper, we propose a novel joint training objective which better captures the semantics of words corresponding to the same entity. By augmenting the training objective with a group-consistency loss component, we enhance our ability to capture local dependencies while still enjoying the advantages of the unconstrained self-attention mechanism. On the CoNLL2003 dataset, our method achieves a test F1 of 93.98 with a single transformer model. More importantly, our fine-tuned CoNLL2003 model displays significant gains in generalization to out-of-domain datasets: on the OntoNotes subset we achieve an F1 of 72.67, which is 0.49 points absolute better than the baseline, and on the WNUT16 set an F1 of 68.22, which is a gain of 0.48 points. Furthermore, on the WNUT17 dataset we achieve an F1 of 55.85, yielding a 2.92 point absolute improvement.

pdf bib
Compositional Generalization via Semantic Tagging
Hao Zheng | Mirella Lapata

Although neural sequence-to-sequence models have been successfully applied to semantic parsing, they fail at compositional generalization, i.e., they are unable to systematically generalize to unseen compositions of seen components. Motivated by traditional semantic parsing where compositionality is explicitly accounted for by symbolic grammars, we propose a new decoding framework that preserves the expressivity and generality of sequence-to-sequence models while featuring lexicon-style alignments and disentangled information processing. Specifically, we decompose decoding into two phases where an input utterance is first tagged with semantic symbols representing the meaning of individual words, and then a sequence-to-sequence model is used to predict the final meaning representation conditioning on the utterance and the predicted tag sequence. Experimental results on three semantic parsing datasets show that the proposed approach consistently improves compositional generalization across model architectures, domains, and semantic formalisms.

pdf bib
Towards Document-Level Paraphrase Generation with Sentence Rewriting and Reordering
Zhe Lin | Yitao Cai | Xiaojun Wan

Paraphrase generation is an important task in natural language processing. Previous works focus on sentence-level paraphrase generation, while ignoring document-level paraphrase generation, which is a more challenging and valuable task. In this paper, we explore the task of document-level paraphrase generation for the first time and focus on the inter-sentence diversity by considering sentence rewriting and reordering. We propose CoRPG (Coherence Relationship guided Paraphrase Generation), which leverages graph GRU to encode the coherence relationship graph and get the coherence-aware representation for each sentence, which can be used for re-arranging the multiple (possibly modified) input sentences. We create a pseudo document-level paraphrase dataset for training CoRPG. Automatic evaluation results show CoRPG outperforms several strong baseline models on the BERTScore and diversity scores. Human evaluation also shows our model can generate document paraphrase with more diversity and semantic preservation.

pdf bib
Exploring Decomposition for Table-based Fact Verification
Xiaoyu Yang | Xiaodan Zhu

Fact verification based on structured data is challenging as it requires models to understand both natural language and symbolic operations performed over tables. Although pre-trained language models have demonstrated a strong capability in verifying simple statements, they struggle with complex statements that involve multiple operations. In this paper, we improve fact verification by decomposing complex statements into simpler subproblems. Leveraging the programs synthesized by a weakly supervised semantic parser, we propose a program-guided approach to constructing a pseudo dataset for decomposition model training. The subproblems, together with their predicted answers, serve as the intermediate evidence to enhance our fact verification model. Experiments show that our proposed approach achieves the new state-of-the-art performance, an 82.7% accuracy, on the TabFact benchmark.

pdf bib
Diversity and Consistency: Exploring Visual Question-Answer Pair Generation
Sen Yang | Qingyu Zhou | Dawei Feng | Yang Liu | Chao Li | Yunbo Cao | Dongsheng Li

Although it shows promising value for downstream applications, generating questions and answers together is under-explored. In this paper, we introduce a novel task that targets question-answer pair generation from visual images. It requires not only generating diverse question-answer pairs but also keeping them consistent. We study different generation paradigms for this task and propose three models: the pipeline model, the joint model, and the sequential model. We integrate variational inference into these models to achieve diversity and consistency. We also propose region representation scaling and attention alignment to improve the consistency further. We finally devise an evaluator as a quantitative metric for consistency. We validate our approach on two benchmarks, VQA2.0 and Visual-7w, by automatically and manually evaluating diversity and consistency. Experimental results show the effectiveness of our models: they can generate diverse or consistent pairs. Moreover, this task can be used to improve visual question generation and visual question answering.

pdf bib
Entity-level Cross-modal Learning Improves Multi-modal Machine Translation
Xin Huang | Jiajun Zhang | Chengqing Zong

Multi-modal machine translation (MMT) aims at improving translation performance by incorporating visual information. Most studies leverage the visual information by integrating the global image features as auxiliary input or by attending to relevant local regions of the image during decoding. However, this kind of usage of visual information makes it difficult to figure out how the visual modality helps and why it works. Inspired by the findings of (CITATION) that entities are most informative in the image, we propose an explicit entity-level cross-modal learning approach that aims to augment the entity representation. Specifically, the approach is framed as a reconstruction task that reconstructs the original textual input from multi-modal input in which entities are replaced with visual features. Then, a multi-task framework is employed to combine the translation task and the reconstruction task to make full use of cross-modal entity representation learning. Extensive experiments demonstrate that our approach can achieve comparable or even better performance than state-of-the-art models. Furthermore, our in-depth analysis shows how visual information improves translation.

pdf bib
Learning to Ground Visual Objects for Visual Dialog
Feilong Chen | Xiuyi Chen | Can Xu | Daxin Jiang

Visual dialog is challenging since it needs to answer a series of coherent questions based on understanding the visual environment. How to ground related visual objects is one of the key problems. Previous studies utilize the question and history to attend to the image and achieve satisfactory performance, but these methods are not sufficient to locate related visual objects without any guidance. The inappropriate grounding of visual objects limits the performance of visual dialog models. In this paper, we propose a novel approach to Learn to Ground visual objects for visual dialog, which employs a novel visual objects grounding mechanism where both prior and posterior distributions over visual objects are used to facilitate visual objects grounding. Specifically, a posterior distribution over visual objects is inferred from both context (history and questions) and answers, and it ensures the appropriate grounding of visual objects during the training process. Meanwhile, a prior distribution, which is inferred from context only, is used to approximate the posterior distribution so that appropriate visual objects can be grounded even without answers during the inference process. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that our approach improves over previous strong models in both generative and discriminative settings by a significant margin.
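
The abstract does not state which objective makes the prior approximate the posterior. A common choice for this kind of setup is to minimize a KL divergence between the two distributions, and the sketch below assumes that choice. The object scores are made-up numbers, and the direction KL(posterior || prior) is likewise an assumption.

```python
import math


def softmax(scores):
    """Turn raw object scores into a categorical distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(q, p):
    """KL(q || p) for two categorical distributions over the same visual objects."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


# Hypothetical scores over three detected objects.
posterior = softmax([2.0, 0.5, 0.1])  # uses context and the answer (training only)
prior = softmax([1.0, 0.8, 0.3])      # uses context only (available at inference)
loss = kl_divergence(posterior, prior)
```

Driving `loss` toward zero pulls the context-only prior toward the answer-aware posterior, so that grounding at inference time, when no answer is available, resembles grounding during training.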

pdf bib
KERS: A Knowledge-Enhanced Framework for Recommendation Dialog Systems with Multiple Subgoals
Jun Zhang | Yan Yang | Chencai Chen | Liang He | Zhou Yu

Recommendation dialogs require the system to build a social bond with users to gain trust and develop affinity in order to increase the chance of a successful recommendation. It is beneficial to divide such conversations into multiple subgoals (such as social chat, question answering, recommendation, etc.), so that the system can retrieve appropriate knowledge with better accuracy under different subgoals. In this paper, we propose a unified framework for common knowledge-based multi-subgoal dialog: knowledge-enhanced multi-subgoal driven recommender system (KERS). We first predict a sequence of subgoals and use them to guide the dialog model to select knowledge from a subset of an existing knowledge graph. We then propose three new mechanisms to filter noisy knowledge and to enhance the inclusion of cleaned knowledge in the dialog response generation process. Experiments show that our method obtains state-of-the-art results on the DuRecDial dataset in both automatic and human evaluation.

pdf bib
Less Is More: Domain Adaptation with Lottery Ticket for Reading Comprehension
Haichao Zhu | Zekun Wang | Heng Zhang | Ming Liu | Sendong Zhao | Bing Qin

In this paper, we propose a simple few-shot domain adaptation paradigm for reading comprehension. We first identify the lottery subnetwork structure within the Transformer-based source domain model via gradual magnitude pruning. Then, we only fine-tune the lottery subnetwork, a small fraction of the whole parameters, on the annotated target domain data for adaptation. To obtain more adaptable subnetworks, we introduce self-attention attribution to weigh parameters, beyond simply pruning the smallest magnitude parameters, which can be seen as softly combining structured pruning and unstructured magnitude pruning. Experimental results show that our method outperforms full model fine-tuning adaptation on four out of five domains when only a small amount of annotated data is available for adaptation. Moreover, introducing self-attention attribution reserves more parameters for important attention heads in the lottery subnetwork and improves the target domain model performance. Our further analyses reveal that, besides exploiting fewer parameters, the choice of subnetworks is critical to the effectiveness.
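
Magnitude pruning followed by subnetwork-only fine-tuning can be sketched in a few lines. This is a simplified, one-shot sketch under stated assumptions. The paper prunes gradually and also uses self-attention attribution, neither of which is modeled here. The function names are hypothetical, and the abstract does not say whether weights outside the subnetwork are frozen or zeroed, so the sketch simply freezes them.

```python
def magnitude_mask(weights, sparsity):
    """Return a 0/1 mask that drops the `sparsity` fraction of smallest-|w| entries."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0 if i in pruned else 1 for i in range(len(weights))]


def masked_update(weights, grads, mask, lr=0.1):
    """One SGD step that only changes weights kept by the mask."""
    return [w - lr * g * m for w, g, m in zip(weights, grads, mask)]


# Toy flat parameter vector and gradient (assumed values).
weights = [0.5, -0.05, 0.3, 0.01]
mask = magnitude_mask(weights, sparsity=0.5)
updated = masked_update(weights, [1.0, 1.0, 1.0, 1.0], mask)
```

Only the two largest-magnitude weights receive the update. The other two keep their source-domain values.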

pdf bib
Effectiveness of Pre-training for Few-shot Intent Classification
Haode Zhang | Yuwei Zhang | Li-Ming Zhan | Jiaxin Chen | Guangyuan Shi | Xiao-Ming Wu | Albert Y.S. Lam

This paper investigates the effectiveness of pre-training for few-shot intent classification. While existing paradigms commonly further pre-train language models such as BERT on a vast unlabeled corpus, we find it highly effective and efficient to simply fine-tune BERT with a small set of labeled utterances from public datasets. Specifically, fine-tuning BERT with roughly 1,000 labeled examples yields a pre-trained model, IntentBERT, which can easily surpass the performance of existing pre-trained models for few-shot intent classification on novel domains with very different semantics. The high effectiveness of IntentBERT confirms the feasibility and practicality of few-shot intent detection, and its high generalization ability across different domains suggests that intent classification tasks may share a similar underlying structure, which can be efficiently learned from a small set of labeled data. The source code can be found at https://github.com/hdzhang-code/IntentBERT.

pdf bib
Improving Abstractive Dialogue Summarization with Hierarchical Pretraining and Topic Segment
MengNan Qi | Hao Liu | YuZhuo Fu | Ting Liu

With the increasing abundance of meeting transcripts, meeting summarization has attracted more and more attention from researchers. The unsupervised pre-training method based on the transformer structure combined with fine-tuning on downstream tasks has achieved great success in the field of text summarization. However, the semantic structure and style of meeting transcripts are quite different from those of articles. In this work, we propose a hierarchical transformer encoder-decoder network with multi-task pre-training. Specifically, we mask key sentences at the word-level encoder and generate them at the decoder. Besides, we randomly mask some of the role alignments in the input text and force the model to recover the original role tags to complete the alignments. In addition, we introduce a topic segmentation mechanism to further improve the quality of the generated summaries. The experimental results show that our model is superior to the previous methods on the meeting summarization datasets AMI and ICSI.

pdf bib
Learning to Answer Psychological Questionnaire for Personality Detection
Feifan Yang | Tao Yang | Xiaojun Quan | Qinliang Su

Existing text-based personality detection research mostly relies on data-driven approaches to implicitly capture personality cues in online posts, lacking the guidance of psychological knowledge. A psychological questionnaire, which contains a series of dedicated questions highly related to personality traits, plays a critical role in self-report personality assessment. We argue that the posts created by a user contain critical content that could help answer the questions in a questionnaire, resulting in an assessment of the user's personality by linking the texts and the questionnaire. To this end, we propose a new model named Psychological Questionnaire enhanced Network (PQ-Net) to guide personality detection by tracking critical information in texts with a questionnaire. Specifically, PQ-Net contains two streams: a context stream to encode each piece of text into a contextual text representation, and a questionnaire stream to capture relevant information in the contextual text representation to generate potential answer representations for a questionnaire. The potential answer representations are used to enhance the contextual text representation and to benefit personality prediction. Experimental results on two datasets demonstrate the superiority of PQ-Net in capturing useful cues from the posts for personality detection.

pdf bib
Exploiting Reasoning Chains for Multi-hop Science Question Answering
Weiwen Xu | Yang Deng | Huihui Zhang | Deng Cai | Wai Lam

We propose a novel Chain Guided Retriever-reader (CGR) framework to model the reasoning chain for multi-hop Science Question Answering. Our framework is capable of performing explainable reasoning without the need of any corpus-specific annotations, such as the ground-truth reasoning chain, or human-annotated entity mentions. Specifically, we first generate reasoning chains from a semantic graph constructed by Abstract Meaning Representation of retrieved evidence facts. A Chain-aware loss, concerning both local and global chain information, is also designed to enable the generated chains to serve as distant supervision signals for training the retriever, where reinforcement learning is also adopted to maximize the utility of the reasoning chains. Our framework allows the retriever to capture step-by-step clues of the entire reasoning process, which is not only shown to be effective on two challenging multi-hop Science QA tasks, namely OpenBookQA and ARC-Challenge, but also favors explainability.

pdf bib
Winnowing Knowledge for Multi-choice Question Answering
Yeqiu Li | Bowei Zou | Zhifeng Li | Ai Ti Aw | Yu Hong | Qiaoming Zhu

We tackle multi-choice question answering. Acquiring commonsense knowledge related to the question and options facilitates the recognition of the correct answer. However, current reasoning models suffer from the noise in the retrieved knowledge. In this paper, we propose a novel encoding method which is able to conduct interception and soft filtering. This contributes to the harvesting and absorption of representative information with less interference from noise. We experiment on CommonsenseQA. Experimental results illustrate that our method yields substantial and consistent improvements compared to the strong BERT-, RoBERTa- and ALBERT-based baselines.

pdf bib
Neural Media Bias Detection Using Distant Supervision With BABE - Bias Annotations By Experts
Timo Spinde | Manuel Plank | Jan-David Krieger | Terry Ruas | Bela Gipp | Akiko Aizawa

Media coverage has a substantial effect on the public perception of events. Nevertheless, media outlets are often biased. One way to bias news articles is by altering the word choice. The automatic identification of bias by word choice is challenging, primarily due to the lack of a gold standard data set and high context dependencies. This paper presents BABE, a robust and diverse data set created by trained experts, for media bias research. We also analyze why expert labeling is essential within this domain. Our data set offers better annotation quality and higher inter-annotator agreement than existing work. It consists of 3,700 sentences balanced among topics and outlets, containing media bias labels on the word and sentence level. Based on our data, we also introduce a way to detect bias-inducing sentences in news articles automatically. Our best performing BERT-based model is pre-trained on a larger corpus consisting of distant labels. Fine-tuning and evaluating the model on our proposed supervised data set, we achieve a macro F1-score of 0.804, outperforming existing methods.
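
For readers unfamiliar with the metric reported above, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it from scratch. The label names and the four toy predictions are invented and are not taken from the BABE data set.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores over all labels seen in gold or pred."""
    f1s = []
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy sentence-level labels (assumption).
gold = ["biased", "biased", "neutral", "neutral"]
pred = ["biased", "neutral", "neutral", "neutral"]
score = macro_f1(gold, pred)
```

Here the "biased" class has F1 = 2/3 and the "neutral" class has F1 = 4/5, so the macro average is about 0.733. Because every class counts equally, the metric is not dominated by the majority class.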

pdf bib
Learning and Evaluating a Differentially Private Pre-trained Language Model
Shlomo Hoory | Amir Feder | Avichai Tendler | Sofia Erell | Alon Peled-Cohen | Itay Laish | Hootan Nakhost | Uri Stemmer | Ayelet Benjamini | Avinatan Hassidim | Yossi Matias

Contextual language models have led to significantly better results, especially when pre-trained on the same data as the downstream task. While this additional pre-training usually improves performance, it can lead to information leakage and therefore risks the privacy of individuals mentioned in the training data. One method to guarantee the privacy of such individuals is to train a differentially private language model, but this usually comes at the expense of model performance. Also, in the absence of differentially private vocabulary training, it is not possible to modify the vocabulary to fit the new data, which might further degrade results. In this work we bridge these gaps, and provide guidance to future researchers and practitioners on how to improve privacy while maintaining good model performance. We introduce a novel differentially private word-piece algorithm, which allows training a tailored domain-specific vocabulary while maintaining privacy. We then experiment with entity extraction tasks from clinical notes, and demonstrate how to train a differentially private pre-trained language model (i.e., BERT) with a privacy guarantee of 𝜖=1.1 and with only a small degradation in performance. Finally, as it is hard to tell, given a privacy parameter 𝜖, what the effect on the trained representation was, we present experiments showing that the trained model does not memorize private information.

pdf bib
Simulated Chats for Building Dialog Systems: Learning to Generate Conversations from Instructions
Biswesh Mohapatra | Gaurav Pandey | Danish Contractor | Sachindra Joshi

Popular dialog datasets such as MultiWOZ are created by providing crowd workers an instruction, expressed in natural language, that describes the task to be accomplished. Crowd workers play the role of a user and an agent to generate dialogs to accomplish tasks involving booking restaurant tables, calling a taxi, etc. In this paper, we present a data creation strategy that uses the pre-trained language model, GPT2, to simulate the interaction between crowd workers by creating a user bot and an agent bot. We train the simulators using a smaller percentage of actual crowd-generated conversations and their corresponding instructions. We demonstrate that by using the simulated data, we achieve significant improvements in low-resource settings on two publicly available datasets: the MultiWOZ dataset and the Persona chat dataset.

pdf bib
Past, Present, and Future: Conversational Emotion Recognition through Structural Modeling of Psychological Knowledge
Jiangnan Li | Zheng Lin | Peng Fu | Weiping Wang

Conversational Emotion Recognition (CER) is a task to predict the emotion of an utterance in the context of a conversation. Although modeling the conversational context and interactions between speakers has been studied broadly, it is important to consider the speaker’s psychological state, which controls the action and intention of the speaker. The state-of-the-art method introduces CommonSense Knowledge (CSK) to model psychological states in a sequential way (forwards and backwards). However, it ignores the structural psychological interactions between utterances. In this paper, we propose a pSychological-Knowledge-Aware Interaction Graph (SKAIG). In the locally connected graph, the targeted utterance will be enhanced with the information of action inferred from the past context and intention implied by the future context. The utterance is self-connected to consider the present effect from itself. Furthermore, we utilize CSK to enrich edges with knowledge representations and process the SKAIG with a graph transformer. Our method achieves state-of-the-art and competitive performance on four popular CER datasets.

pdf bib
An unsupervised framework for tracing textual sources of moral change
Aida Ramezani | Zining Zhu | Frank Rudzicz | Yang Xu

Morality plays an important role in social well-being, but people’s moral perception is not stable and changes over time. Recent advances in natural language processing have shown that text is an effective medium for informing moral change, but no attempt has been made to quantify the origins of these changes. We present a novel unsupervised framework for tracing textual sources of moral change toward entities through time. We characterize moral change with probabilistic topical distributions and infer the source text that exerts prominent influence on the moral time course. We evaluate our framework on a diverse set of data ranging from social media to news articles. We show that our framework not only captures fine-grained human moral judgments, but also identifies coherent source topics of moral change triggered by historical events. We apply our methodology to analyze the news in the COVID-19 pandemic and demonstrate its utility in identifying sources of moral change in high-impact and real-time social events.

pdf bib
Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization
Junpeng Liu | Yanyan Zou | Hainan Zhang | Hongshen Chen | Zhuoye Ding | Caixia Yuan | Xiaojie Wang

Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors, exchanging information with each other. In such a scenario, the topic of a conversation can vary upon progression and the key information for a certain topic is often scattered across multiple utterances of different speakers, which poses challenges for abstractive dialogue summarization. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available.

pdf bib
TWT: Table with Written Text for Controlled Data-to-Text Generation
Tongliang Li | Lei Fang | Jian-Guang Lou | Zhoujun Li

Large pre-trained neural models have recently shown remarkable progress in text generation. In this paper, we propose to generate text conditioned on the structured data (table) and a prefix (the written text) by leveraging the pre-trained models. We present a new data-to-text dataset, Table with Written Text (TWT), by repurposing two existing datasets: ToTTo and TabFact. TWT contains both factual and logical statements that are faithful to the structured data, aiming to serve as a useful benchmark for controlled text generation. Compared with existing data-to-text task settings, TWT is more intuitive: the prefix (usually provided by the user) controls the topic of the generated text. Existing methods usually output hallucinated text that is not faithful on TWT. Therefore, we design a novel approach with table-aware attention visibility and a copy mechanism over the table. Experimental results show that our approach outperforms state-of-the-art methods under both automatic and human evaluation metrics.

pdf bib
ArabicTransformer: Efficient Large Arabic Language Model with Funnel Transformer and ELECTRA Objective
Sultan Alrowili | Vijay Shanker

Pre-training Transformer-based models such as BERT and ELECTRA on a collection of Arabic corpora, as demonstrated by both AraBERT and AraELECTRA, yields impressive results on downstream tasks. However, pre-training Transformer-based language models is computationally expensive, especially for large-scale models. Recently, Funnel Transformer has addressed the sequential redundancy inside the Transformer architecture by compressing the sequence of hidden states, leading to a significant reduction in the pre-training cost. This paper empirically studies the performance and efficiency of building an Arabic language model with Funnel Transformer and the ELECTRA objective. We find that our model achieves state-of-the-art results on several Arabic downstream tasks despite using fewer computational resources compared to other BERT-based models.

pdf bib
Which is Making the Contribution: Modulating Unimodal and Cross-modal Dynamics for Multimodal Sentiment Analysis
Ying Zeng | Sijie Mai | Haifeng Hu

Multimodal sentiment analysis (MSA) draws increasing attention with the availability of multimodal data. The boost in performance of MSA models is mainly hindered by two problems. On the one hand, recent MSA works mostly focus on learning cross-modal dynamics, but neglect to explore an optimal solution for unimodal networks, which determines the lower limit of MSA models. On the other hand, noisy information hidden in each modality interferes with the learning of correct cross-modal dynamics. To address the above-mentioned problems, we propose a novel MSA framework, Modulation Model for Multimodal Sentiment Analysis (M3SA), to identify the contribution of modalities and reduce the impact of noisy information, so as to better learn unimodal and cross-modal dynamics. Specifically, a modulation loss is designed to modulate the loss contribution based on the confidence of individual modalities in each utterance, so as to explore an optimal update solution for each unimodal network. Besides, contrary to most existing works which fail to explicitly filter out noisy information, we devise a modality filter module to identify and filter out modality noise for the learning of correct cross-modal embedding. Extensive experiments on publicly available datasets demonstrate that our approach achieves state-of-the-art performance.

pdf bib
CVAE-based Re-anchoring for Implicit Discourse Relation Classification
Zujun Dou | Yu Hong | Yu Sun | Guodong Zhou

Training implicit discourse relation classifiers suffers from data sparsity. Variational AutoEncoder (VAE) appears to be the proper solution, because ideally a VAE is capable of generating inexhaustible varying samples, which facilitates selective data augmentation. However, our experiments show that coupling VAE with the RoBERTa-based classifier results in severe performance degradation. We ascribe this unusual phenomenon to erroneous sampling that happens when the VAE pursues variations. To overcome the problem, we develop a re-anchoring strategy, where Conditional VAE (CVAE) is used for estimating the risk of erroneous sampling, and meanwhile migrating the anchor to reduce the risk. The test results on PDTB v2.0 illustrate that, compared to the RoBERTa-based baseline, re-anchoring yields substantial improvements. Besides, we observe that re-anchoring can cooperate with other auxiliary strategies (transfer learning and interactive attention mechanism) to further improve the baseline, obtaining F-scores of about 55%, 63%, 80% and 44% for the four main relation types (Comparison, Contingency, Expansion, Temporality) in the binary classification (Yes/No) scenario.

pdf bib
Combining Curriculum Learning and Knowledge Distillation for Dialogue Generation
Qingqing Zhu | Xiuying Chen | Pengfei Wu | JunFei Liu | Dongyan Zhao

Curriculum learning, a machine training strategy that feeds training instances to the model from easy to hard, has been proven to facilitate the dialogue generation task. Meanwhile, knowledge distillation, a knowledge transfer methodology between teacher and student networks, can yield a significant performance boost for student models. Hence, in this paper, we introduce a combination of curriculum learning and knowledge distillation for efficient dialogue generation models, where curriculum learning can help knowledge distillation from the data and model aspects. To start with, from the data aspect, we cluster the training cases according to their complexity, which is calculated by various types of features such as sentence length and coherence between dialog pairs. Furthermore, we employ an adversarial training strategy to identify the complexity of cases at the model level. The intuition is that, if a discriminator can tell whether the generated response is from the teacher or the student, then the case is a difficult one that the student model has not yet adapted to. Finally, we use self-paced learning, which is an extension of curriculum learning, to assign weights for distillation. In conclusion, we arrange a hierarchical curriculum based on the above two aspects for the student model under the guidance of the teacher model. Experimental results demonstrate that our methods achieve improvements compared with competitive baselines.
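The data-side idea of ordering training cases from easy to hard can be illustrated with a minimal sketch. It assumes that complexity is approximated by whitespace token count alone and that cases are split into equal-sized contiguous groups; the paper's actual features (for example coherence between dialog pairs) and its clustering procedure are not reproduced here, and the function names are hypothetical.

```python
def complexity(pair):
    """Hypothetical complexity proxy: total whitespace tokens in context and response."""
    context, response = pair
    return len(context.split()) + len(response.split())


def curriculum_buckets(pairs, n_buckets):
    """Sort dialog pairs from easy to hard and split them into contiguous buckets."""
    ordered = sorted(pairs, key=complexity)
    if not ordered:
        return []
    size = -(-len(ordered) // n_buckets)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


pairs = [
    ("a b c d", "e f g"),
    ("hi", "yo"),
    ("how are you", "fine"),
]
for stage, bucket in enumerate(curriculum_buckets(pairs, 2)):
    print(stage, bucket)
```

A training loop would then visit the buckets in order, so the student model sees the shortest exchanges first.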

pdf bib
Improving End-to-End Task-Oriented Dialog System with A Simple Auxiliary Task
Yohan Lee

The paradigm of leveraging large pre-trained language models has made significant progress on benchmarks for task-oriented dialogue (TOD) systems. In this paper, we combine this paradigm with a multi-task learning framework for end-to-end TOD modeling by adopting span prediction as an auxiliary task. In the end-to-end setting, our model achieves new state-of-the-art results with combined scores of 108.3 and 107.5 on MultiWOZ 2.0 and MultiWOZ 2.1, respectively. Furthermore, we demonstrate that multi-task learning improves not only the performance of the model but also its generalization capability through domain adaptation experiments in the few-shot setting. The code is available at github.com/bepoetree/MTTOD.
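For readers unfamiliar with the metric, the MultiWOZ combined score is conventionally computed as the average of the Inform and Success rates plus BLEU. The sketch below shows that formula with made-up input values; they are not the paper's per-metric results, which the abstract does not list.

```python
def combined_score(inform, success, bleu):
    """MultiWOZ combined score: (Inform + Success) * 0.5 + BLEU, all on a 0-100 scale."""
    return (inform + success) * 0.5 + bleu


# Hypothetical metric values, for illustration only.
print(combined_score(inform=90.0, success=80.0, bleu=20.0))
```

Because BLEU is added without rescaling, the combined score can exceed 100.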

pdf bib
EDTC: A Corpus for Discourse-Level Topic Chain Parsing
Longyin Zhang | Xin Tan | Fang Kong | Guodong Zhou

Discourse analysis has long been known to be fundamental in natural language processing. In this research, we present our insight on discourse-level topic chain (DTC) parsing which aims at discovering new topics and investigating how these topics evolve over time within an article. To address the lack of data, we contribute a new discourse corpus with DTC-style dependency graphs annotated upon news articles. In particular, we ensure the high reliability of the corpus by utilizing a two-step annotation strategy to build the data and filtering out the annotations with low confidence scores. Based on the annotated corpus, we introduce a simple yet robust system for automatic discourse-level topic chain parsing.

pdf bib
Multilingual Neural Machine Translation: Can Linguistic Hierarchies Help?
Fahimeh Saleh | Wray Buntine | Gholamreza Haffari | Lan Du

Multilingual Neural Machine Translation (MNMT) trains a single NMT model that supports translation between multiple languages, rather than training separate models for different languages. Learning a single model can enhance low-resource translation by leveraging data from multiple languages. However, the performance of an MNMT model is highly dependent on the type of languages used in training, as transferring knowledge from a diverse set of languages degrades the translation performance due to negative transfer. In this paper, we propose a Hierarchical Knowledge Distillation (HKD) approach for MNMT which capitalises on language groups generated according to typological features and phylogeny of languages to overcome the issue of negative transfer. HKD generates a set of multilingual teacher-assistant models via a selective knowledge distillation mechanism based on the language groups, and then distills the ultimate multilingual model from those assistants in an adaptive way. Experimental results derived from the TED dataset with 53 languages demonstrate the effectiveness of our approach in avoiding the negative transfer effect in MNMT, leading to improved translation performance (about 1 BLEU point on average) compared to strong baselines.

pdf bib
Self Question-answering: Aspect-based Sentiment Analysis by Role Flipped Machine Reading Comprehension
Guoxin Yu | Jiwei Li | Ling Luo | Yuxian Meng | Xiang Ao | Qing He

The pivot for the unified Aspect-based Sentiment Analysis (ABSA) is to couple aspect terms with their corresponding opinion terms, which might further derive easier sentiment predictions. In this paper, we investigate the unified ABSA task from the perspective of Machine Reading Comprehension (MRC) by observing that the aspect and the opinion terms can serve as the query and answer in MRC interchangeably. We propose a new paradigm named Role Flipped Machine Reading Comprehension (RF-MRC) to resolve this task. At its heart, the predicted results of either the Aspect Term Extraction (ATE) or the Opinion Terms Extraction (OTE) are regarded as the queries, and the matched opinion or aspect terms, respectively, are considered as answers. The queries and answers can be flipped for multi-hop detection. Finally, every matched aspect-opinion pair is predicted by the sentiment classifier. RF-MRC can solve the ABSA task without any additional data annotation or transformation. Experiments on three widely used benchmarks and a challenging dataset demonstrate the superiority of the proposed framework.

pdf bib
Generalization in Text-based Games via Hierarchical Reinforcement Learning
Yunqiu Xu | Meng Fang | Ling Chen | Yali Du | Chengqi Zhang

Deep reinforcement learning provides a promising approach for text-based games in studying natural language communication between humans and artificial agents. However, generalization remains a big challenge as the agents depend critically on the complexity and variety of training tasks. In this paper, we address this problem by introducing a hierarchical framework built upon the knowledge graph-based RL agent. At the high level, a meta-policy is executed to decompose the whole game into a set of subtasks specified by textual goals, and to select one of them based on the KG. Then a sub-policy at the low level is executed to conduct goal-conditioned reinforcement learning. We carry out experiments on games with various difficulty levels and show that the proposed method enjoys favorable generalizability.

pdf bib
A Finer-grain Universal Dialogue Semantic Structures based Model For Abstractive Dialogue Summarization
Yuejie Lei | Fujia Zheng | Yuanmeng Yan | Keqing He | Weiran Xu

Although abstractive summarization models have achieved impressive results on document summarization tasks, their performance on dialogue modeling is much less satisfactory due to the crude and straightforward methods used for dialogue encoding. To address this problem, we propose a novel end-to-end Transformer-based model, FinDS, for abstractive dialogue summarization that leverages Finer-grain universal Dialogue semantic Structures to model dialogue and generate better summaries. Experiments on the SAMsum dataset show that FinDS outperforms various dialogue summarization approaches and achieves new state-of-the-art (SOTA) ROUGE results. Finally, we apply FinDS to a more complex scenario, showing the robustness of our model. We also release our source code.

pdf bib
Constructing contrastive samples via summarization for text classification with limited annotations
Yangkai Du | Tengfei Ma | Lingfei Wu | Fangli Xu | Xuhong Zhang | Bo Long | Shouling Ji

Contrastive Learning has emerged as a powerful representation learning method and facilitates various downstream tasks, especially when supervised data is limited. How to construct efficient contrastive samples through data augmentation is key to its success. Unlike in vision tasks, the data augmentation method for contrastive learning has not been investigated sufficiently in language tasks. In this paper, we propose a novel approach to construct contrastive samples for language tasks using text summarization. We use these samples for supervised contrastive learning to gain better text representations which greatly benefit text classification tasks with limited annotations. To further improve the method, we mix up samples from different classes and add an extra regularization, named Mixsum, in addition to the cross-entropy loss. Experiments on real-world text classification datasets (Amazon-5, Yelp-5, AG News, and IMDb) demonstrate the effectiveness of the proposed contrastive learning framework with summarization-based data augmentation and Mixsum regularization.

pdf bib
End-to-end Neural Information Status Classification
Yufang Hou

Most previous studies on information status (IS) classification and bridging anaphora recognition assume that the gold mention or syntactic tree information is given (Hou et al., 2013; Roesiger et al., 2018; Hou, 2020; Yu and Poesio, 2020). In this paper, we propose an end-to-end neural approach for information status classification. Our approach consists of a mention extraction component and an information status assignment component. During the inference time, our system takes a raw text as the input and generates mentions together with their information status. On the ISNotes corpus (Markert et al., 2012), we show that our information status assignment component achieves new state-of-the-art results on fine-grained IS classification based on gold mentions. Furthermore, our system performs significantly better than other baselines for both mention extraction and fine-grained IS classification in the end-to-end setting. Finally, we apply our system on BASHI (Roesiger, 2018) and SciCorp (Roesiger, 2016) to recognize referential bridging anaphora. We find that our end-to-end system trained on ISNotes achieves competitive results on bridging anaphora recognition compared to the previous state-of-the-art system that relies on syntactic information and is trained on the in-domain datasets (Yu and Poesio, 2020).

pdf bib
EventKE: Event-Enhanced Knowledge Graph Embedding
Zixuan Zhang | Hongwei Wang | Han Zhao | Hanghang Tong | Heng Ji

Relations in most of the traditional knowledge graphs (KGs) only reflect static and factual connections, but fail to represent the dynamic activities and state changes about entities. In this paper, we emphasize the importance of incorporating events in KG representation learning, and propose an event-enhanced KG embedding model EventKE. Specifically, given the original KG, we first incorporate event nodes by building a heterogeneous network, where entity nodes and event nodes are distributed on the two sides of the network inter-connected by event argument links. We then use entity-entity relations from the original KG and event-event temporal links to inner-connect entity and event nodes respectively. We design a novel and effective attention-based message passing method, which is conducted on entity-entity, event-entity, and event-event relations to fuse the event information into KG embeddings. Experimental results on real-world datasets demonstrate that events can greatly improve the quality of the KG embeddings on multiple downstream tasks.

pdf bib
Modeling Concentrated Cross-Attention for Neural Machine Translation with Gaussian Mixture Model
Shaolei Zhang | Yang Feng

Cross-attention is an important component of neural machine translation (NMT), which is always realized by dot-product attention in previous methods. However, dot-product attention only considers the pair-wise correlation between words, resulting in dispersion when dealing with long sentences and neglect of source neighboring relationships. From a linguistic perspective, the above issues are caused by ignoring a type of cross-attention, called concentrated attention, which focuses on several central words and then spreads around them. In this work, we apply a Gaussian Mixture Model (GMM) to model the concentrated attention in cross-attention. Experiments and analyses we conducted on three datasets show that the proposed method outperforms the baseline and significantly improves alignment quality, N-gram accuracy, and long sentence translation.

pdf bib
Inconsistency Matters: A Knowledge-guided Dual-inconsistency Network for Multi-modal Rumor Detection
Mengzhu Sun | Xi Zhang | Jianqiang Ma | Yazheng Liu

Rumor spreaders are increasingly utilizing multimedia content to attract the attention and trust of news consumers. Though a set of rumor detection models have exploited the multi-modal data, they seldom consider the inconsistent relationships among images and texts. Moreover, they also fail to find a powerful way to spot the inconsistency information among the post contents and background knowledge. Motivated by the intuition that rumors are more likely to have inconsistency information in semantics, a novel Knowledge-guided Dual-inconsistency network is proposed to detect rumors with multimedia contents. It can capture the inconsistent semantics at the cross-modal level and the content-knowledge level in one unified framework. Extensive experiments on two public real-world datasets demonstrate that our proposal can outperform the state-of-the-art baselines.

pdf bib
EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation
Chenhe Dong | Guangrun Wang | Hang Xu | Jiefeng Peng | Xiaozhe Ren | Xiaodan Liang

Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices. In this paper, we have a critical insight that improving the feed-forward network (FFN) in BERT has a higher gain than improving the multi-head attention (MHA) since the computational cost of FFN is 2~3 times larger than MHA. Hence, to compact BERT, we are devoted to designing efficient FFN as opposed to previous works that pay attention to MHA. Since FFN comprises a multilayer perceptron (MLP) that is essential in BERT optimization, we further design a thorough search space towards an advanced MLP and perform a coarse-to-fine mechanism to search for an efficient BERT architecture. Moreover, to accelerate searching and enhance model transferability, we employ a novel warm-up knowledge distillation strategy at each search stage. Extensive experiments show our searched EfficientBERT is 6.9× smaller and 4.4× faster than BERT_BASE, and has competitive performances on GLUE and SQuAD Benchmarks. Concretely, EfficientBERT attains a 77.7 average score on GLUE test, 0.7 higher than MobileBERT_TINY, and achieves an 85.3/74.5 F1 score on SQuAD v1.1/v2.0 dev, 3.2/2.7 higher than TinyBERT_4 even without data augmentation. The code is released at https://github.com/cheneydon/efficient-bert.

pdf bib
Uni-FedRec: A Unified Privacy-Preserving News Recommendation Framework for Model Training and Online Serving
Tao Qi | Fangzhao Wu | Chuhan Wu | Yongfeng Huang | Xing Xie

News recommendation techniques can help users on news platforms obtain their preferred news information. Most existing news recommendation methods rely on centrally stored user behavior data to train models and serve users. However, user data is usually highly privacy-sensitive, and centrally storing it in the news platform may raise privacy concerns and risks. In this paper, we propose a unified news recommendation framework, which can utilize user data locally stored in user clients to train models and serve users in a privacy-preserving way. Following a widely used paradigm in real-world recommender systems, our framework contains a stage for candidate news generation (i.e., recall) and a stage for candidate news ranking (i.e., ranking). At the recall stage, each client locally learns multiple interest representations from clicked news to comprehensively model user interests. These representations are uploaded to the server to recall candidate news from a large news pool, which are further distributed to the user client at the ranking stage for personalized news display. In addition, we propose an interest decomposer-aggregator method with perturbation noise to better protect private user information encoded in user interest representations. Besides, we collaboratively train both recall and ranking models on the data decentralized across a large number of user clients in a privacy-preserving way. Experiments on two real-world news datasets show that our method can outperform baseline methods and effectively protect user privacy.

pdf bib
Mapping Language to Programs using Multiple Reward Components with Inverse Reinforcement Learning
Sayan Ghosh | Shashank Srivastava

Mapping natural language instructions to programs that computers can process is a fundamental challenge. Existing approaches focus on likelihood-based training or using reinforcement learning to fine-tune models based on a single reward. In this paper, we pose program generation from language as Inverse Reinforcement Learning. We introduce several interpretable reward components and jointly learn (1) a reward function that linearly combines them, and (2) a policy for program generation. Fine-tuning with our approach achieves significantly better performance than competitive methods using Reinforcement Learning (RL). On the VirtualHome framework, we get improvements of up to 9.0% on the Longest Common Subsequence metric and 14.7% on recall-based metrics over previous work on this framework (Puig et al., 2018). The approach is data-efficient, showing larger gains in performance in the low-data regime. Generated programs are also preferred by human evaluators over an RL-based approach, and rated higher on relevance, completeness, and human-likeness.
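Two ingredients mentioned above can be sketched in a few lines: a reward that linearly combines named components, and a longest-common-subsequence score between a generated and a reference program. This is an illustrative sketch only. The component names, the weights, and the normalization by the longer program's length are assumptions, and the learning of the weights and the policy is not shown.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two step sequences (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def normalized_lcs(a, b):
    """LCS length divided by the longer sequence's length (normalization is an assumption)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def combined_reward(components, weights):
    """Linear combination of reward components; in IRL the weights would be learned."""
    return sum(weights[name] * value for name, value in components.items())


# Hypothetical programs and weights, for illustration only.
reference = ["walk", "grab", "sit"]
generated = ["walk", "sit"]
components = {"lcs": normalized_lcs(generated, reference), "recall": 1.0}
print(combined_reward(components, {"lcs": 2.0, "recall": 1.0}))
```

Keeping each component as a separately named term is what makes the learned reward inspectable: the weight on each term shows how much it contributes.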

pdf bib
Topic-Guided Abstractive Multi-Document Summarization
Peng Cui | Le Hu

A critical point of multi-document summarization (MDS) is to learn the relations among various documents. In this paper, we propose a novel abstractive MDS model, in which we represent multiple documents as a heterogeneous graph, taking semantic nodes of different granularities into account, and then apply a graph-to-sequence framework to generate summaries. Moreover, we employ a neural topic model to jointly discover latent topics that can act as cross-document semantic units to bridge different documents and provide global information to guide the summary generation. Since topic extraction can be viewed as a special type of summarization that “summarizes” texts into a more abstract format, i.e., a topic distribution, we adopt a multi-task learning strategy to jointly train the topic and summarization modules, allowing them to promote each other. Experimental results on the Multi-News dataset demonstrate that our model outperforms previous state-of-the-art MDS models on both ROUGE scores and human evaluation, while also learning high-quality topics.

pdf bib
An Edge-Enhanced Hierarchical Graph-to-Tree Network for Math Word Problem Solving
Qinzhuo Wu | Qi Zhang | Zhongyu Wei

Math word problem solving has attracted considerable research interest in recent years. Previous works have shown the effectiveness of utilizing graph neural networks to capture the relationships in the problem. However, these works did not carefully take the edge label information and the long-range word relationship across sentences into consideration. In addition, during generation, they focus on the most relevant areas of the currently generated word, while neglecting the rest of the problem. In this paper, we propose a novel Edge-Enhanced Hierarchical Graph-to-Tree model (EEH-G2T), in which the math word problems are represented as edge-labeled graphs. Specifically, an edge-enhanced hierarchical graph encoder is used to incorporate edge label information. This encoder updates the graph nodes hierarchically in two steps: sentence-level aggregation and problem-level aggregation. Furthermore, a tree-structured decoder with a split attention mechanism is applied to guide the model to pay attention to different parts of the input problem. Experimental results on the MAWPS and Math23K datasets show that our EEH-G2T can effectively improve performance compared with state-of-the-art methods.

pdf bib
SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation
Hong Chen | Hiroya Takamura | Hideki Nakayama

Generating texts in scientific papers requires not only capturing the content contained within the given input but also frequently acquiring the external information called context. We push forward the scientific text generation by proposing a new task, namely context-aware text generation in the scientific domain, aiming at exploiting the contributions of context in generated texts. To this end, we present a novel challenging large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely-used objects (e.g., tables, figures, algorithms) in a paper. We comprehensively benchmark, using state-of-the-art models, the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly available to facilitate scientific text generation research.

pdf bib
Don’t Miss the Potential Customers! Retrieving Similar Ads to Improve User Targeting
Yi Feng | Ting Wang | Chuanyi Li | Vincent Ng | Jidong Ge | Bin Luo | Yucheng Hu | Xiaopeng Zhang

User targeting is an essential task in the modern advertising industry: given a package of ads for a particular category of products (e.g., green tea), identify the online users to whom the ad package should be targeted. An (ad-package-specific) user targeting model is typically trained using historical clickthrough data: positive instances correspond to users who have clicked on an ad in the package before, whereas negative instances correspond to users who have not clicked on any ads in the package that were displayed to them. Collecting a sufficient amount of positive training data for training an accurate user targeting model, however, is by no means trivial. This paper focuses on the development of a method for automatic augmentation of the set of positive training instances. Experimental results on two datasets, including a real-world company dataset, demonstrate the effectiveness of our proposed method.

pdf bib
Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph
Nuttapong Chairatanakul | Noppayut Sriwatanasakdi | Nontawat Charoenphakdee | Xin Liu | Tsuyoshi Murata

In cross-lingual text classification, it is required that task-specific training data in high-resource source languages are available, where the task is identical to that of a low-resource target language. However, collecting such training data can be infeasible because of the labeling cost, task characteristics, and privacy concerns. This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries. First, we construct a dictionary-based heterogeneous graph (DHG) from bilingual dictionaries. This opens the possibility to use graph neural networks for cross-lingual transfer. The remaining challenge is the heterogeneity of DHG because multiple languages are considered. To address this challenge, we propose a dictionary-based heterogeneous graph neural network (DHGNet) that effectively handles the heterogeneity of DHG by two-step aggregations, which are word-level and language-level aggregations. Experimental results demonstrate that our method outperforms pretrained models even though it does not have access to large corpora. Furthermore, it can perform well even though dictionaries contain many incorrect translations. Its robustness allows the usage of a wider range of dictionaries such as automatically constructed and crowdsourced dictionaries, which are convenient for real-world applications.

pdf bib
Improving Distantly-Supervised Named Entity Recognition with Self-Collaborative Denoising Learning
Xinghua Zhang | Bowen Yu | Tingwen Liu | Zhenyu Zhang | Jiawei Sheng | Xue Mengge | Hongbo Xu

Distantly supervised named entity recognition (DS-NER) efficiently reduces labor costs but meanwhile intrinsically suffers from the label noise due to the strong assumption of distant supervision. Typically, the wrongly labeled instances comprise numbers of incomplete and inaccurate annotations, while most prior denoising works are only concerned with one kind of noise and fail to fully explore useful information in the training set. To address this issue, we propose a robust learning paradigm named Self-Collaborative Denoising Learning (SCDL), which jointly trains two teacher-student networks in a mutually-beneficial manner to iteratively perform noisy label refinery. Each network is designed to exploit reliable labels via self denoising, and two networks communicate with each other to explore unreliable annotations by collaborative denoising. Extensive experimental results on five real-world datasets demonstrate that SCDL is superior to state-of-the-art DS-NER denoising methods.

pdf bib
Entity-Based Semantic Adequacy for Data-to-Text Generation
Juliette Faille | Albert Gatt | Claire Gardent

While powerful pre-trained language models have improved the fluency of text generation models, semantic adequacy (the ability to generate text that is semantically faithful to the input) remains an unsolved issue. In this paper, we introduce a novel automatic evaluation metric, Entity-Based Semantic Adequacy, which can be used to assess to what extent generation models that verbalise RDF (Resource Description Framework) graphs produce text that contains mentions of the entities occurring in the RDF input. This is important as RDF subject and object entities make up 2/3 of the input. We use our metric to compare 25 models from the WebNLG Shared Tasks and we examine correlation with results from human evaluations of semantic adequacy. We show that while our metric correlates with human evaluation scores, this correlation varies with the specifics of the human evaluation setup. This suggests that in order to measure the entity-based adequacy of generated texts, an automatic metric such as the one proposed here might be more reliable than human evaluation measures, being less subjective and more focused on correct verbalisation of the input.
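The idea of checking whether entities from the RDF input are mentioned in the generated text can be illustrated with a minimal sketch. This is not the authors' metric: the function name `entity_adequacy`, the exact-substring matching after normalisation, and the example triples are assumptions made purely for illustration, and the real metric may match entities quite differently.

```python
import re


def entity_adequacy(entities, text):
    """Fraction of distinct input entities whose surface form appears in text.

    Hypothetical simplification: an entity counts as mentioned only if its
    normalised label occurs verbatim in the normalised text.
    """
    def norm(s):
        # Lowercase, turn underscores into spaces, drop other punctuation.
        # Note: this crude normalisation also drops non-ASCII letters.
        s = s.replace("_", " ").lower()
        return re.sub(r"[^a-z0-9 ]+", "", s).strip()

    norm_text = " " + norm(text) + " "
    unique = {norm(e) for e in entities}
    unique.discard("")
    if not unique:
        return 1.0  # nothing to verbalise, so nothing is missing
    found = sum(1 for e in unique if " " + e + " " in norm_text)
    return found / len(unique)


# Illustrative (subject, property, object) triples.
triples = [
    ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
    ("Alan_Bean", "occupation", "Test_pilot"),
]
# Subjects and objects are the entities; properties are ignored here.
entities = [t[0] for t in triples] + [t[2] for t in triples]
score = entity_adequacy(entities, "Alan Bean was born in Wheeler, Texas.")
```

In this example the text omits the occupation, so only two of the three distinct entities are counted as mentioned.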

pdf bib
MiRANews: Dataset and Benchmarks for Multi-Resource-Assisted News Summarization
Xinnuo Xu | Ondřej Dušek | Shashi Narayan | Verena Rieser | Ioannis Konstas

One of the most challenging aspects of current single-document news summarization is that the summary often contains ‘extrinsic hallucinations’, i.e., facts that are not present in the source document, which are often derived via world knowledge. This causes summarisation systems to act more like open-ended language models tending to hallucinate facts that are erroneous. In this paper, we mitigate this problem with the help of multiple supplementary resource documents assisting the task. We present a new dataset MiRANews and benchmark existing summarisation models. In contrast to multi-document summarization, which addresses multiple events from several source documents, we still aim at generating a summary for a single document. We show via data analysis that it’s not only the models which are to blame: more than 27% of facts mentioned in the gold summaries of MiRANews are better grounded on assisting documents than in the main source articles. An error analysis of generated summaries from pretrained models fine-tuned on MiRANews reveals that this has an even bigger effect on models: assisted summarisation reduces 55% of hallucinations when compared to single-document summarisation models trained on the main article only.

pdf bib
A Conditional Generative Matching Model for Multi-lingual Reply Suggestion
Budhaditya Deb | Guoqing Zheng | Milad Shokouhi | Ahmed Hassan Awadallah

We study the problem of multilingual automated reply suggestion (RS) models serving many languages simultaneously. Multilingual models are often challenged by model capacity and severe data distribution skew across languages. While prior works largely focus on monolingual models, we propose Conditional Generative Matching models (CGM), optimized within a Variational Autoencoder framework to address challenges arising from multilingual RS. CGM does so with expressive message conditional priors, mixture densities to enhance multilingual data representation, latent alignment for language discrimination, and effective variational optimization techniques for training multilingual RS. The enhancements result in performance that exceeds competitive baselines in relevance (ROUGE score) by more than 10% on average, and 16% for low resource languages. CGM also shows remarkable improvements in diversity (80%), illustrating its expressiveness in representation of multi-lingual data.

pdf bib
Rethinking Sentiment Style Transfer
Ping Yu | Yang Zhao | Chunyuan Li | Changyou Chen

Though remarkable efforts have been made in non-parallel text style transfer, the evaluation system is unsatisfactory. It always evaluates over samples from only one checkpoint of the model and compares three metrics, i.e., transfer accuracy, BLEU score, and PPL score. In this paper, we argue the inappropriateness of both existing evaluation metrics and the evaluation method. Specifically, for evaluation metrics, we make a detailed analysis and comparison from three aspects: style transfer, content preservation, and naturalness; for the evaluation method, we reiterate the fallacy of picking one checkpoint for model comparison. As a result, we establish a robust evaluation method by examining the trade-off between style transfer and naturalness, and between content preservation and naturalness. Notably, we elaborate the human evaluation and automatically identify the inaccurate measurement of content preservation computed by the BLEU score. To overcome this issue, we propose a graph-based method to extract attribute content and attribute-independent content from input sentences in the YELP dataset and IMDB dataset. With the modified datasets, we design a new evaluation metric called “attribute hit” and propose an efficient regularization to leverage the attribute-dependent content and attribute-independent content as guiding signals. Experimental results have demonstrated the effectiveness of the proposed strategy.

pdf bib
HypoGen: Hyperbole Generation with Commonsense and Counterfactual Knowledge
Yufei Tian | Arvind krishna Sridhar | Nanyun Peng

A hyperbole is an intentional and creative exaggeration not to be taken literally. Despite its ubiquity in daily life, the computational explorations of hyperboles are scarce. In this paper, we tackle the under-explored and challenging task: sentence-level hyperbole generation. We start with a representative syntactic pattern for intensification and systematically study the semantic (commonsense and counterfactual) relationships between each component in such hyperboles. We then leverage commonsense and counterfactual inference to generate hyperbole candidates based on our findings from the pattern, and train neural classifiers to rank and select high-quality hyperboles. Automatic and human evaluations show that our generation method is able to generate hyperboles with high success rate, intensity, funniness, and creativity.

pdf bib
Profiling News Discourse Structure Using Explicit Subtopic Structures Guided Critics
Prafulla Kumar Choubey | Ruihong Huang

We present an actor-critic framework to induce subtopical structures in a news article for news discourse profiling. The model uses multiple critics that act according to known subtopic structures while the actor aims to outperform them. The content structures constitute sentences that represent latent subtopic boundaries. Then, we introduce a hierarchical neural network that uses the identified subtopic boundary sentences to model multi-level interaction between sentences, subtopics, and the document. Experimental results and analyses on the NewsDiscourse corpus show that the actor model learns to effectively segment a document into subtopics and improves the performance of the hierarchical model on the news discourse profiling task.

pdf bib
ProtoInfoMax: Prototypical Networks with Mutual Information Maximization for Out-of-Domain Detection
Iftitahu Nimah | Meng Fang | Vlado Menkovski | Mykola Pechenizkiy

The ability to detect Out-of-Domain (OOD) inputs has been a critical requirement in many real-world NLP applications, such as intent classification in dialogue systems. The reason is that the inclusion of unsupported OOD inputs may lead to catastrophic failure of systems. However, it remains an empirical question whether current methods can tackle such problems reliably in a realistic scenario where zero OOD training data is available. In this study, we propose ProtoInfoMax, a new architecture that extends Prototypical Networks to simultaneously process in-domain and OOD sentences via a Mutual Information Maximization (InfoMax) objective. Experimental results show that our proposed method can substantially improve performance, by up to 20%, for OOD detection in low resource settings of text classification. We also show that ProtoInfoMax is less prone to typical overconfidence errors of Neural Networks, leading to more reliable prediction results.

pdf bib
Learning from Language Description: Low-shot Named Entity Recognition via Decomposed Framework
Yaqing Wang | Haoda Chu | Chao Zhang | Jing Gao

In this work, we study the problem of named entity recognition (NER) in a low resource scenario, focusing on few-shot and zero-shot settings. Built upon large-scale pre-trained language models, we propose a novel NER framework, namely SpanNER, which learns from natural language supervision and enables the identification of never-seen entity classes without using in-domain labeled data. We perform extensive experiments on 5 benchmark datasets and evaluate the proposed method in the few-shot learning, domain transfer and zero-shot learning settings. The experimental results show that the proposed method can bring 10%, 23% and 26% improvements on average over the best baselines in few-shot learning, domain transfer and zero-shot learning settings respectively.

pdf bib
BERT might be Overkill: A Tiny but Effective Biomedical Entity Linker based on Residual Convolutional Neural Networks
Tuan Lai | Heng Ji | ChengXiang Zhai

Biomedical entity linking is the task of linking entity mentions in a biomedical document to referent entities in a knowledge base. Recently, many BERT-based models have been introduced for the task. While these models achieve competitive results on many datasets, they are computationally expensive and contain about 110M parameters. Little is known about the factors contributing to their impressive performance and whether the over-parameterization is needed. In this work, we shed some light on the inner workings of these large BERT-based models. Through a set of probing experiments, we have found that the entity linking performance only changes slightly when the input word order is shuffled or when the attention scope is limited to a fixed window size. From these observations, we propose an efficient convolutional neural network with residual connections for biomedical entity linking. Because of the sparse connectivity and weight sharing properties, our model has a small number of parameters and is highly efficient. On five public datasets, our model achieves comparable or even better linking accuracy than the state-of-the-art BERT-based models while having about 60 times fewer parameters.

pdf bib
Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Gustavo Aguilar | Bryan McCann | Tong Niu | Nazneen Rajani | Nitish Shirish Keskar | Thamar Solorio

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, though robust to misspellings, pure character-level models often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement of the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We integrate it further with BERT through pre-training while keeping BERT transformer parameters fixed, thus providing a practical method. Finally, we show that incorporating our module into mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.

pdf bib
Exploring Multitask Learning for Low-Resource Abstractive Summarization
Ahmed Magooda | Diane Litman | Mohamed Elaraby

This paper explores the effect of using multitask learning for abstractive summarization in the context of small training corpora. In particular, we incorporate four different tasks (extractive summarization, language modeling, concept detection, and paraphrase detection) both individually and in combination, with the goal of enhancing the target task of abstractive summarization via multitask learning. We show that for many task combinations, a model trained in a multitask setting outperforms a model trained only for abstractive summarization, with no additional summarization data introduced. Additionally, we do a comprehensive search and find that certain tasks (e.g. paraphrase detection) consistently benefit abstractive summarization, not only when combined with other tasks but also when using different architectures and training corpora.

pdf bib
Conical Classification For Efficient One-Class Topic Determination
Sameer Khanna

As the Internet grows in size, so does the amount of text based information that exists. For many application spaces it is paramount to isolate and identify texts that relate to a particular topic. While one-class classification would be ideal for such analysis, there is a relative lack of research regarding efficient approaches with high predictive power. By noting that the range of documents we wish to identify can be represented as positive linear combinations of the Vector Space Model representing our text, we propose Conical classification, an approach that allows us to identify if a document is of a particular topic in a computationally efficient manner. We also propose Normal Exclusion, a modified version of Bi-Normal Separation that makes it more suitable within the one-class classification context. We show in our analysis that our approach not only has higher predictive power on our datasets, but is also faster to compute.

pdf bib
Improving Dialogue State Tracking with Turn-based Loss Function and Sequential Data Augmentation
Jarana Manotumruksa | Jeff Dalton | Edgar Meij | Emine Yilmaz

While state-of-the-art Dialogue State Tracking (DST) models show promising results, all of them rely on a traditional cross-entropy loss function during the training process, which may not be optimal for improving the joint goal accuracy. Although several approaches recently propose augmenting the training set by copying user utterances and replacing the real slot values with other possible or even similar values, they are not effective at improving the performance of existing DST models. To address these challenges, we propose a Turn-based Loss Function (TLF) that penalises the model if it inaccurately predicts a slot value at the early turns more so than in later turns in order to improve joint goal accuracy. We also propose a simple but effective Sequential Data Augmentation (SDA) algorithm to generate more complex user utterances and system responses to effectively train existing DST models. Experimental results on two standard DST benchmark collections demonstrate that our proposed TLF and SDA techniques significantly improve the effectiveness of the state-of-the-art DST model by approximately 7-8% relative reduction in error and achieve a new state-of-the-art joint goal accuracy with 59.50 and 54.90 on MultiWOZ2.1 and MultiWOZ2.2, respectively.
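The intuition behind weighting early-turn errors more heavily can be sketched as a turn-weighted cross-entropy. The abstract does not give the weighting scheme, so the geometric schedule `decay ** t`, the function names, and the default `decay=0.9` below are all hypothetical choices for illustration, not the TLF defined in the paper.

```python
import math


def turn_weights(num_turns, decay=0.9):
    # Hypothetical schedule: turn t gets weight decay**t, so turn 0 weighs most.
    return [decay ** t for t in range(num_turns)]


def turn_weighted_loss(gold_probs, decay=0.9):
    """Weighted average of per-turn cross-entropy terms.

    gold_probs[t] is the probability the model assigns to the gold slot
    value at dialogue turn t (assumed to be strictly positive).
    """
    weights = turn_weights(len(gold_probs), decay)
    losses = [-math.log(p) for p in gold_probs]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# The same mistake costs more when it happens at the first turn.
early_error = turn_weighted_loss([0.1, 0.9, 0.9])
late_error = turn_weighted_loss([0.9, 0.9, 0.1])
```

With `decay=1.0` every turn is weighted equally and the sketch reduces to an ordinary mean cross-entropy over turns.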

pdf bib
TIAGE: A Benchmark for Topic-Shift Aware Dialog Modeling
Huiyuan Xie | Zhenghao Liu | Chenyan Xiong | Zhiyuan Liu | Ann Copestake

Human conversations naturally evolve around different topics and fluently move between them. In research on dialog systems, the ability to actively and smoothly transition to new topics is often ignored. In this paper we introduce TIAGE, a new topic-shift aware dialog benchmark constructed utilizing human annotations on topic shifts. Based on TIAGE, we introduce three tasks to investigate different scenarios of topic-shift modeling in dialog settings: topic-shift detection, topic-shift triggered response generation and topic-aware dialog generation. Experiments on these tasks show that the topic-shift signals in TIAGE are useful for topic-shift response generation. On the other hand, dialog systems still struggle to decide when to change topic. This indicates further research is needed in topic-shift aware dialog modeling.

pdf bib
Optimal Neural Program Synthesis from Multimodal Specifications
Xi Ye | Qiaochu Chen | Isil Dillig | Greg Durrett

Multimodal program synthesis, which leverages different types of user input to synthesize a desired program, is an attractive way to scale program synthesis to challenging settings; however, it requires integrating noisy signals from the user, like natural language, with hard constraints on the program’s behavior. This paper proposes an optimal neural synthesis approach where the goal is to find a program that satisfies user-provided constraints while also maximizing the program’s score with respect to a neural model. Specifically, we focus on multimodal synthesis tasks in which the user intent is expressed using a combination of natural language (NL) and input-output examples. At the core of our method is a top-down recurrent neural model that places distributions over abstract syntax trees conditioned on the NL input. This model not only allows for efficient search over the space of syntactically valid programs, but also allows us to leverage automated program analysis techniques for pruning the search space based on infeasibility of partial programs with respect to the user’s constraints. The experimental results on a multimodal synthesis dataset (StructuredRegex) show that our method substantially outperforms prior state-of-the-art techniques in terms of accuracy and efficiency, and finds model-optimal programs more frequently.

pdf bib
Sent2Span: Span Detection for PICO Extraction in the Biomedical Text without Span Annotations
Shifeng Liu | Yifang Sun | Bing Li | Wei Wang | Florence T. Bourgeois | Adam G. Dunn

The rapid growth in published clinical trials makes it difficult to maintain up-to-date systematic reviews, which require finding all relevant trials. This leads to policy and practice decisions based on out-of-date, incomplete, and biased subsets of available clinical evidence. Extracting and then normalising Population, Intervention, Comparator, and Outcome (PICO) information from clinical trial articles may be an effective way to automatically assign trials to systematic reviews and avoid searching and screening, the two most time-consuming systematic review processes. We propose and test a novel approach to PICO span detection. The major difference between our proposed method and previous approaches comes from detecting spans without needing annotated span data and using only crowdsourced sentence-level annotations. Experiments on two datasets show that our PICO span detection achieves much higher recall than fully supervised methods, with PICO sentence detection at least as good as human annotations. By removing the reliance on expert annotations for span detection, this work could be used in a human-machine pipeline for turning low-quality, crowdsourced, and sentence-level PICO annotations into structured information that can be used to quickly assign trials to relevant systematic reviews.

pdf bib
When in Doubt: Improving Classification Performance with Alternating Normalization
Menglin Jia | Austin Reiter | Ser-Nam Lim | Yoav Artzi | Claire Cardie

We introduce Classification with Alternating Normalization (CAN), a non-parametric post-processing step for classification. CAN improves classification accuracy for challenging examples by re-adjusting their predicted class probability distribution using the predicted class distributions of high-confidence validation examples. CAN is easily applicable to any probabilistic classifier, with minimal computation overhead. We analyze the properties of CAN using simulated experiments, and empirically demonstrate its effectiveness across a diverse set of classification tasks.
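A minimal sketch of how alternating column and row normalization could re-adjust one uncertain prediction is given below. It is a simplified illustration rather than the CAN procedure itself: the function name `alternating_normalization`, the use of an explicit class prior, the default of one iteration, and the toy numbers are assumptions, and the paper's exact update may differ.

```python
def alternating_normalization(high_conf, uncertain, prior, iterations=1):
    """Re-adjust one uncertain prediction using high-confidence predictions.

    high_conf: list of probability rows from confident examples.
    uncertain: probability row of the example to re-adjust.
    prior:     assumed class prior (one positive value per class).
    All probabilities are assumed to be strictly positive.
    """
    rows = [list(r) for r in high_conf] + [list(uncertain)]
    n_classes = len(prior)
    for _ in range(iterations):
        # Column step: scale every class column so that it sums to 1.
        col_sums = [sum(r[j] for r in rows) for j in range(n_classes)]
        rows = [[r[j] / col_sums[j] for j in range(n_classes)] for r in rows]
        # Row step: reweight by the class prior, then renormalise each row
        # so that it is a probability distribution again.
        rows = [[r[j] * prior[j] for j in range(n_classes)] for r in rows]
        rows = [[v / sum(r) for v in r] for r in rows]
    return rows[-1]  # the re-adjusted uncertain example


# Confident examples lean heavily towards class 0 although the prior is
# balanced, so the uncertain example is nudged towards class 1.
confident = [[0.9, 0.1], [0.95, 0.05], [0.9, 0.1]]
adjusted = alternating_normalization(confident, [0.55, 0.45], prior=[0.5, 0.5])
```

The column step discounts classes that the confident examples already account for, which is why a near-tie can flip after re-adjustment.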

pdf bib
APGN: Adversarial and Parameter Generation Networks for Multi-Source Cross-Domain Dependency Parsing
Ying Li | Meishan Zhang | Zhenghua Li | Min Zhang | Zhefeng Wang | Baoxing Huai | Nicholas Jing Yuan

Thanks to the strong representation learning capability of deep learning, especially pre-training techniques with language model loss, dependency parsing has achieved a great performance boost in the in-domain scenario with abundant labeled training data for target domains. However, the parsing community has to face the more realistic setting where the parsing performance drops drastically when labeled data only exists for several fixed out-domains. In this work, we propose a novel model for multi-source cross-domain dependency parsing. The model consists of two components, i.e., a parameter generation network for distinguishing domain-specific features, and an adversarial network for learning domain-invariant representations. Experiments on a recently released NLPCC-2019 dataset for multi-domain dependency parsing show that our model can consistently improve cross-domain parsing performance by about 2 points in averaged labeled attachment score (LAS) over strong BERT-enhanced baselines. Detailed analysis is conducted to gain more insights on contributions of the two components.

pdf bib
“Let Your Characters Tell Their Story”: A Dataset for Character-Centric Narrative Understanding
Faeze Brahman | Meng Huang | Oyvind Tafjord | Chao Zhao | Mrinmaya Sachan | Snigdha Chaturvedi

When reading a literary piece, readers often make inferences about various characters’ roles, personalities, relationships, intents, actions, etc. While humans can readily draw upon their past experiences to build such a character-centric view of the narrative, understanding characters in narratives can be a challenging task for machines. To encourage research in this field of character-centric narrative understanding, we present LiSCU – a new dataset of literary pieces and their summaries paired with descriptions of characters that appear in them. We also introduce two new tasks on LiSCU: Character Identification and Character Description Generation. Our experiments with several pre-trained language models adapted for these tasks demonstrate that there is a need for better models of narrative comprehension.

pdf bib
Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation
Humair Raj Khan | Deepak Gupta | Asif Ekbal

Pre-trained language-vision models have shown remarkable performance on the visual question answering (VQA) task. However, most pre-trained models are trained by only considering monolingual learning, especially in resource-rich languages like English. Training such models for multilingual setups demands high computing resources and multilingual language-vision datasets, which hinders their application in practice. To alleviate these challenges, we propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student). Unlike the existing knowledge distillation methods, which only use the output from the last layer of the teacher network for distillation, our student model learns and imitates the teacher from multiple intermediate layers (language and vision encoders) with appropriately designed distillation objectives for incremental knowledge extraction. We also create a large-scale multilingual and code-mixed VQA dataset in eleven different language setups considering multiple Indian and European languages. Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over the pre-trained language-vision models on eleven diverse language setups.

pdf bib
An Iterative Multi-Knowledge Transfer Network for Aspect-Based Sentiment Analysis
Yunlong Liang | Fandong Meng | Jinchao Zhang | Yufeng Chen | Jinan Xu | Jie Zhou

Aspect-based sentiment analysis (ABSA) mainly involves three subtasks: aspect term extraction, opinion term extraction, and aspect-level sentiment classification, which are typically handled in a separate or joint manner. However, previous approaches do not well exploit the interactive relations among three subtasks and do not pertinently leverage the easily available document-level labeled domain/sentiment knowledge, which restricts their performances. To address these issues, we propose a novel Iterative Multi-Knowledge Transfer Network (IMKTN) for end-to-end ABSA. For one thing, through the interactive correlations between the ABSA subtasks, our IMKTN transfers the task-specific knowledge from any two of the three subtasks to another one at the token level by utilizing a well-designed routing algorithm, that is, any two of the three subtasks will help the third one. For another, our IMKTN pertinently transfers the document-level knowledge, i.e., domain-specific and sentiment-related knowledge, to the aspect-level subtasks to further enhance the corresponding performance. Experimental results on three benchmark datasets demonstrate the effectiveness and superiority of our approach.

pdf bib
Semantic Alignment with Calibrated Similarity for Multilingual Sentence Embedding
Jiyeon Ham | Eun-Sol Kim

Measuring the similarity score between a pair of sentences in different languages is the essential requisite for multilingual sentence embedding methods. Predicting the similarity score consists of two sub-tasks, which are monolingual similarity evaluation and multilingual sentence retrieval. However, conventional methods have mainly tackled only one of the sub-tasks and therefore showed biased performances. In this paper, we suggest a novel and strong method for multilingual sentence embedding, which shows performance improvement on both sub-tasks, consequently resulting in robust predictions of multilingual similarity scores. The suggested method consists of two parts: to learn semantic similarity of sentences in the pivot language and then to extend the learned semantic structure to different languages. To align semantic structures across different languages, we introduce a teacher-student network. The teacher network distills the knowledge of the pivot language to different languages of the student network. During the distillation, the parameters of the teacher network are updated with the slow-moving average. Together with the distillation and the parameter updating, the semantic structure of the student network can be directly aligned across different languages while preserving the ability to measure the semantic similarity. Thus, the multilingual training method drives performance improvement on multilingual similarity evaluation. The suggested model achieves the state-of-the-art performance on extended STS 2017 multilingual similarity evaluation as well as two sub-tasks, which are extended STS 2017 monolingual similarity evaluation and Tatoeba multilingual retrieval in 14 languages.
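The slow-moving average update of the teacher can be illustrated with a small sketch. It assumes the average is taken over the student's parameters, which the abstract does not state explicitly; the function name `ema_update`, the dictionary representation of parameters, and the momentum value 0.99 are hypothetical.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters moved slightly towards the student's.

    teacher, student: dicts mapping a parameter name to a float value.
    A momentum close to 1 means the teacher changes slowly.
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
# One update moves each teacher value 1% of the way towards the student.
teacher = ema_update(teacher, student)
```

In a real model the same rule would be applied tensor by tensor after each student optimisation step.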

pdf bib
fBERT: A Neural Transformer for Identifying Offensive Content
Diptanu Sarkar | Marcos Zampieri | Tharindu Ranasinghe | Alexander Ororbia

Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances. We evaluate fBERT’s performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.

pdf bib
WIKIBIAS: Detecting Multi-Span Subjective Biases in Language
Yang Zhong | Jingfeng Yang | Wei Xu | Diyi Yang

Biases continue to be prevalent in modern text and media, especially subjective bias – a special type of bias that introduces improper attitudes or presents a statement with the presupposition of truth. To tackle the problem of detecting and further mitigating subjective bias, we introduce a manually annotated parallel corpus WIKIBIAS with more than 4,000 sentence pairs from Wikipedia edits. This corpus contains annotations towards both sentence-level bias types and token-level biased segments. We present systematic analyses of our dataset and results achieved by a set of state-of-the-art baselines in terms of three tasks: bias classification, tagging biased segments, and neutralizing biased text. We find that current models still struggle with detecting multi-span biases despite their reasonable performances, suggesting that our dataset can serve as a useful research benchmark. We also demonstrate that models trained on our dataset can generalize well to multiple domains such as news and political speeches.

pdf bib
UnClE: Explicitly Leveraging Semantic Similarity to Reduce the Parameters of Word Embeddings
Zhi Li | Yuchen Zhai | Chengyu Wang | Minghui Qiu | Kailiang Li | Yin Zhang

Natural language processing (NLP) models often require a massive number of parameters for word embeddings, which limits their application on mobile devices. Researchers have employed many approaches, e.g. adaptive inputs, to reduce the parameters of word embeddings. However, existing methods rarely pay attention to semantic information. In this paper, we propose a novel method called Unique and Class Embeddings (UnClE), which explicitly leverages semantic similarity with weight sharing to reduce the dimensionality of word embeddings. Inspired by the fact that words with similar semantics can share part of their weights, we divide the embeddings of words into two parts: a unique embedding and a class embedding. The former is a one-to-one mapping like a traditional embedding, while the latter is a many-to-one mapping that learns the representation of class information. Our method is suitable for both word-level and sub-word-level models and can be used to reduce both input and output embeddings. Experimental results on the standard WMT 2014 English-German dataset show that our method is able to reduce the parameters of word embeddings by more than 11x while retaining about 93% of the performance in terms of BLEU. For the language modeling task, our model can reduce word embeddings by 6x or 11x on the PTB/WT2 datasets at the cost of a certain degree of performance degradation.
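
To make the unique/class split concrete, here is a minimal sketch of the parameter arithmetic and of a lookup. It assumes the two parts are concatenated and that a word-to-class mapping is given. All sizes, names, and the mapping are hypothetical and are not taken from the paper.

```python
def full_embedding_params(vocab_size, dim):
    """Parameters of a traditional table: one dim-sized row per word."""
    return vocab_size * dim


def split_embedding_params(vocab_size, num_classes, unique_dim, class_dim):
    """Parameters when each word keeps a small unique row (one-to-one)
    and shares a class row with similar words (many-to-one)."""
    return vocab_size * unique_dim + num_classes * class_dim


def lookup(word_id, word_to_class, unique_table, class_table):
    """Assumed composition: concatenate the unique part and the shared class part."""
    return unique_table[word_id] + class_table[word_to_class[word_id]]


# Hypothetical sizes, chosen only to illustrate the arithmetic.
baseline = full_embedding_params(vocab_size=32000, dim=512)
compact = split_embedding_params(vocab_size=32000, num_classes=1000,
                                 unique_dim=32, class_dim=480)
print(baseline, compact)

# Tiny lookup example: words 0 and 1 share class 0, word 2 is in class 1.
unique_table = [[0.1], [0.2], [0.3]]
class_table = [[1.0, 1.0], [2.0, 2.0]]
word_to_class = [0, 0, 1]
print(lookup(1, word_to_class, unique_table, class_table))
```

The saving comes from the class table having far fewer rows than the vocabulary, so most of each word's vector is stored once per class rather than once per word.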

pdf bib
Grounded Graph Decoding improves Compositional Generalization in Question Answering
Yu Gai | Paras Jain | Wendi Zhang | Joseph Gonzalez | Dawn Song | Ion Stoica

Question answering models struggle to generalize to novel compositions of training patterns. Current end-to-end models learn a flat input embedding which can lose input syntax context. Prior approaches improve generalization by learning permutation invariant models, but these methods do not scale to more complex train-test splits. We propose Grounded Graph Decoding, a method to improve compositional generalization of language representations by grounding structured predictions with an attention mechanism. Grounding enables the model to retain syntax information from the input, which significantly improves generalization to complex inputs. By predicting a structured graph containing conjunctions of query clauses, we learn a group invariant representation without making assumptions on the target domain. Our model performs competitively on the Compositional Freebase Questions (CFQ) dataset, a challenging benchmark for compositional generalization in question answering. In particular, our model effectively solves the MCD1 split with 98% accuracy. All source code is available at https://github.com/gaiyu0/cfq.

pdf bib
Enhancing Visual Dialog Questioner with Entity-based Strategy Learning and Augmented Guesser
Duo Zheng | Zipeng Xu | Fandong Meng | Xiaojie Wang | Jiaan Wang | Jie Zhou

Considering the importance of building a good Visual Dialog (VD) Questioner, many researchers study the topic under a Q-Bot-A-Bot image-guessing game setting, where the Questioner needs to raise a series of questions to collect information about an undisclosed image. Although progress has been made in Supervised Learning (SL) and Reinforcement Learning (RL), issues still exist. Firstly, previous methods do not provide explicit and effective guidance for the Questioner to generate visually related and informative questions. Secondly, the effect of RL is hampered by an incompetent component, i.e., the Guesser, which makes image predictions based on the generated dialogs and assigns rewards accordingly. To enhance the VD Questioner: 1) we propose a Related entity enhanced Questioner (ReeQ) that generates questions under the guidance of related entities and learns an entity-based questioning strategy from human dialogs; 2) we propose an Augmented Guesser that is strong and is optimized specifically for VD. Experimental results on the VisDial v1.0 dataset show that our approach achieves state-of-the-art performance on both the image-guessing task and question diversity. A human study further verifies that our model generates more visually related, informative and coherent questions.

pdf bib
A Pretraining Numerical Reasoning Model for Ordinal Constrained Question Answering on Knowledge Base
Yu Feng | Jing Zhang | Gaole He | Wayne Xin Zhao | Lemao Liu | Quan Liu | Cuiping Li | Hong Chen

Knowledge Base Question Answering (KBQA) is the task of answering natural language questions posed over knowledge bases (KBs). This paper aims to empower IR-based KBQA models with the ability to perform numerical reasoning for answering ordinal constrained questions. A major challenge is the lack of explicit annotations about numerical properties. To address this challenge, we propose a pretraining numerical reasoning model consisting of NumGNN and NumTransformer, guided by explicit self-supervision signals. The two modules are pretrained to encode the magnitude and ordinal properties of numbers respectively and can serve as model-agnostic plugins for any IR-based KBQA model to enhance its numerical reasoning ability. Extensive experiments on two KBQA benchmarks verify the effectiveness of our method in enhancing the numerical reasoning ability of IR-based KBQA models.

pdf bib
RoR: Read-over-Read for Long Document Machine Reading Comprehension
Jing Zhao | Junwei Bao | Yifan Wang | Yongwei Zhou | Youzheng Wu | Xiaodong He | Bowen Zhou

Transformer-based pre-trained models, such as BERT, have achieved remarkable results on machine reading comprehension. However, due to the constraint of encoding length (e.g., 512 WordPiece tokens), a long document is usually split into multiple chunks that are independently read. This limits the reading field to individual chunks, with no information collaboration, in long document machine reading comprehension. To address this problem, we propose RoR, a read-over-read method, which expands the reading field from chunk to document. Specifically, RoR includes a chunk reader and a document reader. The former first predicts a set of regional answers for each chunk, which are then compacted into a highly-condensed version of the original document that is guaranteed to be encoded once. The latter further predicts the global answers from this condensed document. Eventually, a voting strategy is utilized to aggregate and rerank the regional and global answers for final prediction. Extensive experiments on two benchmarks, QuAC and TriviaQA, demonstrate the effectiveness of RoR for long document reading. Notably, RoR ranks first on the QuAC leaderboard (https://quac.ai/) at the time of submission (May 17th, 2021).
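
The abstract does not spell out the voting strategy. As one possible reading, the sketch below sums the scores of identical answer strings across the regional and global candidates and then reranks by the total. The normalization, the scores, and the function name `vote` are assumptions for illustration only.

```python
from collections import defaultdict


def vote(regional, global_answers):
    """Aggregate (answer_text, score) pairs and rerank by summed score.

    Assumption: two candidates count as the same answer when their
    lower-cased, stripped text matches.
    """
    totals = defaultdict(float)
    for text, score in regional + global_answers:
        totals[text.strip().lower()] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidates from the chunk reader and the document reader.
regional = [("Paris", 0.5), ("London", 0.625)]
global_answers = [("paris", 0.25)]

ranked = vote(regional, global_answers)
print(ranked[0])
```

In this toy case the answer supported by both readers overtakes a regional answer that had the highest individual score.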

pdf bib
Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing
Akshat Shrivastava | Pierce Chuang | Arun Babu | Shrey Desai | Abhinav Arora | Alexander Zotov | Ahmed Aly

An effective recipe for building seq2seq, non-autoregressive, task-oriented parsers to map utterances to semantic frames proceeds in three steps: encoding an utterance x, predicting a frame’s length |y|, and decoding a |y|-sized frame with utterance and ontology tokens. Though empirically strong, these models are typically bottlenecked by length prediction, as even small inaccuracies change the syntactic and semantic characteristics of resulting frames. In our work, we propose span pointer networks, non-autoregressive parsers which shift the decoding task from text generation to span prediction; that is, when imputing utterance spans into frame slots, our model produces endpoints (e.g., [i, j]) as opposed to text (e.g., “6pm”). This natural quantization of the output space reduces the variability of gold frames, therefore improving length prediction and, ultimately, exact match. Furthermore, length prediction is now responsible for frame syntax and the decoder is responsible for frame semantics, resulting in a coarse-to-fine model. We evaluate our approach on several task-oriented semantic parsing datasets. Notably, we bridge the quality gap between non-autoregressive and autoregressive parsers, achieving 87 EM on TOPv2 (Chen et al. 2020). Furthermore, due to our more consistent gold frames, we show strong improvements in model generalization in both cross-domain and cross-lingual transfer in low-resource settings. Finally, due to our diminished output vocabulary, we observe a 70% reduction in latency and an 83% reduction in memory at beam size 5 compared to prior non-autoregressive parsers.
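
The shift from text to endpoints can be illustrated with a small sketch. It assumes whitespace tokenization and inclusive [i, j] indices, neither of which is specified in the abstract, and the helper names are hypothetical.

```python
def to_span_pointer(tokens, slot_tokens):
    """Return inclusive [i, j] endpoints of the first occurrence of slot_tokens."""
    n, m = len(tokens), len(slot_tokens)
    if m == 0:
        raise ValueError("slot text must contain at least one token")
    for i in range(n - m + 1):
        if tokens[i:i + m] == slot_tokens:
            return [i, i + m - 1]
    raise ValueError("slot text not found in utterance")


def from_span_pointer(tokens, span):
    """Recover the slot text from inclusive endpoints."""
    i, j = span
    return tokens[i:j + 1]


utterance = "set an alarm for 6pm tomorrow".split()

# A one-token slot always becomes two endpoints, whatever its text is.
short_span = to_span_pointer(utterance, ["6pm"])
long_span = to_span_pointer(utterance, ["6pm", "tomorrow"])
print(short_span, long_span)
print(from_span_pointer(utterance, long_span))
```

Both slots occupy exactly two output positions regardless of how many tokens they cover, which is the sense in which pointing to spans makes frame lengths less variable than copying text.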

pdf bib
Language Resource Efficient Learning for Captioning
Jia Chen | Yike Wu | Shiwan Zhao | Qin Jin

Due to the complex cognitive and inferential efforts involved in manually generating one caption per image/video input, human annotation resources are very limited for captioning tasks. We define language resource efficiency as reaching the same performance with fewer annotated captions per input. We first study the performance degradation of caption models in different language resource settings. Our analysis of caption models with SC loss shows that the performance degradation is caused by the increasingly noisy estimation of reward and baseline with fewer language resources. To mitigate this issue, we propose to reduce the variance of noise in the baseline by generalizing the single pairwise comparison in SC loss and using multiple generalized pairwise comparisons. The generalized pairwise comparison (GPC) measures the difference between the evaluation scores of two captions with respect to an input. Empirically, we show that the model trained with the proposed GPC loss is efficient in its use of language resources and achieves performance similar to that of the state-of-the-art models on MSCOCO while using only half of the language resources. Furthermore, our model significantly outperforms the state-of-the-art models on a video caption dataset that has only one labeled caption per input in the training set.

pdf bib
Translation as Cross-Domain Knowledge: Attention Augmentation for Unsupervised Cross-Domain Segmenting and Labeling Tasks
Ruixuan Luo | Yi Zhang | Sishuo Chen | Xu Sun

The lack of word delimiters or inflection that could indicate segment boundaries or word semantics increases the difficulty of Chinese text understanding, and also intensifies the demand for word-level semantic knowledge to accomplish the tagging goal in Chinese segmenting and labeling tasks. However, for unsupervised Chinese cross-domain segmenting and labeling tasks, the model trained on the source domain frequently suffers from deficient word-level semantic knowledge of the target domain. To address this issue, we propose a novel paradigm based on attention augmentation to introduce crucial cross-domain knowledge via a translation system. The proposed paradigm enables the model attention to draw cross-domain knowledge indicated by the implicit word-level cross-lingual alignment between the input and its corresponding translation. Aside from the model requiring cross-lingual input, we also establish an off-the-shelf model which avoids the dependency on cross-lingual translations. Experiments demonstrate that our proposal significantly advances the state-of-the-art results of cross-domain Chinese segmenting and labeling tasks.

pdf bib
ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
Yuta Koreeda | Christopher Manning

Reviewing contracts is a time-consuming procedure that imposes large expenses on companies and creates social inequality for those who cannot afford it. In this work, we propose “document-level natural language inference (NLI) for contracts”, a novel, real-world application of NLI that addresses such problems. In this task, a system is given a set of hypotheses (such as “Some obligations of Agreement may survive termination.”) and a contract, and it is asked to classify whether each hypothesis is “entailed by”, “contradicting to” or “not mentioned by” (neutral to) the contract as well as to identify “evidence” for the decision as spans in the contract. We annotate and release the largest corpus to date, consisting of 607 annotated contracts. We then show that existing models fail badly on our task and introduce a strong baseline, which (a) models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens, and (b) employs more sophisticated context segmentation for dealing with long documents. We also show that linguistic characteristics of contracts, such as negations by exceptions, contribute to the difficulty of this task and that there is much room for improvement.

pdf bib
Japanese Zero Anaphora Resolution Can Benefit from Parallel Texts Through Neural Transfer Learning
Masato Umakoshi | Yugo Murawaki | Sadao Kurohashi

Parallel texts of Japanese and a non-pro-drop language have the potential of improving the performance of Japanese zero anaphora resolution (ZAR) because pronouns dropped in the former are usually mentioned explicitly in the latter. However, rule-based cross-lingual transfer is hampered by error propagation in an NLP pipeline and the frequent lack of transparency in translation correspondences. In this paper, we propose implicit transfer by injecting machine translation (MT) as an intermediate task between pretraining and ZAR. We employ a pretrained BERT model to initialize the encoder part of the encoder-decoder model for MT, and eject the encoder part for fine-tuning on ZAR. The proposed framework empirically demonstrates that ZAR performance can be improved by transfer learning from MT. In addition, we find that the incorporation of the masked language model training into MT leads to further gains.

pdf bib
Grouped-Attention for Content-Selection and Content-Plan Generation
Bayu Distiawan Trisedya | Xiaojie Wang | Jianzhong Qi | Rui Zhang | Qingjun Cui

Content-planning is an essential part of data-to-text generation to determine the order of data mentioned in generated texts. Recent neural data-to-text generation models employ Pointer Networks to explicitly learn a content-plan given a set of attributes as input. They use an LSTM to encode the input, which assumes a sequential relationship in the input. This may be sub-optimal for encoding a set of attributes, where the attributes have a composite structure: the attributes are disordered while each attribute value is an ordered list of tokens. We handle this problem by proposing a neural content-planner that can capture both local and global contexts of such a structure. Specifically, we propose a novel attention mechanism called GSC-attention. A key component of the GSC-attention is grouped-attention, which is token-level attention constrained within each input attribute and which enables our proposed model to capture both local and global context. Moreover, our content-planner explicitly learns content-selection, which is integrated into the content-planner to select the most important data to be included in the generated text via an attention masking procedure. Experimental results show that our model outperforms the competitors by 4.92%, 4.70%, and 16.56% in terms of Damerau-Levenshtein Distance scores on three real-world datasets.
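
As a rough illustration of token-level attention constrained within each input attribute, the sketch below builds a boolean mask from per-token attribute ids and applies it in a softmax. It shows only the masking idea, not the GSC-attention itself. The attribute ids, the scores, and the function names are hypothetical.

```python
import math


def grouped_attention_mask(group_ids):
    """mask[i][j] is True only when tokens i and j belong to the same attribute."""
    return [[gi == gj for gj in group_ids] for gi in group_ids]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, with masked-out positions forced to zero."""
    exps = [math.exp(s) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)  # never zero here: a token always shares a group with itself
    return [e / total for e in exps]


# Two attributes: tokens 0-1 belong to attribute 0, tokens 2-4 to attribute 1.
group_ids = [0, 0, 1, 1, 1]
mask = grouped_attention_mask(group_ids)

# Uniform raw scores for token 0, so the weights depend only on the mask.
weights = masked_softmax([0.0] * 5, mask[0])
print(weights)
```

Token 0 spreads its attention only over the two tokens of its own attribute, so the order of tokens inside an attribute value is respected without imposing an order across attributes.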

pdf bib
An Explicit-Joint and Supervised-Contrastive Learning Framework for Few-Shot Intent Classification and Slot Filling
Han Liu | Feng Zhang | Xiaotong Zhang | Siyang Zhao | Xianchao Zhang

Intent classification (IC) and slot filling (SF) are critical building blocks in task-oriented dialogue systems. These two tasks are closely related and can benefit each other. Since only a few utterances can be utilized for identifying fast-emerging new intents and slots, the data scarcity issue often occurs when implementing IC and SF. However, few IC/SF models perform well when the number of training samples per class is quite small. In this paper, we propose a novel explicit-joint and supervised-contrastive learning framework for few-shot intent classification and slot filling. Its highlights are as follows. (i) The model extracts intent and slot representations via bidirectional interactions, and extends the prototypical network to achieve explicit-joint learning, which guarantees that the IC and SF tasks can mutually reinforce each other. (ii) The model integrates supervised contrastive learning, which ensures that samples from the same class are pulled together and samples from different classes are pushed apart. In addition, the model follows an uncommon but practical way to construct the episode, which gets rid of the traditional setting with fixed way and shot, and allows for unbalanced datasets. Extensive experiments on three public datasets show that our model can achieve promising performance.

pdf bib
Retrieve, Discriminate and Rewrite: A Simple and Effective Framework for Obtaining Affective Response in Retrieval-Based Chatbots
Xin Lu | Yijian Tian | Yanyan Zhao | Bing Qin

Obtaining affective responses is a key step in building empathetic dialogue systems. This task has been studied extensively in generation-based chatbots, but the related research in retrieval-based chatbots is still at an early stage. Existing works in retrieval-based chatbots are based on the Retrieve-and-Rerank framework, which has the common problem of satisfying the affect label at the expense of response quality. To address this problem, we propose a simple and effective Retrieve-Discriminate-Rewrite framework. The framework replaces the reranking mechanism with a new discriminate-and-rewrite mechanism, which predicts the affect label of the retrieved high-quality response via a discrimination module and further rewrites the affect-unsatisfied response via a rewriting module. This can not only guarantee the quality of the response, but also satisfy the given affect label. In addition, another challenge for this line of research is the lack of an off-the-shelf affective response dataset. To address this problem and test our proposed framework, we annotate a Sentimental Douban Conversation Corpus based on the original Douban Conversation Corpus. Experimental results show that our proposed framework is effective and outperforms competitive baselines.

pdf bib
Span Fine-tuning for Pre-trained Language Models
Rongzhou Bao | Zhuosheng Zhang | Hai Zhao

Pre-trained language models (PrLMs) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words. Previous works have shown that incorporating span-level information over consecutive words in pre-training could further improve the performance of PrLMs. However, given that span-level clues are introduced and fixed in pre-training, previous methods are time-consuming and lack flexibility. To alleviate the inconvenience, this paper presents a novel span fine-tuning method for PrLMs, which allows the span setting to be adaptively determined by specific downstream tasks during the fine-tuning phase. In detail, any sentence processed by the PrLM will be segmented into multiple spans according to a pre-sampled dictionary. Then the segmentation information will be sent through a hierarchical CNN module together with the representation outputs of the PrLM to ultimately generate a span-enhanced representation. Experiments on the GLUE benchmark show that the proposed span fine-tuning method significantly enhances the PrLM and, at the same time, offers more flexibility in an efficient way.

pdf bib
DIRECT: Direct and Indirect Responses in Conversational Text Corpus
Junya Takayama | Tomoyuki Kajiwara | Yuki Arase

We create a large-scale dialogue corpus that provides pragmatic paraphrases to advance technology for understanding the underlying intentions of users. While neural conversation models acquire the ability to generate fluent responses through training on a dialogue corpus, previous corpora have mainly focused on the literal meanings of utterances. However, in reality, people do not always present their intentions directly. For example, if a person said to the operator of a reservation service “I don’t have enough budget.”, they, in fact, mean “please find a cheaper option for me.” Our corpus provides a total of 71,498 indirect–direct utterance pairs accompanied by a multi-turn dialogue history extracted from the MultiWoZ dataset. In addition, we propose three tasks to benchmark the ability of models to recognize and generate indirect and direct utterances. We also investigated the performance of state-of-the-art pre-trained models as baselines.

pdf bib
Retrieval, Analogy, and Composition: A framework for Compositional Generalization in Image Captioning
Zhan Shi | Hui Liu | Martin Renqiang Min | Christopher Malon | Li Erran Li | Xiaodan Zhu

Image captioning systems are expected to have the ability to combine individual concepts when describing scenes with concept combinations that are not observed during training. In spite of significant progress in image captioning with the help of the autoregressive generation framework, current approaches fail to generalize well to novel concept combinations. We propose a new framework that revolves around probing several similar image caption training instances (retrieval), performing analogical reasoning over relevant entities in retrieved prototypes (analogy), and enhancing the generation process with reasoning outcomes (composition). Our method augments the generation model by referring to the neighboring instances in the training set to produce novel concept combinations in generated captions. We perform experiments on the widely used image captioning benchmarks. The proposed models achieve substantial improvement over the compared baselines on both composition-related evaluation metrics and conventional image captioning metrics.

pdf bib
TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation
Adaku Uchendu | Zeyu Ma | Thai Le | Rui Zhang | Dongwon Lee

Recent progress in generative language models has enabled machines to generate astonishingly realistic texts. While there are many legitimate applications of such models, there is also a rising need to distinguish machine-generated texts from human-written ones (e.g., fake news detection). However, to the best of our knowledge, there is currently no benchmark environment with datasets and tasks to systematically study the so-called “Turing Test” problem for neural text generation methods. In this work, we present the TURINGBENCH benchmark environment, which comprises (1) a dataset with 200K human- or machine-generated samples across 20 labels (Human, GPT-1, GPT-2_small, GPT-2_medium, GPT-2_large, GPT-2_xl, GPT-2_PyTorch, GPT-3, GROVER_base, GROVER_large, GROVER_mega, CTRL, XLM, XLNET_base, XLNET_large, FAIR_wmt19, FAIR_wmt20, TRANSFORMER_XL, PPLM_distil, PPLM_gpt2), (2) two benchmark tasks, i.e., Turing Test (TT) and Authorship Attribution (AA), and (3) a website with leaderboards. Our preliminary experimental results using TURINGBENCH show that GPT-3 and FAIR_wmt20 are the current winners, among all language models tested, in generating the most human-like, indistinguishable texts, with the lowest F1 scores from five state-of-the-art TT detection models. TURINGBENCH is available at: https://turingbench.ist.psu.edu/

pdf bib
Say ‘YES’ to Positivity: Detecting Toxic Language in Workplace Communications
Meghana Moorthy Bhat | Saghar Hosseini | Ahmed Hassan Awadallah | Paul Bennett | Weisheng Li

Workplace communication (e.g. email, chat) is a central part of enterprise productivity. Healthy conversations are crucial for creating an inclusive environment and maintaining harmony in an organization. Toxic communications at the workplace can negatively impact overall job satisfaction and are often subtle, hidden, or demonstrate human biases. The linguistic subtlety of mild yet hurtful conversations has made it difficult for researchers to quantify and extract toxic conversations automatically. While offensive language or hate speech has been extensively studied in social communities, there has been little work studying toxic communication in emails. Specifically, the lack of a corpus, the sparsity of toxicity in enterprise emails, and the absence of well-defined criteria for annotating toxic conversations have prevented researchers from addressing the problem at scale. We take the first step towards studying toxicity in workplace emails by providing (1) a general and computationally viable taxonomy to study toxic language at the workplace, (2) a dataset to study toxic language at the workplace based on the taxonomy, and (3) an analysis of why offensive language and hate-speech datasets are not suitable to detect workplace toxicity.

pdf bib
Natural SQL: Making SQL Easier to Infer from Natural Language Specifications
Yujian Gan | Xinyun Chen | Jinxia Xie | Matthew Purver | John R. Woodward | John Drake | Qiaofu Zhang

Addressing the mismatch between natural language descriptions and the corresponding SQL queries is a key challenge for text-to-SQL translation. To bridge this gap, we propose an SQL intermediate representation (IR) called Natural SQL (NatSQL). Specifically, NatSQL preserves the core functionalities of SQL, while it simplifies the queries as follows: (1) dispensing with operators and keywords such as GROUP BY, HAVING, FROM, JOIN ON, whose counterparts are usually hard to find in the text descriptions; (2) removing the need for nested subqueries and set operators; and (3) making the schema linking easier by reducing the required number of schema items. On Spider, a challenging text-to-SQL benchmark that contains complex and nested SQL queries, we demonstrate that NatSQL outperforms other IRs, and significantly improves the performance of several previous SOTA models. Furthermore, for existing models that do not support executable SQL generation, NatSQL easily enables them to generate executable SQL queries, and achieves the new state-of-the-art execution accuracy.

pdf bib
Mitigating Data Scarceness through Data Synthesis, Augmentation and Curriculum for Abstractive Summarization
Ahmed Magooda | Diane Litman

This paper explores three simple data manipulation techniques (synthesis, augmentation, curriculum) for improving abstractive summarization models without the need for any additional data. We introduce a method of data synthesis with paraphrasing, a data augmentation technique with sample mixing, and curriculum learning with two new difficulty metrics based on specificity and abstractiveness. We conduct experiments to show that these three techniques can help improve abstractive summarization across two summarization models and two different small datasets. Furthermore, we show that these techniques can improve performance when applied in isolation and when combined.

pdf bib
Self- and Pseudo-self-supervised Prediction of Speaker and Key-utterance for Multi-party Dialogue Reading Comprehension
Yiyang Li | Hai Zhao

Multi-party dialogue machine reading comprehension (MRC) poses a tremendous challenge since it involves multiple speakers in one dialogue, resulting in intricate speaker information flows and noisy dialogue contexts. To alleviate such difficulties, previous models focus on how to incorporate this information using complex graph-based modules and additional manually labeled data, which is usually rare in real scenarios. In this paper, we design two labour-free self- and pseudo-self-supervised prediction tasks on speaker and key-utterance to implicitly model the speaker information flows and capture salient clues in a long dialogue. Experimental results on two benchmark datasets demonstrate the effectiveness of our method over competitive baselines and current state-of-the-art models.

pdf bib
Few-Shot Novel Concept Learning for Semantic Parsing
Soham Dan | Osbert Bastani | Dan Roth

Humans are capable of learning novel concepts from very few examples; in contrast, state-of-the-art machine learning algorithms typically need thousands of examples to do so. In this paper, we propose an algorithm for learning novel concepts by representing them as programs over existing concepts. This way the concept learning problem is naturally a program synthesis problem and our algorithm learns from a few examples to synthesize a program representing the novel concept. In addition, we perform a theoretical analysis of our approach for the case where the program defining the novel concept over existing ones is context-free. We show that given a learned grammar-based parser and a novel production rule, we can augment the parser with the production rule in a way that provably generalizes. We evaluate our approach by learning concepts in the semantic parsing domain extended to the few-shot novel concept learning setting, showing that our approach significantly outperforms end-to-end neural semantic parsers.

pdf bib
Compositional Data and Task Augmentation for Instruction Following
Soham Dan | Xinran Han | Dan Roth

Executing natural language instructions in a physically grounded domain requires a model that understands both spatial concepts such as “left of” and “above”, and the compositional language used to identify landmarks and articulate instructions relative to them. In this paper, we study instruction understanding in the blocks world domain. Given an initial arrangement of blocks and a natural language instruction, the system executes the instruction by manipulating selected blocks. The highly compositional instructions are composed of atomic components and understanding these components is a necessary step to executing the instruction. We show that while end-to-end training (supervised only by the correct block location) fails to address the challenges of this task and performs poorly on instructions involving a single atomic component, knowledge-free auxiliary signals can be used to significantly improve performance by providing supervision for the instruction’s components. Specifically, we generate signals that aim at helping the model gradually understand components of the compositional instructions, as well as those that help it better understand spatial concepts, and show their benefit to the overall task for two datasets and two state-of-the-art (SOTA) models, especially when the training data is limited—which is usual in such tasks.

pdf bib
Are Factuality Checkers Reliable? Adversarial Meta-evaluation of Factuality in Summarization
Yiran Chen | Pengfei Liu | Xipeng Qiu

With the continuous improvement of summarization systems driven by deep neural networks, researchers have higher requirements for the quality of generated summaries, which should be not only fluent and informative but also factually correct. As a result, the field of factual evaluation has developed rapidly in recent years. Despite this initial progress in evaluating generated summaries, the meta-evaluation methodologies for factuality metrics remain opaque, leading to an insufficient understanding of the metrics’ relative advantages and applicability. In this paper, we present an adversarial meta-evaluation methodology that allows us to (i) diagnose the fine-grained strengths and weaknesses of 6 existing top-performing metrics over 24 diagnostic test datasets, and (ii) search for directions for further improvement by data augmentation. Our observations from this work motivate us to propose several calls for future research. We make all code, diagnostic test datasets, and trained factuality models available: https://github.com/zide05/AdvFact.

pdf bib
On the Effects of Transformer Size on In- and Out-of-Domain Calibration
Soham Dan | Dan Roth

Large, pre-trained transformer language models, which are pervasive in natural language processing tasks, are notoriously expensive to train. To reduce the cost of training such large models, prior work has developed smaller, more compact models that achieve a significant speedup in training time while maintaining accuracy competitive with the original model on downstream tasks. Though these smaller pre-trained models have been widely adopted by the community, it is not known how well they are calibrated compared to their larger counterparts. In this paper, focusing on a wide range of tasks, we thoroughly investigate the calibration properties of pre-trained transformers as a function of their size. We demonstrate that when evaluated in-domain, smaller models are able to achieve competitive, and often better, calibration compared to larger models, while achieving significant speedup in training time. Post-hoc calibration techniques further reduce calibration error for all models in-domain. However, when evaluated out-of-domain, larger models tend to be better calibrated, and label smoothing is instead an effective strategy to calibrate models in this setting.
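As an illustration of the two notions the abstract relies on, the sketch below computes an expected calibration error (ECE) with equal-width confidence bins and builds a label-smoothed target. The abstract does not say which calibration metric, bin count, or smoothing strength the authors use, so the function names, `n_bins=10`, and `epsilon=0.1` are assumptions chosen for illustration only.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: size-weighted mean of |accuracy - confidence| per bin.

    Assumption: this is one common definition of calibration error, not
    necessarily the one used in the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp the index so that confidence 1.0 falls into the last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def smooth_labels(gold_index, n_classes, epsilon=0.1):
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    uniform_share = epsilon / n_classes
    return [
        uniform_share + (1.0 - epsilon if i == gold_index else 0.0)
        for i in range(n_classes)
    ]
```

A model that reports 0.9 confidence on four predictions but gets only three right is overconfident by 0.15 under this definition; a well-calibrated model would drive that gap toward zero.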

pdf bib
Detecting Polarized Topics Using Partisanship-aware Contextualized Topic Embeddings
Zihao He | Negar Mokhberian | António Câmara | Andres Abeliuk | Kristina Lerman

Growing polarization of the news media has been blamed for fanning disagreement, controversy and even violence. Early identification of polarized topics is thus an urgent matter that can help mitigate conflict. However, accurate measurement of topic-wise polarization is still an open research challenge. To address this gap, we propose Partisanship-aware Contextualized Topic Embeddings (PaCTE), a method to automatically detect polarized topics from partisan news sources. Specifically, utilizing a language model that has been finetuned to recognize the partisanship of news articles, we represent the ideology of a news corpus on a topic by a corpus-contextualized topic embedding and measure the polarization using cosine distance. We apply our method to a dataset of news articles about the COVID-19 pandemic. Extensive experiments on different news sources and topics demonstrate the efficacy of our method in capturing topical polarization, as indicated by its effectiveness in retrieving the most polarized topics.
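The final measurement step described above, comparing two corpora's embeddings of the same topic with cosine distance, can be sketched as follows. How the corpus-contextualized topic embeddings are produced is not covered here; the embedding dictionaries and the helper names `cosine_distance` and `rank_topics_by_polarization` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v); 0 means same direction, 1 means orthogonal."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


def rank_topics_by_polarization(corpus_a, corpus_b):
    """Rank topics shared by two corpora, most polarized first.

    corpus_a / corpus_b: dict mapping a topic name to that corpus's
    topic embedding (hypothetical inputs for this sketch).
    """
    scores = {
        topic: cosine_distance(corpus_a[topic], corpus_b[topic])
        for topic in corpus_a
        if topic in corpus_b
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Topics whose embeddings point in similar directions in both corpora score near 0, while topics that the two sources frame differently score higher and appear first in the ranking.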

pdf bib
GenerativeRE: Incorporating a Novel Copy Mechanism and Pretrained Model for Joint Entity and Relation Extraction
Jiarun Cao | Sophia Ananiadou

Previous neural Seq2Seq models have proven effective for jointly extracting relation triplets. However, most of these models suffer from incompletion and disorder problems when they extract multi-token entities from input sentences. To tackle these problems, we propose a generative, multi-task learning framework named GenerativeRE. We first propose a special entity labelling method on both input and output sequences. During the training stage, GenerativeRE fine-tunes the pre-trained generative model and learns the special entity labels simultaneously. During the inference stage, we propose a novel copy mechanism equipped with three mask strategies to generate the most probable tokens by narrowing the scope of the model decoder. Experimental results show that our model achieves 4.6% and 0.9% F1 score improvements over the current state-of-the-art methods on the NYT24 and NYT29 benchmark datasets, respectively.

pdf bib
Re-entry Prediction for Online Conversations via Self-Supervised Learning
Lingzhi Wang | Xingshan Zeng | Huang Hu | Kam-Fai Wong | Daxin Jiang

In recent years, online discussion and opinion sharing on social media have boomed worldwide. The re-entry prediction task has thus been proposed to help people keep track of the discussions they wish to continue. Nevertheless, existing work focuses only on exploiting chat history and context information, and ignores the potentially useful learning signals underlying conversation data, such as conversation thread patterns and repeated engagement of target users, which help better understand the behavior of target users in conversations. In this paper, we propose three interesting and well-founded auxiliary tasks, namely Spread Pattern, Repeated Target user, and Turn Authorship, as the self-supervised signals for re-entry prediction. These auxiliary tasks are trained together with the main task in a multi-task manner. Experimental results on two datasets newly collected from Twitter and Reddit show that our method outperforms the previous state of the art with fewer parameters and faster convergence. Extensive experiments and analysis show the effectiveness of our proposed models and also point out some key ideas in designing self-supervised tasks.

pdf bib
proScript: Partially Ordered Scripts Generation
Keisuke Sakaguchi | Chandra Bhagavatula | Ronan Le Bras | Niket Tandon | Peter Clark | Yejin Choi

Scripts – prototypical event sequences describing everyday activities – have been shown to help understand narratives by providing expectations, resolving ambiguity, and filling in unstated information. However, to date they have proved hard to author or extract from text. In this work, we demonstrate for the first time that pre-trained neural language models can be finetuned to generate high-quality scripts, at varying levels of granularity, for a wide range of everyday scenarios (e.g., bake a cake). To do this, we collect a large (6.4k) crowdsourced dataset of partially ordered scripts (named proScript) that is substantially larger than prior datasets, and develop models that generate scripts by combining language generation and graph structure prediction. We define two complementary tasks: (i) edge prediction: given a scenario and unordered events, organize the events into a valid (possibly partial-order) script, and (ii) script generation: given only a scenario, generate events and organize them into a (possibly partial-order) script. Our experiments show that our models perform well (e.g., F1=75.7 on task (i)), illustrating a new approach to overcoming previous barriers to script collection. We also show that there is still significant room for improvement toward human-level performance. Together, our tasks, dataset, and models offer a new research direction for learning script knowledge.
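To make the edge prediction task concrete, the sketch below checks whether a set of predicted "before/after" edges forms a valid partial order (that is, contains no cycle) and scores predicted edges against gold edges with F1. The abstract does not spell out how validity or F1 is computed, so exact-match edge F1 and the function names here are assumptions, and the example events are invented.

```python
from collections import deque


def is_valid_partial_order(events, edges):
    """Kahn's algorithm: True if the directed edges over events are acyclic."""
    indegree = {event: 0 for event in events}
    successors = {event: [] for event in events}
    for before, after in edges:
        if before not in indegree or after not in indegree:
            return False  # an edge mentions an event outside the script
        successors[before].append(after)
        indegree[after] += 1
    queue = deque(event for event in indegree if indegree[event] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    # Every event is visited exactly when no cycle blocks the ordering.
    return visited == len(indegree)


def edge_f1(predicted_edges, gold_edges):
    """Exact-match F1 over directed edges (an assumed scoring scheme)."""
    predicted, gold = set(predicted_edges), set(gold_edges)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example for the scenario "bake a cake".
events = ["gather ingredients", "preheat oven", "mix batter", "bake"]
edges = [
    ("gather ingredients", "mix batter"),
    ("preheat oven", "bake"),
    ("mix batter", "bake"),
]
```

In this example "preheat oven" and "mix batter" are left unordered relative to each other, which is what makes the script a partial rather than a total order.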

pdf bib
Speaker Turn Modeling for Dialogue Act Classification
Zihao He | Leili Tavabi | Kristina Lerman | Mohammad Soleymani

Dialogue Act (DA) classification is the task of classifying utterances with respect to the function they serve in a dialogue. Existing approaches to DA classification model utterances without incorporating the turn changes among speakers throughout the dialogue, thereby treating dialogue no differently from non-interactive written text. In this paper, we propose to integrate the turn changes in conversations among speakers when modeling DAs. Specifically, we learn conversation-invariant speaker turn embeddings to represent the speaker turns in a conversation; the learned speaker turn embeddings are then merged with the utterance embeddings for the downstream task of DA classification. With this simple yet effective mechanism, our model is able to capture the semantics from the dialogue content while accounting for different speaker turns in a conversation. Validation on three public benchmark datasets demonstrates the superior performance of our model.

pdf bib
Unsupervised Domain Adaptation Method with Semantic-Structural Alignment for Dependency Parsing
Boda Lin | Mingzheng Li | Si Li | Yong Luo

Unsupervised cross-domain dependency parsing aims to accomplish domain adaptation for dependency parsing without using labeled data in the target domain. Existing methods are often of the pseudo-annotation type, which generates data through self-annotation by the base model and performs iterative training. However, these methods fail to consider changing the model structure for domain adaptation. In addition, the structural information contained in the text cannot be fully exploited. To remedy these drawbacks, we propose a Semantics-Structure Adaptative Dependency Parser (SSADP), which accomplishes unsupervised cross-domain dependency parsing without relying on pseudo-annotation or data selection. In particular, we design two feature extractors to extract semantic and structural features, respectively. For each type of feature, a corresponding feature adaptation method is used to align the domain distributions, which effectively enhances the unsupervised cross-domain transfer capability of the model. We validate the effectiveness of our model by conducting experiments on CODT1 and CTB9, and the results demonstrate that our model achieves consistent performance improvements. In addition, we verify the structure transfer ability of the proposed model by introducing the Weisfeiler-Lehman test.

pdf bib
Devil’s Advocate: Novel Boosting Ensemble Method from Psychological Findings for Text Classification
Hwiyeol Jo | Jaeseo Lim | Byoung-Tak Zhang

We present a new form of ensemble method, Devil’s Advocate, which uses a deliberately dissenting model to force other submodels within the ensemble to better collaborate. Our method consists of two different training settings: one follows the conventional training process (Norm), and the other is trained on artificially generated labels (DevAdv). After training the models, Norm models are fine-tuned through an additional loss function, which uses the DevAdv model as a constraint. In making a final decision, the proposed ensemble model sums the scores of Norm models and then subtracts the score of the DevAdv model. The DevAdv model improves the overall performance of the other models within the ensemble. In addition to being grounded in psychological findings, our ensemble framework shows comparable or improved performance on 5 text classification tasks when compared to conventional ensemble methods.
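The final decision rule stated above (sum the Norm models' scores, then subtract the DevAdv model's score) can be sketched in a few lines. Only that rule comes from the abstract; the training and fine-tuning stages are not shown, and the function name and the per-class score layout are assumptions made for this example.

```python
def devils_advocate_predict(norm_scores, devadv_scores):
    """Combine per-class scores: sum over Norm models minus the DevAdv model.

    norm_scores: list of per-class score lists, one list per Norm model.
    devadv_scores: per-class score list from the single DevAdv model.
    Returns (predicted_class_index, combined_scores).
    """
    n_classes = len(devadv_scores)
    if any(len(scores) != n_classes for scores in norm_scores):
        raise ValueError("all models must score the same number of classes")
    combined = [
        sum(scores[c] for scores in norm_scores) - devadv_scores[c]
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: combined[c])
    return predicted, combined
```

With two Norm models scoring `[0.6, 0.4]` and `[0.5, 0.5]`, the plain sum favors class 0; if the dissenting model scores `[0.7, 0.3]`, subtracting it flips the decision to class 1.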

pdf bib
SideControl: Controlled Open-domain Dialogue Generation via Additive Side Networks
Wanyu Du | Yangfeng Ji

Transformer-based pre-trained language models boost the performance of open-domain dialogue systems. Prior work leverages Transformer-based pre-trained language models to generate texts with desired attributes in two general approaches: (1) gradient-based methods: updating all latent representations of pre-trained models with gradients from attribute models; (2) weighted-decoding methods: re-ranking beam candidates from pre-trained models with attribute functions. However, gradient-based methods incur high computation cost and can easily overfit small training sets, while weighted-decoding methods are inherently constrained by the low-variance high-bias pre-trained model. In this work, we propose a novel approach to control the generation of Transformer-based pre-trained language models: the SideControl framework, which leverages a novel control attributes loss to incorporate useful control signals, and is shown to perform well with very limited training samples. We evaluate our proposed method on two benchmark open-domain dialogue datasets, and results show that the SideControl framework has better controllability, higher generation quality and better sample-efficiency than existing gradient-based and weighted-decoding baselines.

pdf bib
Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of Pre-trained Models’ Transferability
Wei-Tsung Kao | Hung-yi Lee

This paper investigates whether the power of models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications. To verify pre-trained models’ transferability, we test the pre-trained models on text classification tasks with mismatched token meanings, and on real-world non-text token sequence classification data, including amino acid, DNA, and music. We find that even on non-text data, the models pre-trained on text converge faster and perform better than the randomly initialized models, and perform only slightly worse than the models using task-specific knowledge. We also find that the representations of the text and non-text pre-trained models share non-trivial similarities.

pdf bib
Geo-BERT Pre-training Model for Query Rewriting in POI Search
Xiao Liu | Juan Hu | Qi Shen | Huan Chen

Query Rewriting (QR) is proposed to solve the problem of the word mismatch between queries and documents in Web search. Existing approaches usually model QR with an end-to-end sequence-to-sequence (seq2seq) model. The state-of-the-art Transformer-based models can effectively learn textual semantics from user session logs, but they often ignore users’ geographic location information, which is crucial for the Point-of-Interest (POI) search of map services. In this paper, we propose a pre-training model, called Geo-BERT, to integrate semantics and geographic information in the pre-trained representations of POIs. First, we simulate POI distribution in the real world as a graph, in which nodes represent POIs and multiple geographic granularities. Then we use graph representation learning methods to get geographic representations. Finally, we train a BERT-like pre-training model with text and POIs’ graph embeddings to get an integrated representation of both geographic and semantic information, and apply it to QR for POI search. The proposed model achieves excellent accuracy on a wide range of real-world datasets of map services.

pdf bib
Leveraging Bidding Graphs for Advertiser-Aware Relevance Modeling in Sponsored Search
Shuxian Bi | Chaozhuo Li | Xiao Han | Zheng Liu | Xing Xie | Haizhen Huang | Zengxuan Wen

Recently, sponsored search has become one of the most lucrative channels for marketing. As the fundamental basis of sponsored search, relevance modeling has attracted increasing attention due to its tremendous practical value. Most existing methods rely solely on the query-keyword pairs. However, keywords are usually short texts with scarce semantic information, which may not precisely reflect the underlying advertising intents. In this paper, we investigate the novel problem of advertiser-aware relevance modeling, which leverages the advertisers’ information to bridge the gap between the search intents and advertising purposes. Our motivation lies in incorporating the unsupervised bidding behaviors as complementary graphs to learn desirable advertiser representations. We further propose a Bidding-Graph augmented Triple-based Relevance model (BGTR) with three towers to deeply fuse the bidding graphs and semantic textual data. Empirically, we evaluate the BGTR model on a large industry dataset, and the experimental results consistently demonstrate its superiority.

pdf bib
GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation
Kang Min Yoo | Dongju Park | Jaewook Kang | Sang-Woo Lee | Woomyoung Park

Large-scale language models such as GPT-3 are excellent few-shot learners, allowing them to be controlled via natural text prompts. Recent studies report that prompt-based direct classification eliminates the need for fine-tuning but lacks data and inference scalability. This paper proposes a novel data augmentation technique that leverages large-scale language models to generate realistic text samples from a mixture of real samples. We also propose utilizing soft-labels predicted by the language models, effectively distilling knowledge from the large-scale language models and creating textual perturbations simultaneously. We perform data augmentation experiments on diverse classification tasks and show that our method hugely outperforms existing text augmentation methods. We also conduct experiments on our newly proposed benchmark to show that the augmentation effect is not attributable to memorization alone. Further ablation studies and a qualitative analysis provide more insights into our approach.

pdf bib
Context-aware Entity Typing in Knowledge Graphs
Weiran Pan | Wei Wei | Xian-Ling Mao

Knowledge graph entity typing, which aims to infer entities’ missing types in knowledge graphs, is an important but under-explored issue. This paper proposes a novel method for this task by utilizing entities’ contextual information. Specifically, we design two inference mechanisms: i) N2T: independently use each neighbor of an entity to infer its type; ii) Agg2T: aggregate the neighbors of an entity to infer its type. These mechanisms produce multiple inference results, and an exponentially weighted pooling method is used to generate the final inference result. Furthermore, we propose a novel loss function to alleviate the false-negative problem during training. Experiments on two real-world KGs demonstrate the effectiveness of our method. The source code and data of this paper can be obtained from https://github.com/CCIIPLab/CET.
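The abstract does not define its exponentially weighted pooling, so the following is only one plausible reading, labeled as an assumption: each inference result is weighted by a softmax over the scores themselves, which lets confident results dominate the pooled value. The function name and the temperature-like parameter `alpha` are hypothetical.

```python
import math


def exp_weighted_pool(scores, alpha=1.0):
    """Pool several scores for one candidate type into a single score.

    Assumed form: weights w_i = softmax(alpha * s_i), output = sum_i w_i * s_i.
    alpha = 0 reduces to the plain mean; a large alpha approaches the maximum.
    """
    if not scores:
        raise ValueError("at least one inference result is required")
    # Subtract the maximum exponent for numerical stability.
    shift = max(alpha * s for s in scores)
    weights = [math.exp(alpha * s - shift) for s in scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total
```

Under this reading, a single neighbor that strongly supports a type is not averaged away by many uninformative neighbors, which is one reason such a pooling might be preferred over a plain mean.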

pdf bib
Attribute Alignment: Controlling Text Generation from Pre-trained Language Models
Dian Yu | Zhou Yu | Kenji Sagae

Large language models benefit from training with a large amount of unlabeled text, which gives them increasingly fluent and diverse generation capabilities. However, using these models for text generation that takes into account target attributes, such as sentiment polarity or specific topics, remains a challenge. We propose a simple and flexible method for controlling text generation by aligning disentangled attribute representations. In contrast to recent efforts on training a discriminator to perturb the token level distribution for an attribute, we use the same data to learn an alignment function to guide the pre-trained, non-controlled language model to generate texts with the target attribute without changing the original language model parameters. We evaluate our method on sentiment- and topic-controlled generation, and show large performance gains over previous methods while retaining fluency and diversity.

pdf bib
Generate & Rank: A Multi-task Framework for Math Word Problems
Jianhao Shen | Yichun Yin | Lin Li | Lifeng Shang | Xin Jiang | Ming Zhang | Qun Liu

Solving math word problems (MWPs) is a challenging and critical task in natural language processing. Many recent studies formalize MWP as a generation task and have adopted sequence-to-sequence models to transform problem descriptions into mathematical expressions. However, mathematical expressions are prone to minor mistakes, while the generation objective does not explicitly handle such mistakes. To address this limitation, we devise a new ranking task for MWP and propose Generate & Rank, a multi-task framework based on a generative pre-trained language model. By joint training with generation and ranking, the model learns from its own mistakes and is able to distinguish between correct and incorrect expressions. Meanwhile, we perform tree-based disturbance specially designed for MWP and an online update to boost the ranker. We demonstrate the effectiveness of our proposed method on the benchmark, and the results show that our method consistently outperforms baselines on all datasets. In particular, on the classical Math23k, our method is 7% (78.4% to 85.4%) higher than the state-of-the-art. Code can be found at https://github.com/huawei-noah/noah-research.

pdf bib
MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering
Junjie Wang | Yatai Ji | Jiaqi Sun | Yujiu Yang | Tetsuya Sakai

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answers are either spliced into the questions or utilized as labels only for classification. On the other hand, trilinear models such as the CTI model efficiently utilize the inter-modality information between answers, questions, and images, while ignoring intra-modality information. Inspired by this observation, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), incorporating the attention mechanisms for capturing inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow where a bilinear model reduces the free-form, open-ended VQA problem into a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pre-train MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and VQA-1.0 Multiple Choice task and outperforms bilinear baselines on the VQA-2.0, TDIUC and GQA datasets.

pdf bib