Bingyu Zhu


pdf bib
Syntax-guided Localized Self-attention by Constituency Syntactic Distance
Shengyuan Hou | Jushi Kai | Haotian Xue | Bingyu Zhu | Bo Yuan | Longtao Huang | Xinbing Wang | Zhouhan Lin
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent works have revealed that Transformers are implicitly learning the syntactic information in its lower layers from data, albeit is highly dependent on the quality and scale of the training data. However, learning syntactic information from data is not necessary if we can leverage an external syntactic parser, which provides better parsing quality with well-defined syntactic structures. This could potentially improve Transformer’s performance and sample efficiency. In this work, we propose a syntax-guided localized self-attention for Transformer that allows directly incorporating grammar structures from an external constituency parser. It prohibits the attention mechanism to overweight the grammatically distant tokens over close ones. Experimental results show that our model could consistently improve translation performance on a variety of machine translation datasets, ranging from small to large dataset sizes, and with different source languages.


pdf bib
Counterfactual Adversarial Learning with Representation Interpolation
Wei Wang | Boxin Wang | Ning Shi | Jinfeng Li | Bingyu Zhu | Xiangyu Liu | Rong Zhang
Findings of the Association for Computational Linguistics: EMNLP 2021

Deep learning models exhibit a preference for statistical fitting over logical reasoning. Spurious correlations might be memorized when there exists statistical bias in training data, which severely limits the model performance especially in small data scenarios. In this work, we introduce Counterfactual Adversarial Training framework (CAT) to tackle the problem from a causality perspective. Particularly, for a specific sample, CAT first generates a counterfactual representation through latent space interpolation in an adversarial manner, and then performs Counterfactual Risk Minimization (CRM) on each original-counterfactual pair to adjust sample-wise loss weight dynamically, which encourages the model to explore the true causal effect. Extensive experiments demonstrate that CAT achieves substantial performance improvement over SOTA across different downstream tasks, including sentence classification, natural language inference and question answering.