Wei Zhao


pdf bib
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence
Wei Zhao | Michael Strube | Steffen Eger
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Recently, there has been a growing interest in designing text generation systems from a discourse coherence perspective, e.g., modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics are weak in recognizing coherence, and thus are not reliable in a way to spot the discourse-level improvements of those text generation systems. In this work, we introduce DiscoScore, a parametrized discourse metric, which uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human rated coherence than early discourse metrics, invented a decade ago; (ii) the recent state-of-the-art BARTScore is weak when operated at system level—which is particularly problematic as systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, aiming to understand DiscoScore, we provide justifications to the importance of discourse coherence for evaluation metrics, and explain the superiority of one variant over another. Our code is available at https://github.com/AIPHES/DiscoScore.


pdf bib
SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models
Liang Wang | Wei Zhao | Zhuoyu Wei | Jingming Liu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge graph completion (KGC) aims to reason over known facts and infer the missing links. Text-based methods such as KGBERT (Yao et al., 2019) learn entity representations from natural language descriptions, and have the potential for inductive KGC. However, the performance of text-based methods still largely lag behind graph embedding-based methods like TransE (Bordes et al., 2013) and RotatE (Sun et al., 2019b). In this paper, we identify that the key issue is efficient contrastive learning. To improve the learning efficiency, we introduce three types of negatives: in-batch negatives, pre-batch negatives, and self-negatives which act as a simple form of hard negatives. Combined with InfoNCE loss, our proposed model SimKGC can substantially outperform embedding-based methods on several benchmark datasets. In terms of mean reciprocal rank (MRR), we advance the state-of-the-art by +19% on WN18RR, +6.8% on the Wikidata5M transductive setting, and +22% on the Wikidata5M inductive setting. Thorough analyses are conducted to gain insights into each component. Our code is available at https://github.com/intfloat/SimKGC .


pdf bib
Better than Average: Paired Evaluation of NLP systems
Maxime Peyrard | Wei Zhao | Steffen Eger | Robert West
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instancelevel pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley–Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30% of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.

pdf bib
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
Yang Gao | Steffen Eger | Wei Zhao | Piyawat Lertvittayakumjorn | Marina Fomicheva
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

pdf bib
The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results
Marina Fomicheva | Piyawat Lertvittayakumjorn | Wei Zhao | Steffen Eger | Yang Gao
Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems

In this paper, we introduce the Eval4NLP-2021 shared task on explainable quality estimation. Given a source-translation pair, this shared task requires not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality. We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results. To the best of our knowledge, this is the first shared task on explainable NLP evaluation metrics. Datasets and results are available at https://github.com/eval4nlp/SharedTask2021.

pdf bib
TUDa at WMT21: Sentence-Level Direct Assessment with Adapters
Gregor Geigle | Jonas Stadtmüller | Wei Zhao | Jonas Pfeiffer | Steffen Eger
Proceedings of the Sixth Conference on Machine Translation

This paper presents our submissions to the WMT2021 Shared Task on Quality Estimation, Task 1 Sentence-Level Direct Assessment. While top-performing approaches utilize massively multilingual Transformer-based language models which have been pre-trained on all target languages of the task, the resulting insights are limited, as it is unclear how well the approach performs on languages unseen during pre-training; more problematically, these approaches do not provide any solutions for extending the model to new languages or unseen scripts—arguably one of the objectives of this shared task. In this work, we thus focus on utilizing massively multilingual language models which only partly cover the target languages during their pre-training phase. We extend the model to new languages and unseen scripts using recent adapter-based methods and achieve on par performance or even surpass models pre-trained on the respective languages.

pdf bib
Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast
Liang Wang | Wei Zhao | Jingming Liu
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose to align sentence representations from different languages into a unified embedding space, where semantic similarities (both cross-lingual and monolingual) can be computed with a simple dot product. Pre-trained language models are fine-tuned with the translation ranking task. Existing work (Feng et al., 2020) uses sentences within the same batch as negatives, which can suffer from the issue of easy negatives. We adapt MoCo (He et al., 2020) to further improve the quality of alignment. As the experimental results show, the sentence representations produced by our model achieve the new state-of-the-art on several tasks, including Tatoeba en-zh similarity search (Artetxe andSchwenk, 2019b), BUCC en-zh bitext mining, and semantic textual similarity on 7 datasets.

pdf bib
Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors
Marvin Kaster | Wei Zhao | Steffen Eger
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Evaluation metrics are a key ingredient for progress of text generation systems. In recent years, several BERT-based evaluation metrics have been proposed (including BERTScore, MoverScore, BLEURT, etc.) which correlate much better with human assessment of text generation quality than BLEU or ROUGE, invented two decades ago. However, little is known what these metrics, which are based on black-box language model representations, actually capture (it is typically assumed they model semantic similarity). In this work, we use a simple regression based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap. We show that the different metrics capture all aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE. This exposes limitations of these novelly proposed metrics, which we also highlight in an adversarial test scenario.

pdf bib
Inducing Language-Agnostic Multilingual Representations
Wei Zhao | Steffen Eger | Johannes Bjerva | Isabelle Augenstein
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world. However, they currently require large pretraining corpora or access to typologically similar languages. In this work, we address these obstacles by removing language identity signals from multilingual embeddings. We examine three approaches for this: (i) re-aligning the vector spaces of target languages (all together) to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. We evaluate on XNLI and reference-free MT evaluation across 19 typologically diverse languages. Our findings expose the limitations of these approaches—unlike vector normalization, vector space re-alignment and text normalization do not achieve consistent gains across encoders and languages. Due to the approaches’ additive effects, their combination decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2 points (XLM-R) on average across all tasks and languages, however.


pdf bib
Evaluation of Coreference Resolution Systems Under Adversarial Attacks
Haixia Chai | Wei Zhao | Steffen Eger | Michael Strube
Proceedings of the First Workshop on Computational Approaches to Discourse

A substantial overlap of coreferent mentions in the CoNLL dataset magnifies the recent progress on coreference resolution. This is because the CoNLL benchmark fails to evaluate the ability of coreference resolvers that requires linking novel mentions unseen at train time. In this work, we create a new dataset based on CoNLL, which largely decreases mention overlaps in the entire dataset and exposes the limitations of published resolvers on two aspects—lexical inference ability and understanding of low-level orthographic noise. Our findings show (1) the requirements for embeddings, used in resolvers, and for coreference resolutions are, by design, in conflict and (2) adversarial approaches are sometimes not legitimate to mitigate the obstacles, as they may falsely introduce mention overlaps in adversarial training and test sets, thus giving an inflated impression for the improvements.

pdf bib
SUPERT: Towards New Frontiers in Unsupervised Evaluation Metrics for Multi-Document Summarization
Yang Gao | Wei Zhao | Steffen Eger
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We study unsupervised multi-document summarization evaluation metrics, which require neither human-written reference summaries nor human annotations (e.g. preferences, ratings, etc.). We propose SUPERT, which rates the quality of a summary by measuring its semantic similarity with a pseudo reference summary, i.e. selected salient sentences from the source documents, using contextualized embeddings and soft token alignment techniques. Compared to the state-of-the-art unsupervised evaluation metrics, SUPERT correlates better with human ratings by 18- 39%. Furthermore, we use SUPERT as rewards to guide a neural-based reinforcement learning summarizer, yielding favorable performance compared to the state-of-the-art unsupervised summarizers. All source code is available at https://github.com/yg211/acl20-ref-free-eval.

pdf bib
On the Limitations of Cross-lingual Encoders as Exposed by Reference-Free Machine Translation Evaluation
Wei Zhao | Goran Glavaš | Maxime Peyrard | Yang Gao | Robert West | Steffen Eger
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Evaluation of cross-lingual encoders is usually performed either via zero-shot cross-lingual transfer in supervised downstream tasks or via unsupervised cross-lingual textual similarity. In this paper, we concern ourselves with reference-free machine translation (MT) evaluation where we directly compare source texts to (sometimes low-quality) system translations, which represents a natural adversarial setup for multilingual encoders. Reference-free evaluation holds the promise of web-scale comparison of MT systems. We systematically investigate a range of metrics based on state-of-the-art cross-lingual semantic representations obtained with pretrained M-BERT and LASER. We find that they perform poorly as semantic encoders for reference-free MT evaluation and identify their two key limitations, namely, (a) a semantic mismatch between representations of mutual translations and, more prominently, (b) the inability to punish “translationese”, i.e., low-quality literal translations. We propose two partial remedies: (1) post-hoc re-alignment of the vector spaces and (2) coupling of semantic-similarity based metrics with target-side language modeling. In segment-level MT evaluation, our best metric surpasses reference-based BLEU by 5.7 correlation points.

pdf bib
Regularized Attentive Capsule Network for Overlapped Relation Extraction
Tianyi Liu | Xiangyu Lin | Weijia Jia | Mingliang Zhou | Wei Zhao
Proceedings of the 28th International Conference on Computational Linguistics

Distantly supervised relation extraction has been widely applied in knowledge base construction due to its less requirement of human efforts. However, the automatically established training datasets in distant supervision contain low-quality instances with noisy words and overlapped relations, introducing great challenges to the accurate extraction of relations. To address this problem, we propose a novel Regularized Attentive Capsule Network (RA-CapNet) to better identify highly overlapped relations in each informal sentence. To discover multiple relation features in an instance, we embed multi-head attention into the capsule network as the low-level capsules, where the subtraction of two entities acts as a new form of relation query to select salient features regardless of their positions. To further discriminate overlapped relation features, we devise disagreement regularization to explicitly encourage the diversity among both multiple attention heads and low-level capsules. Extensive experiments conducted on widely used datasets show that our model achieves significant improvements in relation extraction.

pdf bib
Active Testing: An Unbiased Evaluation Method for Distantly Supervised Relation Extraction
Pengshuai Li | Xinsong Zhang | Weijia Jia | Wei Zhao
Findings of the Association for Computational Linguistics: EMNLP 2020

Distant supervision has been a widely used method for neural relation extraction for its convenience of automatically labeling datasets. However, existing works on distantly supervised relation extraction suffer from the low quality of test set, which leads to considerable biased performance evaluation. These biases not only result in unfair evaluations but also mislead the optimization of neural relation extraction. To mitigate this problem, we propose a novel evaluation method named active testing through utilizing both the noisy test set and a few manual annotations. Experiments on a widely used benchmark show that our proposed approach can yield approximately unbiased evaluations for distantly supervised relation extractors.

pdf bib
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Steffen Eger | Yang Gao | Maxime Peyrard | Wei Zhao | Eduard Hovy
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems


pdf bib
Towards Scalable and Reliable Capsule Networks for Challenging NLP Applications
Wei Zhao | Haiyun Peng | Steffen Eger | Erik Cambria | Min Yang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Obstacles hindering the development of capsule networks for challenging NLP applications include poor scalability to large output spaces and less reliable routing processes. In this paper, we introduce: (i) an agreement score to evaluate the performance of routing processes at instance-level; (ii) an adaptive optimizer to enhance the reliability of routing; (iii) capsule compression and partial routing to improve the scalability of capsule networks. We validate our approach on two NLP tasks, namely: multi-label text classification and question answering. Experimental results show that our approach considerably improves over strong competitors on both tasks. In addition, we gain the best results in low-resource settings with few training instances.

pdf bib
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance
Wei Zhao | Maxime Peyrard | Fei Liu | Yang Gao | Christian M. Meyer | Steffen Eger
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

A robust evaluation metric has a profound impact on the development of text generation systems. A desirable metric compares system output against references based on their semantics rather than surface forms. In this paper we investigate strategies to encode system and reference texts to devise a metric that shows a high correlation with human judgment of text quality. We validate our new metric, namely MoverScore, on a number of text generation tasks including summarization, machine translation, image captioning, and data-to-text generation, where the outputs are produced by a variety of neural and non-neural systems. Our findings suggest that metrics combining contextualized representations with a distance measure perform the best. Such metrics also demonstrate strong generalization capability across tasks. For ease-of-use we make our metrics available as web service.

pdf bib
Denoising based Sequence-to-Sequence Pre-training for Text Generation
Liang Wang | Wei Zhao | Ruoyu Jia | Sujian Li | Jingming Liu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

This paper presents a new sequence-to-sequence (seq2seq) pre-training method PoDA (Pre-training of Denoising Autoencoders), which learns representations suitable for text generation tasks. Unlike encoder-only (e.g., BERT) or decoder-only (e.g., OpenAI GPT) pre-training approaches, PoDA jointly pre-trains both the encoder and decoder by denoising the noise-corrupted text, and it also has the advantage of keeping the network architecture unchanged in the subsequent fine-tuning stage. Meanwhile, we design a hybrid model of Transformer and pointer-generator networks as the backbone architecture for PoDA. We conduct experiments on two text generation tasks: abstractive summarization, and grammatical error correction. Results on four datasets show that PoDA can improve model performance over strong baselines without using any task-specific techniques and significantly speed up convergence.

pdf bib
Improving Grammatical Error Correction via Pre-Training a Copy-Augmented Architecture with Unlabeled Data
Wei Zhao | Liang Wang | Kewei Shen | Ruoyu Jia | Jingming Liu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Neural machine translation systems have become state-of-the-art approaches for Grammatical Error Correction (GEC) task. In this paper, we propose a copy-augmented architecture for the GEC task by copying the unchanged words from the source sentence to the target sentence. Since the GEC suffers from not having enough labeled training data to achieve high accuracy. We pre-train the copy-augmented architecture with a denoising auto-encoder using the unlabeled One Billion Benchmark and make comparisons between the fully pre-trained model and a partially pre-trained model. It is the first time copying words from the source context and fully pre-training a sequence to sequence model are experimented on the GEC task. Moreover, We add token-level and sentence-level multi-task learning for the GEC task. The evaluation results on the CoNLL-2014 test set show that our approach outperforms all recently published state-of-the-art results by a large margin.


pdf bib
Investigating Capsule Networks with Dynamic Routing for Text Classification
Wei Zhao | Jianbo Ye | Min Yang | Zeyang Lei | Suofei Zhang | Zhou Zhao
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this study, we explore capsule networks with dynamic routing for text classification. We propose three strategies to stabilize the dynamic routing process to alleviate the disturbance of some noise capsules which may contain “background” information or have not been successfully trained. A series of experiments are conducted with capsule networks on six text classification benchmarks. Capsule networks achieve state of the art on 4 out of 6 datasets, which shows the effectiveness of capsule networks for text classification. We additionally show that capsule networks exhibit significant improvement when transfer single-label to multi-label text classification over strong baseline methods. To the best of our knowledge, this is the first work that capsule networks have been empirically investigated for text modeling.

pdf bib
Yuanfudao at SemEval-2018 Task 11: Three-way Attention and Relational Knowledge for Commonsense Machine Comprehension
Liang Wang | Meng Sun | Wei Zhao | Kewei Shen | Jingming Liu
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes our system for SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge. We use Three-way Attentive Networks (TriAN) to model interactions between the passage, question and answers. To incorporate commonsense knowledge, we augment the input with relation embedding from the graph of general knowledge ConceptNet. As a result, our system achieves state-of-the-art performance with 83.95% accuracy on the official test data. Code is publicly available at https://github.com/intfloat/commonsense-rc.

pdf bib
Multi-Perspective Context Aggregation for Semi-supervised Cloze-style Reading Comprehension
Liang Wang | Sujian Li | Wei Zhao | Kewei Shen | Meng Sun | Ruoyu Jia | Jingming Liu
Proceedings of the 27th International Conference on Computational Linguistics

Cloze-style reading comprehension has been a popular task for measuring the progress of natural language understanding in recent years. In this paper, we design a novel multi-perspective framework, which can be seen as the joint training of heterogeneous experts and aggregate context information from different perspectives. Each perspective is modeled by a simple aggregation module. The outputs of multiple aggregation modules are fed into a one-timestep pointer network to get the final answer. At the same time, to tackle the problem of insufficient labeled data, we propose an efficient sampling mechanism to automatically generate more training examples by matching the distribution of candidates between labeled and unlabeled data. We conduct our experiments on a recently released cloze-test dataset CLOTH (Xie et al., 2017), which consists of nearly 100k questions designed by professional teachers. Results show that our method achieves new state-of-the-art performance over previous strong baselines.

pdf bib
Aspect and Sentiment Aware Abstractive Review Summarization
Min Yang | Qiang Qu | Ying Shen | Qiao Liu | Wei Zhao | Jia Zhu
Proceedings of the 27th International Conference on Computational Linguistics

Review text has been widely studied in traditional tasks such as sentiment analysis and aspect extraction. However, to date, no work is towards the abstractive review summarization that is essential for business organizations and individual consumers to make informed decisions. This work takes the lead to study the aspect/sentiment-aware abstractive review summarization by exploring multi-factor attentions. Specifically, we propose an interactive attention mechanism to interactively learns the representations of context words, sentiment words and aspect words within the reviews, acted as an encoder. The learned sentiment and aspect representations are incorporated into the decoder to generate aspect/sentiment-aware review summaries via an attention fusion network. In addition, the abstractive summarizer is jointly trained with the text categorization task, which helps learn a category-specific text encoder, locating salient aspect information and exploring the variations of style and wording of content with respect to different text categories. The experimental results on a real-life dataset demonstrate that our model achieves impressive results compared to other strong competitors.


pdf bib
Identifying and Tracking Sentiments and Topics from Social Media Texts during Natural Disasters
Min Yang | Jincheng Mei | Heng Ji | Wei Zhao | Zhou Zhao | Xiaojun Chen
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We study the problem of identifying the topics and sentiments and tracking their shifts from social media texts in different geographical regions during emergencies and disasters. We propose a location-based dynamic sentiment-topic model (LDST) which can jointly model topic, sentiment, time and Geolocation information. The experimental results demonstrate that LDST performs very well at discovering topics and sentiments from social media and tracking their shifts in different geographical regions during emergencies and disasters. We will release the data and source code after this work is published.