Hao Wu


2022

Compilable Neural Code Generation with Compiler Feedback
Xin Wang | Yasheng Wang | Yao Wan | Fei Mi | Yitong Li | Pingyi Zhou | Jin Liu | Hao Wu | Xin Jiang | Qun Liu
Findings of the Association for Computational Linguistics: ACL 2022

Automatically generating compilable programs with (or without) natural language descriptions has always been a touchstone problem for computational linguistics and automated software engineering. Existing deep-learning approaches model code generation as text generation, either constrained by grammar structures in the decoder or driven by language models pre-trained on large-scale code corpora (e.g., CodeGPT, PLBART, and CodeT5). However, few of them account for the compilability of the generated programs. To improve compilability, this paper proposes COMPCODER, a three-stage pipeline that uses compiler feedback for compilable code generation, comprising language model fine-tuning, compilability reinforcement, and compilability discrimination. Comprehensive experiments on two code generation tasks demonstrate the effectiveness of our proposed approach: compared with the state-of-the-art CodeGPT, the average compilation success rate improves from 44.18 to 89.18 in code completion and from 70.3 to 96.2 in text-to-code generation.
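
As a rough illustration of the compilation success rate reported above (not the COMPCODER pipeline itself), the sketch below measures the percentage of generated snippets that Python's built-in `compile()` accepts. It assumes the generated programs are Python; the function names `is_compilable` and `compilation_success_rate` and the sample snippets are hypothetical.

```python
from typing import Iterable


def is_compilable(source: str) -> bool:
    """Return True if Python can compile `source` to bytecode (nothing is executed)."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers e.g. null bytes
        # on some Python versions.
        return False


def compilation_success_rate(programs: Iterable[str]) -> float:
    """Percentage of programs that compile; 0.0 for an empty input."""
    programs = list(programs)
    if not programs:
        return 0.0
    return 100.0 * sum(is_compilable(p) for p in programs) / len(programs)


# Hypothetical model outputs: two well-formed snippets and two malformed ones.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b)\n    return a + b\n",   # missing colon
    "print('hello')\n",
    "for i in range(3) print(i)\n",        # malformed loop header
]
rate = compilation_success_rate(samples)
print(f"{rate:.1f}% of samples compile")
```

A binary signal of this kind is the sort of compiler feedback that could serve as a reward or a discriminator label; how COMPCODER actually uses it is described in the paper, not here.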

CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training
Xin Wang | Yasheng Wang | Yao Wan | Jiawei Wang | Pingyi Zhou | Li Li | Hao Wu | Jin Liu
Findings of the Association for Computational Linguistics: NAACL 2022

Recent years have witnessed increasing interest in code representation learning, which aims to represent the semantics of source code into distributed vectors. Currently, various works have been proposed to represent the complex semantics of source code from different views, including plain text, Abstract Syntax Tree (AST), and several kinds of code graphs (e.g., Control/Data Flow Graph). However, most of them only consider a single view of source code independently, ignoring the correspondences among different views. In this paper, we propose to integrate different views with the natural-language description of source code into a unified framework with Multi-View contrastive Pre-training, and name our model as CODE-MVP. Specifically, we first extract multiple code views using compiler tools, and learn the complementary information among them under a contrastive learning framework. Inspired by the type checking in compilation, we also design a fine-grained type inference objective in the pre-training. Experiments on three downstream tasks over five datasets demonstrate the superiority of CODE-MVP when compared with several state-of-the-art baselines. For example, we achieve 2.4/2.3/1.1 gain in terms of MRR/MAP/Accuracy metrics on natural language code retrieval, code similarity, and code defect detection tasks, respectively.

LiGCN: Label-interpretable Graph Convolutional Networks for Multi-label Text Classification
Irene Li | Aosong Feng | Hao Wu | Tianxiao Li | Toyotaro Suzumura | Ruihai Dong
Proceedings of the 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022)

Multi-label text classification (MLTC) is an attractive and challenging task in natural language processing (NLP). Compared with single-label text classification, MLTC has a wider range of applications in practice. In this paper, we propose a label-interpretable graph convolutional network model to solve the MLTC problem by modeling tokens and labels as nodes in a heterogeneous graph. In this way, we are able to take into account multiple relationships including token-level relationships. Besides, the model allows better interpretability for predicted labels as the token-label edges are exposed. We evaluate our method on four real-world datasets and it achieves competitive scores against selected baseline methods. Specifically, this model achieves a gain of 0.14 on the F1 score in the small label set MLTC, and 0.07 in the large label set scenario.

A Meta-framework for Spatiotemporal Quantity Extraction from Text
Qiang Ning | Ben Zhou | Hao Wu | Haoruo Peng | Chuchu Fan | Matt Gardner
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

News events are often associated with quantities (e.g., the number of COVID-19 patients or the number of arrests in a protest), and it is often important to extract their type, time, and location from unstructured text in order to analyze these quantity events. This paper thus formulates the NLP problem of spatiotemporal quantity extraction, and proposes the first meta-framework for solving it. This meta-framework contains a formalism that decomposes the problem into several information extraction tasks, a shareable crowdsourcing pipeline, and transformer-based baseline models. We demonstrate the meta-framework in three domains—the COVID-19 pandemic, Black Lives Matter protests, and 2020 California wildfires—to show that the formalism is general and extensible, the crowdsourcing pipeline facilitates fast and high-quality data annotation, and the baseline system can handle spatiotemporal quantity extraction well enough to be practically useful. We release all resources for future research on this topic at https://github.com/steqe.

Exploiting Unlabeled Data for Target-Oriented Opinion Words Extraction
Yidong Wang | Hao Wu | Ao Liu | Wenxin Hou | Zhen Wu | Jindong Wang | Takahiro Shinozaki | Manabu Okumura | Yue Zhang
Proceedings of the 29th International Conference on Computational Linguistics

Target-oriented Opinion Words Extraction (TOWE) is a fine-grained sentiment analysis task that aims to extract the corresponding opinion words of a given opinion target from the sentence. Recently, deep learning approaches have made remarkable progress on this task. Nevertheless, the TOWE task still suffers from the scarcity of training data due to the expensive data annotation process. Limited labeled data increase the risk of distribution shift between test data and training data. In this paper, we propose exploiting massive unlabeled data to reduce the risk by increasing the exposure of the model to varying distribution shifts. Specifically, we propose a novel Multi-Grained Consistency Regularization (MGCR) method to make use of unlabeled data and design two filters specifically for TOWE to filter noisy data at different granularity. Extensive experimental results on four TOWE benchmark datasets indicate the superiority of MGCR compared with current state-of-the-art methods. The in-depth analysis also demonstrates the effectiveness of the different-granularity filters.

2021

Cold Start Problem For Automated Live Video Comments
Hao Wu | François Pitie | Gareth Jones
Proceedings of the Third Workshop on Multimodal Artificial Intelligence

Live video comments, or “danmu”, are an emerging feature on Asian online video platforms. Danmu are time-synchronous comments that are overlaid on a video playback. These comments uniquely enrich the experience and engagement of their users, and they have become a determining factor in the popularity of the videos. Similar to the “cold start problem” in recommender systems, a video will only start to attract attention when sufficient danmu comments have been posted on it. We study this video cold start problem and examine how new comments can be generated automatically on less-commented videos. We propose to predict the danmu comments by exploiting a multi-modal combination of the video visual content, subtitles, audio signals, and any surrounding comments (when they exist). Our method fuses these modalities in a transformer network which is then trained for different comment density scenarios. We evaluate our proposed system through both a retrieval-based evaluation method and human judgement. Results show that our proposed system improves significantly over state-of-the-art methods.

Fusing Label Embedding into BERT: An Efficient Improvement for Text Classification
Yijin Xiong | Yukun Feng | Hao Wu | Hidetaka Kamigaito | Manabu Okumura
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

XRJL-HKUST at SemEval-2021 Task 4: WordNet-Enhanced Dual Multi-head Co-Attention for Reading Comprehension of Abstract Meaning
Yuxin Jiang | Ziyi Shou | Qijun Wang | Hao Wu | Fangzhen Lin
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents our submitted system to SemEval 2021 Task 4: Reading Comprehension of Abstract Meaning. Our system uses a large pre-trained language model as the encoder and an additional dual multi-head co-attention layer to strengthen the relationship between passages and question-answer pairs, following the current state-of-the-art model DUMA. The main difference is that we stack the passage-question and question-passage attention modules instead of computing them in parallel, in order to simulate a re-considering process. We also add a layer normalization module to improve the performance of our model. Furthermore, to incorporate prior knowledge about abstract concepts, we retrieve the definitions of candidate answers from WordNet and feed them to the model as extra inputs. Our system, called WordNet-enhanced DUal Multi-head Co-Attention (WN-DUMA), achieves 86.67% and 89.99% accuracy on the official blind test set of subtask 1 and subtask 2, respectively.

2020

TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions
Qiang Ning | Hao Wu | Rujun Han | Nanyun Peng | Matt Gardner | Dan Roth
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer questions such as “what happened before/after [some event]?” We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance.

Semi-Supervised Bilingual Lexicon Induction with Two-way Interaction
Xu Zhao | Zihao Wang | Hao Wu | Yong Zhang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Semi-supervision is a promising paradigm for Bilingual Lexicon Induction (BLI) with limited annotations. However, previous semi-supervised methods do not fully utilize the knowledge hidden in annotated and non-annotated data, which hinders further improvement of their performance. In this paper, we propose a new semi-supervised BLI framework to encourage the interaction between the supervised signal and unsupervised alignment. We design two message-passing mechanisms to transfer knowledge between annotated and non-annotated data, named prior optimal transport and bi-directional lexicon update respectively. Then, we perform semi-supervised learning based on a cyclic or a parallel parameter feeding routine to update our models. Our framework is a general framework that can incorporate any supervised and unsupervised BLI methods based on optimal transport. Experimental results on MUSE and VecMap datasets show significant improvement of our models. Ablation study also proves that the two-way interaction between the supervised signal and unsupervised alignment accounts for the gain of the overall performance. Results on distant language pairs further illustrate the advantage and robustness of our proposed method.

Easy, Reproducible and Quality-Controlled Data Collection with CROWDAQ
Qiang Ning | Hao Wu | Pradeep Dasigi | Dheeru Dua | Matt Gardner | Robert L. Logan IV | Ana Marasović | Zhen Nie
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

High-quality and large-scale data are key to success for AI systems. However, large-scale data annotation efforts are often confronted with a set of common challenges: (1) designing a user-friendly annotation interface; (2) training enough annotators efficiently; and (3) reproducibility. To address these problems, we introduce CROWDAQ, an open-source platform that standardizes the data collection pipeline with customizable user-interface components, automated annotator qualification, and saved pipelines in a re-usable format. We show that CROWDAQ simplifies data annotation significantly on a diverse set of data collection use cases and we hope it will be a convenient tool for the community.

Warren at SemEval-2020 Task 4: ALBERT and Multi-Task Learning for Commonsense Validation
Yuhang Wu | Hao Wu
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes our system for subtask A of SemEval 2020 Shared Task 4. We propose a reinforcement learning model based on Multi-Task Learning (MTL) to enhance the prediction ability of commonsense validation. The experimental results demonstrate that our system outperforms the single-task text classification model. We combine MTL with the pre-trained ALBERT model to achieve an accuracy of 0.904, and our model is ranked 16th among the 45 teams on the final leaderboard of the competition.

A Relaxed Matching Procedure for Unsupervised BLI
Xu Zhao | Zihao Wang | Yong Zhang | Hao Wu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recently, unsupervised Bilingual Lexicon Induction (BLI) without any parallel corpus has attracted much research interest. One of the crucial parts of methods for the BLI task is the matching procedure. Previous works impose too strong a constraint on the matching, which leads to many counterintuitive translation pairings. We therefore propose a relaxed matching procedure to find a more precise matching between two languages. We also find that aligning the source and target language embedding spaces bidirectionally brings significant improvement. We follow the previous iterative framework to conduct experiments. Results on a standard benchmark demonstrate the effectiveness of our proposed method, which substantially outperforms previous unsupervised methods.

2019

ZQM at SemEval-2019 Task9: A Single Layer CNN Based on Pre-trained Model for Suggestion Mining
Qimin Zhou | Zhengxin Zhang | Hao Wu | Linmao Wang
Proceedings of the 13th International Workshop on Semantic Evaluation

This paper describes our system that competed at SemEval 2019 Task 9 - SubTask A: “Suggestion Mining from Online Reviews and Forums”. Our system fuses a convolutional neural network with the latest BERT model to conduct suggestion mining. The input to the convolutional neural network is the embedding vectors drawn from the pre-trained BERT model. To enhance the effectiveness of the whole system, the pre-trained BERT model is fine-tuned on the provided datasets before the embedding vectors are extracted. Empirical results show the effectiveness of our model, which obtained 9th position out of 34 teams with an F1 score of 0.715.

Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds
John P. Lalor | Hao Wu | Hong Yu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable information about model performance and behavior. Traditionally, IRT models are learned using human response pattern (RP) data, presenting a significant bottleneck for large data sets like those required for training deep neural networks (DNNs). In this work we propose learning IRT models using RPs generated from artificial crowds of DNN models. We demonstrate the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. Parameters learned from human and machine RPs for natural language inference and sentiment analysis exhibit medium to large positive correlations. We demonstrate a use-case for latent difficulty item parameters, namely training set filtering, and show that using difficulty to sample training data outperforms baseline methods. Finally, we highlight cases where human expectation about item difficulty does not match difficulty as estimated from the machine RPs.

2018

NLP at IEST 2018: BiLSTM-Attention and LSTM-Attention via Soft Voting in Emotion Classification
Qimin Zhou | Hao Wu
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This paper describes our method that competed at the WASSA 2018 Implicit Emotion Shared Task. The goal of this task is to classify the emotions of excluded words in tweets into six different classes: sad, joy, disgust, surprise, anger and fear. For this, we examine a BiLSTM architecture with attention mechanism (BiLSTM-Attention) and an LSTM architecture with attention mechanism (LSTM-Attention), and try different dropout rates based on these two models. We then exploit an ensemble of these methods to give the final prediction, which improves the model performance significantly compared with the baseline model. The proposed method achieves 7th position out of 30 teams and outperforms the baseline method by 12.5% in terms of macro F1.
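
The soft-voting step named in the title can be sketched as follows. This is a minimal illustration that assumes unweighted averaging of per-class probabilities; the helper names `soft_vote` and `predict` and the probability values are hypothetical, not taken from the paper.

```python
from typing import List, Sequence

# The six emotion classes listed in the abstract.
EMOTIONS = ["sad", "joy", "disgust", "surprise", "anger", "fear"]


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probabilities produced by several models."""
    if not prob_sets:
        raise ValueError("need at least one probability vector")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all probability vectors must have the same length")
    n_models = len(prob_sets)
    return [sum(p[c] for p in prob_sets) / n_models for c in range(n_classes)]


def predict(prob_sets: Sequence[Sequence[float]]) -> str:
    """Return the emotion with the highest averaged probability."""
    avg = soft_vote(prob_sets)
    best = max(range(len(avg)), key=avg.__getitem__)
    return EMOTIONS[best]


# Hypothetical outputs of the two models for one tweet.
bilstm_probs = [0.10, 0.40, 0.05, 0.05, 0.35, 0.05]  # alone, would pick "joy"
lstm_probs = [0.05, 0.20, 0.05, 0.05, 0.60, 0.05]    # alone, would pick "anger"
label = predict([bilstm_probs, lstm_probs])
print(label)
```

Unlike hard (majority) voting, averaging probabilities lets a confident model outweigh an uncertain one, which is why the two models' disagreement above resolves toward the more confident prediction.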

NLPZZX at SemEval-2018 Task 1: Using Ensemble Method for Emotion and Sentiment Intensity Determination
Zhengxin Zhang | Qimin Zhou | Hao Wu
Proceedings of the 12th International Workshop on Semantic Evaluation

In this paper, we put forward a system that competed at SemEval-2018 Task 1: “Affect in Tweets”. Our system uses a simple yet effective ensemble method which combines several neural network components. We participate in two subtasks for English tweets: EI-reg and V-reg. For the two subtasks, different combinations of neural components are examined. For EI-reg, our system achieves a Pearson correlation coefficient of 0.727 (all instances) and 0.555 (0.5-1). For V-reg, the corresponding scores are 0.835 and 0.670.
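
The two scores quoted for each subtask can be computed along these lines. The sketch assumes that “0.5-1” denotes the subset of instances whose gold score is at least 0.5; the function names and the toy data are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def pearson_upper_range(gold: Sequence[float], pred: Sequence[float],
                        threshold: float = 0.5) -> float:
    """Pearson correlation restricted to instances with gold score >= threshold."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g >= threshold]
    if len(pairs) < 2:
        raise ValueError("fewer than two instances at or above the threshold")
    kept_gold, kept_pred = zip(*pairs)
    return pearson(kept_gold, kept_pred)


# Toy gold intensities and predictions in [0, 1].
gold = [0.1, 0.3, 0.5, 0.7, 0.9]
pred = [0.2, 0.25, 0.55, 0.6, 0.95]
print("all instances:", pearson(gold, pred))
print("0.5-1 subset: ", pearson_upper_range(gold, pred))
```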

Zewen at SemEval-2018 Task 1: An Ensemble Model for Affect Prediction in Tweets
Zewen Chi | Heyan Huang | Jiangui Chen | Hao Wu | Ran Wei
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper presents a method for Affect in Tweets, which is the task of automatically determining the intensity of emotions and the intensity of sentiment of tweets. The term affect refers to emotion-related categories such as anger, fear, etc. The intensity of emotions needs to be quantified as a real-valued score in [0, 1]. We propose an ensemble system including four different deep learning methods: CNN, Bidirectional LSTM (BLSTM), LSTM-CNN and a CNN-based Attention model (CA). Our system gets an average Pearson correlation score of 0.682 in the subtask EI-reg and an average Pearson correlation score of 0.784 in subtask V-reg, which ranks 17th among 48 systems in EI-reg and 19th among 38 systems in V-reg.

Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study
John P. Lalor | Hao Wu | Tsendsuren Munkhdalai | Hong Yu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Interpreting the performance of deep learning models beyond test set accuracy is challenging. Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally. In this work we examine the impact of a test set question’s difficulty to determine if there is a relationship between difficulty and performance. We model difficulty using well-studied psychometric methods on human response patterns. Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is impacted by the question’s difficulty. In addition, as DNNs are trained on larger datasets, easy questions start to have a higher probability of being answered correctly than harder questions.

Improving Temporal Relation Extraction with a Globally Acquired Statistical Resource
Qiang Ning | Hao Wu | Haoruo Peng | Dan Roth
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Extracting temporal relations (before, after, overlapping, etc.) is a key aspect of understanding events described in natural language. We argue that this task would gain from the availability of a resource that provides prior knowledge in the form of the temporal order that events usually follow. This paper develops such a resource – a probabilistic knowledge base acquired in the news domain – by extracting temporal relations between events from the New York Times (NYT) articles over a 20-year span (1987–2007). We show that existing temporal extraction systems can be improved via this resource. As a byproduct, we also show that interesting statistics can be retrieved from this resource, which can potentially benefit other time-aware tasks. The proposed system and resource are both publicly available.

A Multi-Axis Annotation Scheme for Event Temporal Relations
Qiang Ning | Hao Wu | Dan Roth
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing temporal relation (TempRel) annotation schemes often have low inter-annotator agreements (IAA) even between experts, suggesting that the current annotation task needs a better definition. This paper proposes a new multi-axis modeling to better capture the temporal structure of events. In addition, we identify that event end-points are a major source of confusion in annotation, so we also propose to annotate TempRels based on start-points only. A pilot expert annotation effort using the proposed scheme shows significant improvement in IAA from the conventional 60’s to 80’s (Cohen’s Kappa). This better-defined annotation scheme further enables the use of crowdsourcing to alleviate the labor intensity for each annotator. We hope that this work can foster more interesting studies towards event understanding.
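
For readers unfamiliar with the agreement statistic cited above, Cohen's Kappa for two annotators can be computed as in the sketch below. The label names and the annotations are hypothetical and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each annotator's marginal label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / n ** 2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical TempRel labels from two annotators on six event pairs.
annotator_1 = ["before", "before", "after", "after", "vague", "before"]
annotator_2 = ["before", "after", "after", "after", "vague", "before"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on five of six items (p_o = 5/6), chance agreement is 13/36, and kappa works out to 17/23.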

Joint Reasoning for Temporal and Causal Relations
Qiang Ning | Zhili Feng | Hao Wu | Dan Roth
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding temporal and causal relations between events is a fundamental natural language understanding task. Because a cause must occur earlier than its effect, temporal and causal relations are closely related and one relation often dictates the value of the other. However, limited attention has been paid to studying these two relations jointly. This paper presents a joint inference framework for them using constrained conditional models (CCMs). Specifically, we formulate the joint problem as an integer linear programming (ILP) problem, enforcing constraints that are inherent in the nature of time and causality. We show that the joint inference framework results in statistically significant improvement in the extraction of both temporal and causal relations from text.
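
The constraint that a cause must precede its effect can be illustrated with a toy joint decoder for a single event pair. Note that this sketch replaces the ILP formulation with brute-force enumeration over the label space, which is only feasible for tiny problems; the label sets, scores, and function names are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

TEMPORAL = ("before", "after")
CAUSAL = ("causes", "caused_by", "none")


def consistent(temporal: str, causal: str) -> bool:
    """Enforce that a cause occurs earlier than its effect."""
    if causal == "causes":
        return temporal == "before"
    if causal == "caused_by":
        return temporal == "after"
    return True  # no causal link, so any temporal order is allowed


def joint_decode(t_scores: Dict[str, float],
                 c_scores: Dict[str, float]) -> Tuple[str, str]:
    """Pick the highest-scoring (temporal, causal) pair that satisfies the constraint."""
    candidates = [(t, c) for t, c in product(TEMPORAL, CAUSAL) if consistent(t, c)]
    return max(candidates, key=lambda tc: t_scores[tc[0]] + c_scores[tc[1]])


# Hypothetical local classifier scores for one pair of events (e1, e2).
t_scores = {"before": 0.45, "after": 0.55}
c_scores = {"causes": 0.80, "caused_by": 0.05, "none": 0.15}

# Decoding each relation independently yields ("after", "causes"), which is
# inconsistent; joint decoding lets the confident causal score fix the order.
result = joint_decode(t_scores, c_scores)
print(result)
```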

2017

BIT at SemEval-2017 Task 1: Using Semantic Information Space to Evaluate Semantic Textual Similarity
Hao Wu | Heyan Huang | Ping Jian | Yuhang Guo | Chao Su
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This paper presents three systems for semantic textual similarity (STS) evaluation at the SemEval-2017 STS task. One is an unsupervised system and the other two are supervised systems which simply employ the unsupervised one. All our systems mainly depend on the semantic information space (SIS), which is constructed based on the semantic hierarchical taxonomy in WordNet, to compute non-overlapping information content (IC) of sentences. Our team ranked 2nd among 31 participating teams by the primary score of Pearson correlation coefficient (PCC) mean of 7 tracks and achieved the best performance on the Track 1 (AR-AR) dataset.

A Parallel Recurrent Neural Network for Language Modeling with POS Tags
Chao Su | Heyan Huang | Shumin Shi | Yuhang Guo | Hao Wu
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

2016

BIT at SemEval-2016 Task 1: Sentence Similarity Based on Alignments and Vector with the Weight of Information Content
Hao Wu | Heyan Huang | Wenpeng Lu
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Building an Evaluation Scale using Item Response Theory
John P. Lalor | Hao Wu | Hong Yu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2014

ILLINOISCLOUDNLP: Text Analytics Services in the Cloud
Hao Wu | Zhiye Fei | Aaron Dai | Mark Sammons | Dan Roth | Stephen Mayhew
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Natural Language Processing (NLP) continues to grow in popularity in a range of research and commercial applications. However, installing, maintaining, and running NLP tools can be time consuming, and many commercial and research end users have only intermittent need for large processing capacity. This paper describes ILLINOISCLOUDNLP, an on-demand framework built around NLPCURATOR and Amazon Web Services’ Elastic Compute Cloud (EC2). This framework provides a simple interface to end users via which they can deploy one or more NLPCURATOR instances on EC2, upload plain text documents, specify a set of Text Analytics tools (NLP annotations) to apply, and process and store or download the processed data. It can also allow end users to use a model trained on their own data: ILLINOISCLOUDNLP takes care of training, hosting, and applying it to new data just as it does with existing models within NLPCURATOR. As a representative use case, we describe our use of ILLINOISCLOUDNLP to process 3.05 million documents used in the 2012 and 2013 Text Analysis Conference Knowledge Base Population tasks at a relatively deep level of processing, in approximately 20 hours, at an approximate cost of US$500; this is about 20 times faster than doing so on a single server and requires no human supervision and no NLP or Machine Learning expertise.