Rui Wang


2021

Advances and Challenges in Unsupervised Neural Machine Translation
Rui Wang | Hai Zhao
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Unsupervised cross-lingual language representation initialization methods, together with mechanisms such as denoising and back-translation, have advanced unsupervised neural machine translation (UNMT), which has achieved impressive results. Meanwhile, there are still several challenges for UNMT. This tutorial first introduces the background and the latest progress of UNMT. We then examine a number of challenges to UNMT and give empirical results on how well the technology currently holds up.

A Unified Span-Based Approach for Opinion Mining with Syntactic Constituents
Qingrong Xia | Bo Zhang | Rui Wang | Zhenghua Li | Yue Zhang | Fei Huang | Luo Si | Min Zhang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Fine-grained opinion mining (OM), which aims to find the opinion structures of “Who expressed what opinions towards what” in one sentence, has attracted increasing attention in the natural language processing (NLP) community. In this work, motivated by its span-based representations of opinion expressions and roles, we propose a unified span-based approach for the end-to-end OM setting. Furthermore, inspired by the unified span-based formalism of OM and constituent parsing, we explore two different methods (multi-task learning and graph convolutional neural network) to integrate syntactic constituents into the proposed model to help OM. We conduct experiments on the commonly used MPQA 2.0 dataset. The experimental results show that our proposed unified span-based approach achieves significant improvements over previous works in the exact F1 score and reduces the number of wrongly-predicted opinion expressions and roles, showing the effectiveness of our method. In addition, incorporating the syntactic constituents achieves promising improvements over the strong baseline enhanced by contextualized word representations.

Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios
Haipeng Sun | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Unsupervised neural machine translation (UNMT) that relies solely on massive monolingual corpora has achieved remarkable results in several translation tasks. However, in real-world scenarios, massive monolingual corpora do not exist for some extremely low-resource languages such as Estonian, and UNMT systems usually perform poorly when an adequate training corpus is not available for one language. In this paper, we first define and analyze the unbalanced training data scenario for UNMT. Based on this scenario, we propose UNMT self-training mechanisms to train a robust UNMT system and improve its performance in this case. Experimental results on several language pairs show that the proposed methods substantially outperform conventional UNMT systems.

MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training
Mingliang Zeng | Xu Tan | Rui Wang | Zeqian Ju | Tao Qin | Tie-Yan Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

SentiX: A Sentiment-Aware Pre-Trained Model for Cross-Domain Sentiment Analysis
Jie Zhou | Junfeng Tian | Rui Wang | Yuanbin Wu | Wenming Xiao | Liang He
Proceedings of the 28th International Conference on Computational Linguistics

Pre-trained language models have been widely applied to cross-domain NLP tasks like sentiment analysis, achieving state-of-the-art performance. However, due to the variety of users’ emotional expressions across domains, fine-tuning the pre-trained models on the source domain tends to overfit, leading to inferior results on the target domain. In this paper, we pre-train a sentiment-aware language model (SentiX) via domain-invariant sentiment knowledge from large-scale review datasets, and utilize it for the cross-domain sentiment analysis task without fine-tuning. We propose several pre-training tasks based on existing lexicons and annotations at both token and sentence levels, such as emoticons, sentiment words, and ratings, without human interference. A series of experiments are conducted and the results indicate the great advantages of our model. We obtain new state-of-the-art results in all the cross-domain sentiment analysis tasks, and our proposed SentiX can be trained with only 1% of the samples (18 samples) while achieving better performance than BERT with 90% of the samples.

Semantic Role Labeling with Heterogeneous Syntactic Knowledge
Qingrong Xia | Rui Wang | Zhenghua Li | Yue Zhang | Min Zhang
Proceedings of the 28th International Conference on Computational Linguistics

Recently, due to the interplay between syntax and semantics, incorporating syntactic knowledge into neural semantic role labeling (SRL) has attracted much attention. Most of the previous syntax-aware SRL works focus on explicitly modeling homogeneous syntactic knowledge over tree outputs. In this work, we propose to encode heterogeneous syntactic knowledge for SRL from both explicit and implicit representations. First, we introduce graph convolutional networks to explicitly encode multiple heterogeneous dependency parse trees. Second, we extract the implicit syntactic representations from a syntactic parser trained with heterogeneous treebanks. Finally, we inject the two types of heterogeneous syntax-aware representations into the base SRL model as extra inputs. We conduct experiments on two widely-used benchmark datasets, i.e., Chinese Proposition Bank 1.0 and the English CoNLL-2005 dataset. Experimental results show that incorporating heterogeneous syntactic knowledge brings significant improvements over strong baselines. We further conduct detailed analysis to gain insights on the usefulness of heterogeneous (vs. homogeneous) syntactic knowledge and the effectiveness of our proposed approaches for modeling such knowledge.

Robust Unsupervised Neural Machine Translation with Adversarial Denoising Training
Haipeng Sun | Rui Wang | Kehai Chen | Xugang Lu | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 28th International Conference on Computational Linguistics

Unsupervised neural machine translation (UNMT) has recently attracted great interest in the machine translation community. The main advantage of UNMT is that the large amount of training text it requires is easy to collect, while on some translation tasks it performs only slightly worse than supervised neural machine translation, which requires expensive annotated translation pairs. In most studies, UNMT is trained with clean data without considering its robustness to noisy data. However, in real-world scenarios, the collected input sentences usually contain noise, which degrades the performance of the translation system since UNMT is sensitive to small perturbations of the input sentences. In this paper, we explicitly take noisy data into consideration for the first time to improve the robustness of UNMT-based systems. First, we clearly define two types of noise in training sentences, i.e., word noise and word order noise, and empirically investigate their effect on UNMT; we then propose adversarial training methods with a denoising process for UNMT. Experimental results on several language pairs show that our proposed methods substantially improve the robustness of conventional UNMT systems in noisy scenarios.

Chinese Grammatical Error Diagnosis with Graph Convolution Network and Multi-task Learning
Yikang Luo | Zuyi Bao | Chen Li | Rui Wang
Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications

This paper describes our participating system on the Chinese Grammatical Error Diagnosis (CGED) 2020 shared task. For the detection subtask, we propose two BERT-based approaches 1) with syntactic dependency trees enhancing the model performance and 2) under the multi-task learning framework to combine the sequence labeling and the sequence-to-sequence (seq2seq) models. For the correction subtask, we utilize the masked language model, the seq2seq model and the spelling check model to generate corrections based on the detection results. Finally, our system achieves the highest recall rate on the top-3 correction and the second best F1 score on identification level and position level.

High-order Semantic Role Labeling
Zuchao Li | Hai Zhao | Rui Wang | Kevin Parnow
Findings of the Association for Computational Linguistics: EMNLP 2020

Semantic role labeling is primarily used to identify predicates, arguments, and their semantic relationships. Due to the limitations of modeling methods and the conditions of pre-identified predicates, previous work has focused on the relationships between predicates and arguments and the correlations between arguments at most, while the correlations between predicates have been neglected for a long time. High-order features and structure learning were very common in modeling such correlations before the neural network era. In this paper, we introduce a high-order graph structure for the neural semantic role labeling model, which enables the model to explicitly consider not only the isolated predicate-argument pairs but also the interaction between the predicate-argument pairs. Experimental results on 7 languages of the CoNLL-2009 benchmark show that the high-order structural learning techniques are beneficial to the strong performing SRL models and further boost our baseline to achieve new state-of-the-art results.

Chunk-based Chinese Spelling Check with Global Optimization
Zuyi Bao | Chen Li | Rui Wang
Findings of the Association for Computational Linguistics: EMNLP 2020

Chinese spelling check is a challenging task due to the characteristics of the Chinese language, such as the large character set, no word boundary, and short word length. On the one hand, most of the previous works only consider corrections with similar character pronunciation or shape, failing to correct visually and phonologically irrelevant typos. On the other hand, pipeline-style architectures are widely adopted to deal with different types of spelling errors in individual modules, which is difficult to optimize. In order to handle these issues, in this work, 1) we extend the traditional confusion sets with semantical candidates to cover different types of errors; 2) we propose a chunk-based framework to correct single-character and multi-character word errors uniformly; and 3) we adopt a global optimization strategy to enable a sentence-level correction selection. The experimental results show that the proposed approach achieves a new state-of-the-art performance on three benchmark datasets, as well as an optical character recognition dataset.

Integrating Task Specific Information into Pretrained Language Models for Low Resource Fine Tuning
Rui Wang | Shijing Si | Guoyin Wang | Lei Zhang | Lawrence Carin | Ricardo Henao
Findings of the Association for Computational Linguistics: EMNLP 2020

Pretrained Language Models (PLMs) have improved the performance of natural language understanding in recent years. Such models are pretrained on large corpora, which encode the general prior knowledge of natural languages but are agnostic to information characteristic of downstream tasks. This often results in overfitting when fine-tuned with low resource datasets where task-specific information is limited. In this paper, we integrate label information as a task-specific prior into the self-attention component of pretrained BERT models. Experiments on several benchmarks and real-world datasets suggest that the proposed approach can largely improve the performance of pretrained models when fine-tuning with small datasets.

Reference Language based Unsupervised Neural Machine Translation
Zuchao Li | Hai Zhao | Rui Wang | Masao Utiyama | Eiichiro Sumita
Findings of the Association for Computational Linguistics: EMNLP 2020

Exploiting a common language as an auxiliary for better translation has a long tradition in machine translation and lets supervised learning-based machine translation enjoy the enhancement delivered by the well-used pivot language in the absence of a source language to target language parallel corpus. The rise of unsupervised neural machine translation (UNMT) almost completely relieves the parallel corpus curse, though UNMT is still subject to unsatisfactory performance due to the vagueness of the clues available for its core back-translation training. Further enriching the idea of pivot translation by extending the use of parallel corpora beyond the source-target paradigm, we propose a new reference language-based framework for UNMT, RUNMT, in which the reference language only shares a parallel corpus with the source, but this corpus still indicates a signal clear enough to help the reconstruction training of UNMT through a proposed reference agreement mechanism. Experimental results show that our methods improve the quality of UNMT over that of a strong baseline that uses only one auxiliary language, demonstrating the usefulness of the proposed reference language-based UNMT and establishing a good start for the community.

Neural Topic Modeling with Bidirectional Adversarial Training
Rui Wang | Xuemeng Hu | Deyu Zhou | Yulan He | Yuxuan Xiong | Chenchen Ye | Haiyang Xu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent years have witnessed a surge of interest in using neural topic models for automatic topic extraction from text, since they avoid the complicated mathematical derivations for model inference required by traditional topic models such as Latent Dirichlet Allocation (LDA). However, these models typically either assume an improper prior (e.g. Gaussian or Logistic Normal) over the latent topic space or cannot infer the topic distribution for a given document. To address these limitations, we propose a neural topic modeling approach, called Bidirectional Adversarial Topic (BAT) model, which represents the first attempt at applying bidirectional adversarial training to neural topic modeling. The proposed BAT builds a two-way projection between the document-topic distribution and the document-word distribution. It uses a generator to capture the semantic patterns from texts and an encoder for topic inference. Furthermore, to incorporate word relatedness information, the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT) is extended from BAT. To verify the effectiveness of BAT and Gaussian-BAT, three benchmark corpora are used in our experiments. The experimental results show that BAT and Gaussian-BAT obtain more coherent topics, outperforming several competitive baselines. Moreover, when performing text clustering based on the extracted topics, our models outperform all the baselines, with more significant improvements achieved by Gaussian-BAT, where an increase of nearly 6% is observed in accuracy.

Content Word Aware Neural Machine Translation
Kehai Chen | Rui Wang | Masao Utiyama | Eiichiro Sumita
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural machine translation (NMT) encodes the source sentence in a universal way to generate the target sentence word-by-word. However, NMT does not consider the importance of each word to the sentence meaning; for example, some words (i.e., content words) express more important meaning than others (i.e., function words). To address this limitation, we first utilize word frequency information to distinguish between content and function words in a sentence, and then design a content word-aware NMT to improve translation performance. Empirical results on the WMT14 English-to-German, WMT14 English-to-French, and WMT17 Chinese-to-English translation tasks show that the proposed methods can significantly improve the performance of Transformer-based NMT.

Relational Graph Attention Network for Aspect-based Sentiment Analysis
Kai Wang | Weizhou Shen | Yunyi Yang | Xiaojun Quan | Rui Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Aspect-based sentiment analysis aims to determine the sentiment polarity towards a specific aspect in online reviews. Most recent efforts adopt attention-based neural network models to implicitly connect aspects with opinion words. However, due to the complexity of language and the existence of multiple aspects in a single sentence, these models often confuse the connections. In this paper, we address this problem by means of effective encoding of syntax information. Firstly, we define a unified aspect-oriented dependency tree structure rooted at a target aspect by reshaping and pruning an ordinary dependency parse tree. Then, we propose a relational graph attention network (R-GAT) to encode the new tree structure for sentiment prediction. Extensive experiments are conducted on the SemEval 2014 and Twitter datasets, and the experimental results confirm that the connections between aspects and opinion words can be better established with our approach, and the performance of the graph attention network (GAT) is significantly improved as a consequence.

Syntax-Aware Opinion Role Labeling with Dependency Graph Convolutional Networks
Bo Zhang | Yue Zhang | Rui Wang | Zhenghua Li | Min Zhang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Opinion role labeling (ORL) is a fine-grained opinion analysis task and aims to answer “who expressed what kind of sentiment towards what?”. Due to the scarcity of labeled data, ORL remains challenging for data-driven methods. In this work, we try to enhance neural ORL models with syntactic knowledge by comparing and integrating different representations. We also propose dependency graph convolutional networks (DEPGCN) to encode parser information at different processing levels. In order to compensate for parser inaccuracy and reduce error propagation, we introduce multi-task learning (MTL) to train the parser and the ORL model simultaneously. We verify our methods on the benchmark MPQA corpus. The experimental results show that syntactic information is highly valuable for ORL, and our final MTL model effectively boosts the F1 score by 9.29 over the syntax-agnostic baseline. In addition, we find that the contributions from syntactic knowledge do not fully overlap with contextualized word representations (BERT). Our best model achieves an F1 score 4.34 higher than the current state-of-the-art.

Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation
Haipeng Sun | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs. However, it can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time. That is, research on multilingual UNMT has been limited. In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder, making use of multilingual data to improve UNMT for all language pairs. On the basis of the empirical findings, we propose two knowledge distillation methods to further enhance multilingual UNMT performance. Our experiments on a dataset with English translated to and from twelve other languages (including three language families and six language branches) show remarkable results, surpassing strong unsupervised individual baselines while achieving promising performance between non-English language pairs in zero-shot translation scenarios and alleviating poor performance in low-resource language pairs.

Multi-Domain Dialogue Acts and Response Co-Generation
Kai Wang | Junfeng Tian | Rui Wang | Xiaojun Quan | Jianxing Yu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Generating fluent and informative responses is of critical importance for task-oriented dialogue systems. Existing pipeline approaches generally predict multiple dialogue acts first and use them to assist response generation. There are at least two shortcomings with such approaches. First, the inherent structures of multi-domain dialogue acts are neglected. Second, the semantic associations between acts and responses are not taken into account for response generation. To address these issues, we propose a neural co-generation model that generates dialogue acts and responses concurrently. Unlike those pipeline approaches, our act generation module preserves the semantic structures of multi-domain dialogue acts and our response generation module dynamically attends to different acts as needed. We train the two modules jointly using an uncertainty loss to adjust their task weights adaptively. Extensive experiments are conducted on the large-scale MultiWOZ dataset and the results show that our model achieves very favorable improvement over several state-of-the-art models in both automatic and human evaluations.

Regularized Context Gates on Transformer for Machine Translation
Xintong Li | Lemao Liu | Rui Wang | Guoping Huang | Max Meng
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Context gates are effective for controlling the contributions from the source and target contexts in recurrent neural network (RNN) based neural machine translation (NMT). However, it is challenging to extend them to the advanced Transformer architecture, which is more complicated than the RNN. This paper first provides a method to identify source and target contexts and then introduces a gate mechanism to control the source and target contributions in Transformer. In addition, to further reduce the bias problem in the gate mechanism, this paper proposes a regularization method to guide the learning of the gates with supervision automatically generated using pointwise mutual information. Extensive experiments on 4 translation datasets demonstrate that the proposed model obtains an average gain of 1.0 BLEU score over a strong Transformer baseline.

SJTU-NICT’s Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task
Zuchao Li | Hai Zhao | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita
Proceedings of the Fifth Conference on Machine Translation

In this paper, we introduce our joint team SJTU-NICT’s participation in the WMT 2020 machine translation shared task. In this shared task, we participated in four translation directions of three language pairs: English-Chinese and English-Polish on the supervised machine translation track, and German-Upper Sorbian on the low-resource and unsupervised machine translation tracks. Based on the different conditions of the language pairs, we experimented with diverse neural machine translation (NMT) techniques: document-enhanced NMT, XLM pre-trained language model enhanced NMT, bidirectional translation as pre-training, reference language based UNMT, data-dependent Gaussian prior objective, and BT-BLEU collaborative filtering self-training. We also used the TF-IDF algorithm to filter the training set to obtain a set closer in domain to the test set for finetuning. In our submissions, the primary systems won first place in the English to Chinese, Polish to English, and German to Upper Sorbian translation directions.

Neural Topic Modeling by Incorporating Document Relationship Graph
Deyu Zhou | Xuemeng Hu | Rui Wang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Graph Neural Networks (GNNs) that capture the relationships between graph nodes via message passing have been a hot research direction in the natural language processing community. In this paper, we propose Graph Topic Model (GTM), a GNN based neural topic model that represents a corpus as a document relationship graph. Documents and words in the corpus become nodes in the graph and are connected based on document-word co-occurrences. By introducing the graph structure, the relationships between documents are established through their shared words and thus the topical representation of a document is enriched by aggregating information from its neighboring nodes using graph convolution. Extensive experiments on three datasets were conducted and the results demonstrate the effectiveness of the proposed approach.

Neural Topic Modeling with Cycle-Consistent Adversarial Training
Xuemeng Hu | Rui Wang | Deyu Zhou | Yuxuan Xiong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Advances in deep generative models have attracted significant research interest in neural topic modeling. The recently proposed Adversarial-neural Topic Model models topics with an adversarially trained generator network and employs a Dirichlet prior to capture the semantic patterns in latent topics. It is effective in discovering coherent topics but unable to infer topic distributions for given documents or utilize available document labels. To overcome such limitations, we propose Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT) and its supervised version sToMCAT. ToMCAT employs a generator network to interpret topics and an encoder network to infer document topics. Adversarial training and cycle-consistent constraints are used to encourage the generator and the encoder to produce realistic samples that coordinate with each other. sToMCAT extends ToMCAT by incorporating document labels into the topic modeling process to help discover more coherent topics. The effectiveness of the proposed models is evaluated on unsupervised/supervised topic modeling and text classification. The experimental results show that our models can produce both coherent and informative topics, outperforming a number of competitive baselines.

2019

Open Event Extraction from Online Text using a Generative Adversarial Network
Rui Wang | Deyu Zhou | Yulan He
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

To extract the structured representations of open-domain events, Bayesian graphical models have made some progress. However, these approaches typically assume that all words in a document are generated from a single event. While this may be true for short text such as tweets, such an assumption does not generally hold for long text such as news articles. Moreover, Bayesian graphical models often rely on Gibbs sampling for parameter inference, which may take a long time to converge. To address these limitations, we propose an event extraction model based on Generative Adversarial Nets, called Adversarial-neural Event Model (AEM). AEM models an event with a Dirichlet prior and uses a generator network to capture the patterns underlying latent events. A discriminator is used to distinguish documents reconstructed from the latent events and the original documents. A byproduct of the discriminator is that the features generated by the learned discriminator network allow the visualization of the extracted events. Our model has been evaluated on two Twitter datasets and a news article dataset. Experimental results show that our model outperforms the baseline approaches on all the datasets, with more significant improvements observed on the news article dataset where an increase of 15% is observed in F-measure.

Syntax-Enhanced Self-Attention-Based Semantic Role Labeling
Yue Zhang | Rui Wang | Luo Si
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

As a fundamental NLP task, semantic role labeling (SRL) aims to discover the semantic roles for each predicate within one sentence. This paper investigates how to incorporate syntactic knowledge into the SRL task effectively. We present different approaches of encoding the syntactic information derived from dependency trees of different quality and representations; we propose a syntax-enhanced self-attention model and compare it with two other strong baseline methods; and we conduct experiments with newly published deep contextualized word representations as well. The experiment results demonstrate that with proper incorporation of the high quality syntactic information, our model achieves a new state-of-the-art performance for the Chinese SRL task on the CoNLL-2009 dataset.

Attention Optimization for Abstractive Document Summarization
Min Gui | Junfeng Tian | Rui Wang | Zhenglu Yang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Attention plays a key role in the improvement of sequence-to-sequence-based document summarization models. To obtain a powerful attention mechanism that helps reproduce the most salient information and avoid repetitions, we augment the vanilla attention model from both local and global aspects. We propose an attention refinement unit paired with a local variance loss to impose supervision on the attention model at each decoding step, and we also propose a global variance loss to optimize the attention distributions of all decoding steps from the global perspective. Results on the CNN/Daily Mail dataset verify the effectiveness of our methods.

Recurrent Positional Embedding for Neural Machine Translation
Kehai Chen | Rui Wang | Masao Utiyama | Eiichiro Sumita
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In the Transformer network architecture, positional embeddings are used to encode order dependencies into the input representation. However, this input representation only involves static order dependencies based on discrete numerical information, that is, they are independent of word content. To address this issue, this work proposes a recurrent positional embedding approach based on word vectors. In this approach, these recurrent positional embeddings are learned by a recurrent neural network, encoding word content-based order dependencies into the input representation. They are then integrated into the existing multi-head self-attention model as independent heads or part of each head. The experimental results revealed that the proposed approach improved translation performance over that of the state-of-the-art Transformer baseline in WMT’14 English-to-German and NIST Chinese-to-English translation tasks.

English-Myanmar Supervised and Unsupervised NMT: NICT’s Machine Translation Systems at WAT-2019
Rui Wang | Haipeng Sun | Kehai Chen | Chenchen Ding | Masao Utiyama | Eiichiro Sumita
Proceedings of the 6th Workshop on Asian Translation

This paper presents NICT’s participation (team ID: NICT) in the 6th Workshop on Asian Translation (WAT-2019) shared translation task, specifically the Myanmar (Burmese) - English task in both translation directions. We built neural machine translation (NMT) systems for these tasks. Our NMT systems were trained with language model pretraining, and back-translation was also applied. Our NMT systems rank third in English-to-Myanmar and second in Myanmar-to-English according to BLEU score.

pdf bib
SJTU-NICT at MRP 2019: Multi-Task Learning for End-to-End Uniform Semantic Graph Parsing
Zuchao Li | Hai Zhao | Zhuosheng Zhang | Rui Wang | Masao Utiyama | Eiichiro Sumita
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

This paper describes the SJTU-NICT system for the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference for Computational Language Learning (CoNLL). Our system uses a graph-based approach to model a variety of semantic graph parsing tasks. Our main contributions in the submitted system are summarized as follows: 1. Our model is fully end-to-end and can be trained on the given training set alone, without relying on any extra training source, including the companion data provided by the organizer; 2. We extend our graph pruning algorithm to a variety of semantic graphs, solving the problem of excessive semantic graph search space; 3. We introduce multi-task learning for multiple objectives within the same framework. The evaluation results show that our system achieved second place in the overall F1 score and achieved the best F1 score on the DM framework.

pdf bib
SUDA-Alibaba at MRP 2019: Graph-Based Models with BERT
Yue Zhang | Wei Jiang | Qingrong Xia | Junjie Cao | Rui Wang | Zhenghua Li | Min Zhang
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning

In this paper, we describe our participating systems in the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference for Computational Language Learning (CoNLL). The task includes five frameworks for graph-based meaning representations, i.e., DM, PSD, EDS, UCCA, and AMR. One common characteristic of our systems is that we employ graph-based methods instead of transition-based methods when predicting edges between nodes. For SDP, we jointly perform edge prediction, frame tagging, and POS tagging via multi-task learning (MTL). For UCCA, we also jointly model a constituent tree parsing and a remote edge recovery task. For both EDS and AMR, we produce nodes first and edges second in a pipeline fashion. External resources like BERT are found helpful for all frameworks except AMR. Our final submission ranks third on the overall MRP evaluation metric, first on EDS and second on UCCA.

pdf bib
NICT’s Supervised Neural Machine Translation Systems for the WMT19 News Translation Task
Raj Dabre | Kehai Chen | Benjamin Marie | Rui Wang | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper, we describe our supervised neural machine translation (NMT) systems that we developed for the news translation task for Kazakh↔English, Gujarati↔English, Chinese↔English, and English→Finnish translation directions. We focused on leveraging multilingual transfer learning and back-translation for the extremely low-resource language pairs: Kazakh↔English and Gujarati↔English translation. For the Chinese↔English translation, we used the provided parallel data augmented with a large quantity of back-translated monolingual data to train state-of-the-art NMT systems. We then employed techniques that have been proven to be most effective, such as back-translation, fine-tuning, and model ensembling, to generate the primary submissions of Chinese↔English. For English→Finnish, our submission from WMT18 remains a strong baseline despite the increase in parallel corpora for this year’s task.

pdf bib
NICT’s Unsupervised Neural and Statistical Machine Translation Systems for the WMT19 News Translation Task
Benjamin Marie | Haipeng Sun | Rui Wang | Kehai Chen | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper presents the NICT’s participation in the WMT19 unsupervised news translation task. We participated in the unsupervised translation direction: German-Czech. Our primary submission to the task is the result of a simple combination of our unsupervised neural and statistical machine translation systems. Our system is ranked first for the German-to-Czech translation task, using only the data provided by the organizers (“constrained”), according to both BLEU-cased and human evaluation. We also performed contrastive experiments with other language pairs, namely, English-Gujarati and English-Kazakh, to better assess the effectiveness of unsupervised machine translation for distant language pairs and in truly low-resource conditions.

pdf bib
Unsupervised Bilingual Word Embedding Agreement for Unsupervised Neural Machine Translation
Haipeng Sun | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Unsupervised bilingual word embedding (UBWE), together with other technologies such as back-translation and denoising, has helped unsupervised neural machine translation (UNMT) achieve remarkable results in several language pairs. In previous methods, UBWE is first trained using non-parallel monolingual corpora and then this pre-trained UBWE is used to initialize the word embedding in the encoder and decoder of UNMT. That is, the training of UBWE and UNMT are separate. In this paper, we first empirically investigate the relationship between UBWE and UNMT. The empirical findings show that the performance of UNMT is significantly affected by the performance of UBWE. Thus, we propose two methods that train UNMT with UBWE agreement. Empirical results on several language pairs show that the proposed methods significantly outperform conventional UNMT.

pdf bib
Neural Machine Translation with Reordering Embeddings
Kehai Chen | Rui Wang | Masao Utiyama | Eiichiro Sumita
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The reordering model plays an important role in phrase-based statistical machine translation. However, there are few works that exploit the reordering information in neural machine translation. In this paper, we propose a reordering mechanism to learn the reordering embedding of a word based on its contextual information. These learned reordering embeddings are stacked together with self-attention networks to learn sentence representation for machine translation. The reordering mechanism can be easily integrated into both the encoder and the decoder in the Transformer translation system. Experimental results on WMT’14 English-to-German, NIST Chinese-to-English, and WAT Japanese-to-English translation tasks demonstrate that the proposed methods can significantly improve the performance of the Transformer.

pdf bib
BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization
Kai Wang | Xiaojun Quan | Rui Wang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The success of neural summarization models stems from the meticulous encodings of source articles. To overcome the impediments of limited and sometimes noisy training data, one promising direction is to make better use of the available training data by applying filters during summarization. In this paper, we propose a novel Bi-directional Selective Encoding with Template (BiSET) model, which leverages template discovered from training data to softly select key information from each source article to guide its summarization process. Extensive experiments on a standard summarization dataset are conducted and the results show that the template-equipped BiSET model manages to improve the summarization performance significantly with a new state of the art.

pdf bib
Semi-supervised Domain Adaptation for Dependency Parsing
Zhenghua Li | Xue Peng | Min Zhang | Rui Wang | Luo Si
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

During the past decades, due to the lack of sufficient labeled data, most studies on cross-domain parsing have focused on unsupervised domain adaptation, assuming there is no target-domain training data. However, unsupervised approaches have made limited progress so far due to the intrinsic difficulty of both domain adaptation and parsing. This paper tackles the semi-supervised domain adaptation problem for Chinese dependency parsing, based on two newly-annotated large-scale domain-aware datasets. We propose a simple domain embedding approach to merge the source- and target-domain training data, which is shown to be more effective than both direct corpus concatenation and multi-task learning. In order to utilize unlabeled target-domain data, we employ the recent contextualized word representations and show that a simple fine-tuning procedure can further boost cross-domain parsing accuracy by a large margin.

pdf bib
Sentence-Level Agreement for Neural Machine Translation
Mingming Yang | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Min Zhang | Tiejun Zhao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The training objective of neural machine translation (NMT) is to minimize the loss between the words in the translated sentences and those in the references. In NMT, there is a natural correspondence between the source sentence and the target sentence. However, this relationship has only been represented using the entire neural network, and the training objective is computed at the word level. In this paper, we propose a sentence-level agreement module to directly minimize the difference between the representations of the source and target sentences. The proposed agreement module can be integrated into NMT as an additional training objective function and can also be used to enhance the representation of the source sentences. Empirical results on the NIST Chinese-to-English and WMT English-to-German tasks show the proposed agreement module can significantly improve the NMT performance.

pdf bib
Lattice-Based Transformer Encoder for Neural Machine Translation
Fengshun Xiao | Jiangtong Li | Hai Zhao | Rui Wang | Kehai Chen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Neural machine translation (NMT) takes deterministic sequences for source representations. However, both word-level and subword-level segmentation offer multiple ways to split a source sequence, depending on the word segmentor or the subword vocabulary size. We hypothesize that the diversity in segmentations may affect the NMT performance. To integrate different segmentations with the state-of-the-art NMT model, Transformer, we propose lattice-based encoders to explore effective word or subword representation in an automatic way during training. We propose two methods: 1) lattice positional encoding and 2) lattice-aware self-attention. These two methods can be used together and are complementary to each other, further improving translation performance. Experimental results show the superiority of lattice-based encoders in word-level and subword-level representations over the conventional Transformer encoder.

2018

pdf bib
A Survey of Domain Adaptation for Neural Machine Translation
Chenhui Chu | Rui Wang
Proceedings of the 27th International Conference on Computational Linguistics

Neural machine translation (NMT) is a deep learning based approach for machine translation, which yields the state-of-the-art translation performance in scenarios where large-scale parallel corpora are available. Although high-quality, domain-specific translation is crucial in the real world, domain-specific corpora are usually scarce or nonexistent, and thus vanilla NMT performs poorly in such scenarios. Domain adaptation, which leverages both out-of-domain parallel corpora and monolingual corpora for in-domain translation, is very important for domain-specific translation. In this paper, we give a comprehensive survey of the state-of-the-art domain adaptation techniques for NMT.

pdf bib
English-Myanmar NMT and SMT with Pre-ordering: NICT’s Machine Translation Systems at WAT-2018
Rui Wang | Chenchen Ding | Masao Utiyama | Eiichiro Sumita
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation: 5th Workshop on Asian Translation

pdf bib
Dynamic Sentence Sampling for Efficient Training of Neural Machine Translation
Rui Wang | Masao Utiyama | Eiichiro Sumita
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Traditional neural machine translation (NMT) involves a fixed training procedure where each sentence is sampled once during each epoch. In reality, some sentences are well-learned during the initial few epochs; however, using this approach, the well-learned sentences would continue to be trained along with those sentences that were not well learned for 10-30 epochs, which wastes time. Here, we propose an efficient method to dynamically sample the sentences in order to accelerate the NMT training. In this approach, a weight is assigned to each sentence based on the measured difference between the training costs of two iterations. Further, in each epoch, a certain percentage of sentences are dynamically sampled according to their weights. Empirical results based on the NIST Chinese-to-English and the WMT English-to-German tasks show that the proposed method can significantly accelerate the NMT training and improve the NMT performance.
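
As an illustration only, the weighting and sampling steps described in this abstract might be sketched as follows. The abstract does not give the exact formula, so the weight used here (the absolute relative change in a sentence's training cost between two iterations) and the function names `sentence_weights` and `sample_sentences` are assumptions, not the authors' implementation.

```python
import random


def sentence_weights(prev_costs, curr_costs, eps=1e-8):
    """Assumed weighting: absolute relative change in per-sentence cost
    between two iterations, normalized to sum to 1."""
    raw = [abs(p - c) / (abs(p) + eps) for p, c in zip(prev_costs, curr_costs)]
    total = sum(raw)
    if total == 0:
        # No sentence changed: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def sample_sentences(weights, fraction, rng):
    """Pick round(fraction * n) distinct sentence indices, favoring
    higher weights (weighted sampling without replacement)."""
    k = max(1, round(fraction * len(weights)))
    keyed = []
    for i, w in enumerate(weights):
        # Random key u ** (1 / w): larger weights tend to give larger keys.
        key = rng.random() ** (1.0 / w) if w > 0 else 0.0
        keyed.append((key, i))
    keyed.sort(reverse=True)
    return sorted(i for _, i in keyed[:k])


# Hypothetical per-sentence costs from two consecutive iterations.
prev = [4.0, 4.0, 4.0, 4.0]
curr = [4.0, 2.0, 3.0, 4.0]
weights = sentence_weights(prev, curr)
chosen = sample_sentences(weights, fraction=0.5, rng=random.Random(0))
```

In this toy run, the two sentences whose cost did not change receive zero weight, so the half of the data selected for the next epoch consists of the two sentences that are still changing.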

pdf bib
NICT’s Neural and Statistical Machine Translation Systems for the WMT18 News Translation Task
Benjamin Marie | Rui Wang | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents the NICT’s participation in the WMT18 shared news translation task. We participated in the eight translation directions of four language pairs: Estonian-English, Finnish-English, Turkish-English and Chinese-English. For each translation direction, we prepared state-of-the-art statistical (SMT) and neural (NMT) machine translation systems. Our NMT systems were trained with the transformer architecture using the provided parallel data enlarged with a large quantity of back-translated monolingual data that we generated with a new incremental training framework. Our primary submissions to the task are the result of a simple combination of our SMT and NMT systems. Our systems are ranked first for the Estonian-English and Finnish-English language pairs (constrained) according to BLEU-cased.

pdf bib
NICT’s Corpus Filtering Systems for the WMT18 Parallel Corpus Filtering Task
Rui Wang | Benjamin Marie | Masao Utiyama | Eiichiro Sumita
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper presents the NICT’s participation in the WMT18 shared parallel corpus filtering task. The organizers provided a 1-billion-word German-English corpus crawled from the web as part of the Paracrawl project. This corpus is too noisy to build an acceptable neural machine translation (NMT) system. Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data. Finally, we sampled 100 million and 10 million words and built corresponding NMT systems. Empirical results show that our NMT systems trained on sampled data achieve promising performance.

pdf bib
Exploring Recombination for Efficient Decoding of Neural Machine Translation
Zhisong Zhang | Rui Wang | Masao Utiyama | Eiichiro Sumita | Hai Zhao
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In Neural Machine Translation (NMT), the decoder can capture the features of the entire prediction history with neural connections and representations. This means that partial hypotheses with different prefixes will be regarded differently no matter how similar they are. However, this might be inefficient since some partial hypotheses can contain only local differences that will not influence future predictions. In this work, we introduce recombination in NMT decoding based on the concept of the “equivalence” of partial hypotheses. Heuristically, we use a simple n-gram suffix based equivalence function and adapt it into beam search decoding. Through experiments on large-scale Chinese-to-English and English-to-German translation tasks, we show that the proposed method can obtain similar translation quality with a smaller beam size, making NMT decoding more efficient.
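
A minimal sketch of the n-gram suffix idea, under assumptions: two partial hypotheses are treated as equivalent when their last n tokens match, and only the best-scoring member of each equivalence class is kept. The function name `recombine`, the hypothesis layout (token list plus log-probability score) and the absence of any further conditions such as length checks are illustrative choices, not details taken from the paper.

```python
def recombine(hypotheses, n):
    """Merge partial hypotheses that share the same n-gram suffix,
    keeping the highest-scoring hypothesis in each group.

    hypotheses: list of (tokens, score) pairs, where score is a
    log-probability (higher is better).
    """
    best = {}
    for tokens, score in hypotheses:
        key = tuple(tokens[-n:])  # the assumed equivalence function
        if key not in best or score > best[key][1]:
            best[key] = (tokens, score)
    # Return the survivors ordered from best to worst, as a beam would.
    return sorted(best.values(), key=lambda h: h[1], reverse=True)


# Hypothetical beam contents at one decoding step.
beam = [
    (["the", "cat", "sat", "on"], -1.0),
    (["a", "cat", "sat", "on"], -1.5),
    (["the", "dog", "sat", "on"], -2.0),
    (["the", "cat", "lay", "on"], -1.2),
]
merged_bigram = recombine(beam, n=2)
merged_trigram = recombine(beam, n=3)
```

With a bigram suffix the three hypotheses ending in "sat on" collapse into one, freeing beam slots; with a trigram suffix only the two ending in "cat sat on" are merged. This is the sense in which a smaller beam can cover a similar range of continuations.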

2017

pdf bib
Sentence Embedding for Neural Machine Translation Domain Adaptation
Rui Wang | Andrew Finch | Masao Utiyama | Eiichiro Sumita
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Although new corpora are becoming increasingly available for machine translation, only those that belong to the same or similar domains are typically able to improve translation performance. Recently Neural Machine Translation (NMT) has become prominent in the field. However, most of the existing domain adaptation methods only focus on phrase-based machine translation. In this paper, we exploit the NMT’s internal embedding of the source sentence and use the sentence embedding similarity to select the sentences which are close to in-domain data. The empirical adaptation results on the IWSLT English-French and NIST Chinese-English tasks show that the proposed methods can substantially improve NMT performance by 2.4-9.0 BLEU points, outperforming the existing state-of-the-art baseline by 2.3-4.5 BLEU points.

pdf bib
Instance Weighting for Neural Machine Translation Domain Adaptation
Rui Wang | Masao Utiyama | Lemao Liu | Kehai Chen | Eiichiro Sumita
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Instance weighting has been widely applied to phrase-based machine translation domain adaptation. However, it is challenging to apply it to Neural Machine Translation (NMT) directly, because NMT is not a linear model. In this paper, two instance weighting technologies, i.e., sentence weighting and domain weighting with a dynamic weight learning strategy, are proposed for NMT domain adaptation. Empirical results on the IWSLT English-German/French tasks show that the proposed methods can substantially improve NMT performance by up to 2.7-6.7 BLEU points, outperforming the existing baselines by up to 1.6-3.6 BLEU points.

pdf bib
Neural Machine Translation with Source Dependency Representation
Kehai Chen | Rui Wang | Masao Utiyama | Lemao Liu | Akihiro Tamura | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Source dependency information has been successfully introduced into statistical machine translation. However, there are only a few preliminary attempts for Neural Machine Translation (NMT), such as concatenating the representations of a source word and its dependency label. In this paper, we propose a novel NMT with source dependency representation to improve the translation performance of NMT, especially for long sentences. Empirical results on the NIST Chinese-to-English translation task show that our method achieves 1.6 BLEU improvements on average over a strong NMT system.

pdf bib
Context-Aware Smoothing for Neural Machine Translation
Kehai Chen | Rui Wang | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In Neural Machine Translation (NMT), each word is represented as a low-dimensional, real-valued vector encoding its syntactic and semantic information. This means that even if the word appears in a different sentence context, it is represented by the same fixed vector when learning the source representation. Moreover, a large number of Out-Of-Vocabulary (OOV) words, which have different syntactic and semantic information, are all represented by the same vector for “unk”. To alleviate this problem, we propose a novel context-aware smoothing method to dynamically learn a sentence-specific vector for each word (including OOV words) depending on its local context words in a sentence. The learned context-aware representation is integrated into the NMT to improve the translation performance. Empirical results on the NIST Chinese-to-English translation task show that the proposed approach achieves 1.78 BLEU improvements on average over a strong attentional NMT, and outperforms some existing systems.

2016

pdf bib
Featureless Domain-Specific Term Extraction with Minimal Labelled Data
Rui Wang | Wei Liu | Chris McDonald
Proceedings of the Australasian Language Technology Association Workshop 2016

pdf bib
Connecting Phrase based Statistical Machine Translation Adaptation
Rui Wang | Hai Zhao | Bao-Liang Lu | Masao Utiyama | Eiichiro Sumita
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Although more additional corpora are now available for Statistical Machine Translation (SMT), only the ones which belong to the same or similar domains of the original corpus can indeed enhance SMT performance directly. A series of SMT adaptation methods have been proposed to select these similar-domain data, and most of them focus on sentence selection. In comparison, phrase is a smaller and more fine grained unit for data selection, therefore we propose a straightforward and efficient connecting phrase based adaptation method, which is applied to both bilingual phrase pair and monolingual n-gram adaptation. The proposed method is evaluated on IWSLT/NIST data sets, and the results show that phrase based SMT performances are significantly improved (up to +1.6 in comparison with phrase based SMT baseline system and +0.9 in comparison with existing methods).

2015

pdf bib
English to Chinese Translation: How Chinese Character Matters
Rui Wang | Hai Zhao | Bao-Liang Lu
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
Neural Network Language Model for Chinese Pinyin Input Method Engine
Shenyuan Chen | Hai Zhao | Rui Wang
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
A Machine Learning Method to Distinguish Machine Translation from Human Translation
Yitong Li | Rui Wang | Hai Zhao
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters

2014

pdf bib
Senti-LSSVM: Sentiment-Oriented Multi-Relation Extraction with Latent Structural SVM
Lizhen Qu | Yi Zhang | Rui Wang | Lili Jiang | Rainer Gemulla | Gerhard Weikum
Transactions of the Association for Computational Linguistics, Volume 2

Extracting instances of sentiment-oriented relations from user-generated web documents is important for online marketing analysis. Unlike previous work, we formulate this extraction task as a structured prediction problem and design the corresponding inference as an integer linear program. Our latent structural SVM based model can learn from training corpora that do not contain explicit annotations of sentiment-bearing expressions, and it can simultaneously recognize instances of both binary (polarity) and ternary (comparative) relations with regard to entity mentions of interest. The empirical evaluation shows that our approach significantly outperforms state-of-the-art systems across domains (cameras and movies) and across genres (reviews and forum posts). The gold standard corpus that we built will also be a valuable resource for the community.

pdf bib
The SAS Statistical Machine Translation System for WAT 2014
Rui Wang | Xu Yang | Yan Gao
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

pdf bib
Aligning Predicate-Argument Structures for Paraphrase Fragment Extraction
Michaela Regneri | Rui Wang | Manfred Pinkal
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Paraphrases and paraphrasing algorithms have been found of great importance in various natural language processing tasks. While most paraphrase extraction approaches extract equivalent sentences, sentences are an inconvenient unit for further processing, because they are too specific, and often not exact paraphrases. Paraphrase fragment extraction is a technique that post-processes sentential paraphrases and prunes them to more convenient phrase-level units. We present a new approach that uses semantic roles to extract paraphrase fragments from sentence pairs that share semantic content to varying degrees, including full paraphrases. In contrast to previous systems, the use of semantic parses allows for extracting paraphrases with high wording variance and different syntactic categories. The approach is tested on four different input corpora and compared to two previous systems for extracting paraphrase fragments. Our system finds three times as many good paraphrase fragments per sentence pair as the baselines, and at the same time outputs 30% fewer unrelated fragment pairs.

pdf bib
Neural Network Based Bilingual Language Model Growing for Statistical Machine Translation
Rui Wang | Hai Zhao | Bao-Liang Lu | Masao Utiyama | Eiichiro Sumita
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
Converting Continuous-Space Language Models into N-Gram Language Models for Statistical Machine Translation
Rui Wang | Masao Utiyama | Isao Goto | Eiichiro Sumita | Hai Zhao | Bao-Liang Lu
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

pdf bib
Using Discourse Information for Paraphrase Extraction
Michaela Regneri | Rui Wang
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model
Rui Wang | Petya Osenova | Kiril Simov
Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)

pdf bib
Linguistically-Enriched Models for Bulgarian-to-English Machine Translation
Rui Wang | Petya Osenova | Kiril Simov
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation

pdf bib
Joint Grammar and Treebank Development for Mandarin Chinese with HPSG
Yi Zhang | Rui Wang | Yu Chen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the ongoing development of MCG, a linguistically deep and precise grammar for Mandarin Chinese together with its accompanying treebank, both based on the linguistic framework of HPSG, and using MRS as the semantic representation. We highlight some key features of our grammar design, and review a number of challenging phenomena, with comparisons to alternative linguistic treatments and implementations. One of the distinguishing characteristics of our approach is the tight integration of grammar and treebank development. The two-step treebank annotation procedure benefits from the efficiency of the discriminant-based annotation approach, while giving the annotators full freedom to produce extra-grammatical structures. This not only allows the creation of a precise and full-coverage treebank with an imperfect grammar, but also provides prompt feedback for grammarians to identify the errors in the grammar design and implementation. Preliminary evaluation and error analysis show that the grammar already covers most of the core phenomena for Mandarin Chinese, and the treebank annotation procedure reaches a stable speed of 35 sentences per hour with satisfying quality.

pdf bib
Constructing a Question Corpus for Textual Semantic Relations
Rui Wang | Shuguang Li
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Finding useful questions is a challenging task in Community Question Answering (CQA). Two key issues need to be resolved: 1) what is a useful question to the given reference question; and furthermore 2) what kind of relations exist between a given pair of questions. In order to answer these two questions, in this paper, we propose a fine-grained inventory of textual semantic relations between questions and annotate a corpus constructed from the WikiAnswers website. We also extract large archives of question pairs with user-generated links and use them as labeled data for separating useful questions from neutral ones, achieving 72.2% accuracy. We find such online CQA repositories to be valuable resources for related research.

pdf bib
Sentence Realization with Unlexicalized Tree Linearization Grammars
Rui Wang | Yi Zhang
Proceedings of COLING 2012: Posters

2011

pdf bib
Paraphrase Fragment Extraction from Monolingual Comparable Corpora
Rui Wang | Chris Callison-Burch
Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

pdf bib
Statistical Machine Transliteration with Multi-to-Multi Joint Source Channel Model
Yu Chen | Rui Wang | Yi Zhang
Proceedings of the 3rd Named Entities Workshop (NEWS 2011)

pdf bib
Engineering a Deep HPSG for Mandarin Chinese
Yi Zhang | Rui Wang | Yu Chen
Proceedings of the 9th Workshop on Asian Language Resources

pdf bib
The ACL Anthology Searchbench
Ulrich Schäfer | Bernd Kiefer | Christian Spurk | Jörg Steffen | Rui Wang
Proceedings of the ACL-HLT 2011 System Demonstrations

2010

pdf bib
MARS: A Specialized RTE System for Parser Evaluation
Rui Wang | Yi Zhang
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Cheap Facts and Counter-Facts
Rui Wang | Chris Callison-Burch
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

pdf bib
Discriminative Parse Reranking for Chinese with Homogeneous and Heterogeneous Annotations
Weiwei Sun | Rui Wang | Yi Zhang
CIPS-SIGHAN Joint Conference on Chinese Language Processing

pdf bib
Constructing a Textual Semantic Relation Corpus Using a Discourse Treebank
Rui Wang | Caroline Sporleder
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present our work on constructing a textual semantic relation corpus by making use of an existing treebank annotated with discourse relations. We extract adjacent text span pairs and group them into six categories according to the different discourse relations between them. After that, we present the details of our annotation scheme, which includes six textual semantic relations, 'backward entailment', 'forward entailment', 'equality', 'contradiction', 'overlapping', and 'independent'. We also discuss some ambiguous examples to show the difficulty of such an annotation task, which cannot be easily done by an automatic mapping between discourse relations and semantic relations. We have two annotators and each of them performs the task twice. The basic statistics on the constructed corpus look promising: we achieve 81.17% agreement on the annotation of the six semantic relations with a .718 kappa score, and it increases to 91.21% if we collapse the last two labels with a .775 kappa score.

pdf bib
Hybrid Constituent and Dependency Parsing with Tsinghua Chinese Treebank
Rui Wang | Yi Zhang
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we describe our hybrid parsing model for Mandarin Chinese processing. In particular, we work on the Tsinghua Chinese Treebank (TCT), whose annotation has both constituents and the head information of each constituent. The model we design combines the mainstream constituent parsing and dependency parsing. We present in detail 1) how to (partially) encode the head information into the constituent parsing, 2) how to encode constituent information into the dependency parsing, and 3) how to restore the head information using the dependency structure. For each of them, we take different strategies to deal with different cases. In an open shared task evaluation, we achieve an F1 score of 85.23% for the constituent parsing, 82.35% with partial head information, and 74.27% with complete head information. The error analysis shows the challenge of restoring multiple-headed constituents and also the potential of using the dependency structure to guide the constituent parsing, which will be our future work to explore.

2009

pdf bib
Cross-Domain Dependency Parsing Using a Deep Linguistic Grammar
Yi Zhang | Rui Wang
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Inference Rules and their Application to Recognizing Textual Entailment
Georgiana Dinu | Rui Wang
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Recognizing Textual Relatedness with Predicate-Argument Structures
Rui Wang | Yi Zhang
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Hybrid Multilingual Parsing with HPSG for SRL
Yi Zhang | Rui Wang | Stephan Oepen
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

pdf bib
Inference Rules for Recognizing Textual Entailment
Georgiana Dinu | Rui Wang
Proceedings of the Eight International Conference on Computational Semantics

2008

pdf bib
Hybrid Learning of Dependency Structures from Heterogeneous Linguistic Resources
Yi Zhang | Rui Wang | Hans Uszkoreit
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

2007

pdf bib
Recognizing Textual Entailment Using Sentence Similarity based on Dependency Tree Skeletons
Rui Wang | Günter Neumann
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing