Wei Xu


2021

pdf bib
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)
Antoine Bosselut | Esin Durmus | Varun Prashant Gangal | Sebastian Gehrmann | Yacine Jernite | Laura Perez-Beltrachini | Samira Shaikh | Wei Xu
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

pdf bib
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Sebastian Gehrmann | Tosin Adewumi | Karmanya Aggarwal | Pawan Sasanka Ammanamanchi | Anuoluwapo Aremu | Antoine Bosselut | Khyathi Raghavi Chandu | Miruna-Adriana Clinciu | Dipanjan Das | Kaustubh Dhole | Wanyu Du | Esin Durmus | Ondřej Dušek | Chris Chinenye Emezue | Varun Gangal | Cristina Garbacea | Tatsunori Hashimoto | Yufang Hou | Yacine Jernite | Harsh Jhamtani | Yangfeng Ji | Shailza Jolly | Mihir Kale | Dhruv Kumar | Faisal Ladhak | Aman Madaan | Mounica Maddela | Khyati Mahajan | Saad Mahamood | Bodhisattwa Prasad Majumder | Pedro Henrique Martins | Angelina McMillan-Major | Simon Mille | Emiel van Miltenburg | Moin Nadeem | Shashi Narayan | Vitaly Nikolaev | Andre Niyongabo Rubungo | Salomey Osei | Ankur Parikh | Laura Perez-Beltrachini | Niranjan Ramesh Rao | Vikas Raunak | Juan Diego Rodriguez | Sashank Santhanam | João Sedoc | Thibault Sellam | Samira Shaikh | Anastasia Shimorina | Marco Antonio Sobrevilla Cabezudo | Hendrik Strobelt | Nishant Subramani | Wei Xu | Diyi Yang | Akhila Yerukola | Jiawei Zhou
Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021)

We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for the 2021 shared task at the associated GEM Workshop.

pdf bib
Controllable Text Simplification with Explicit Paraphrasing
Mounica Maddela | Fernando Alva-Manchego | Wei Xu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Text Simplification improves the readability of sentences through several rewriting transformations, such as lexical paraphrasing, deletion, and splitting. Current simplification systems are predominantly sequence-to-sequence models that are trained end-to-end to perform all these operations simultaneously. However, such systems limit themselves to mostly deleting words and cannot easily adapt to the requirements of different target audiences. In this paper, we propose a novel hybrid approach that leverages linguistically-motivated rules for splitting and deletion, and couples them with a neural paraphrasing model to produce varied rewriting styles. We introduce a new data augmentation method to improve the paraphrasing capability of our model. Through automatic and manual evaluations, we show that our proposed model establishes a new state-of-the-art for the task, paraphrasing more often than the existing systems, and can control the degree of each simplification operation applied to the input texts.

pdf bib
Neural semi-Markov CRF for Monolingual Word Alignment
Wuwei Lan | Chao Jiang | Wei Xu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Monolingual word alignment is important for studying fine-grained editing operations (i.e., deletion, addition, and substitution) in text-to-text generation tasks, such as paraphrase generation, text simplification, neutralizing biased language, etc. In this paper, we present a novel neural semi-Markov CRF alignment model, which unifies word and phrase alignments through variable-length spans. We also create a new benchmark with human annotations that cover four different text genres to evaluate monolingual word alignment models in more realistic settings. Experimental results show that our proposed model outperforms all previous approaches for monolingual word alignment as well as a competitive QA-based baseline, which was previously only applied to bilingual data. Our model demonstrates good generalizability to three out-of-domain datasets and shows great utility in two downstream applications: automatic text simplification and sentence pair classification tasks.

pdf bib
KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization and Completion
Jie Zhou | Shengding Hu | Xin Lv | Cheng Yang | Zhiyuan Liu | Wei Xu | Jie Jiang | Juanzi Li | Maosong Sun
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Revisiting the Evaluation of End-to-end Event Extraction
Shun Zheng | Wei Cao | Wei Xu | Jiang Bian
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Generalizing Natural Language Analysis through Span-relation Representations
Zhengbao Jiang | Wei Xu | Jun Araki | Graham Neubig
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural language processing covers a wide variety of tasks predicting syntax, semantics, and information content, and usually each type of output is generated with specially designed architectures. In this paper, we provide the simple insight that a great variety of tasks can be represented in a single unified format consisting of labeling spans and relations between spans, thus a single task-independent model can be used across different tasks. We perform extensive experiments to test this insight on 10 disparate tasks spanning dependency parsing (syntax), semantic role labeling (semantics), relation extraction (information content), aspect based sentiment analysis (sentiment), and many others, achieving performance comparable to state-of-the-art specialized models. We further demonstrate benefits of multi-task learning, and also show that the proposed method makes it easy to analyze differences and similarities in how the model handles different tasks. Finally, we convert these datasets into a unified format to build a benchmark, which provides a holistic testbed for evaluating future models for generalized natural language analysis.

pdf bib
Code and Named Entity Recognition in StackOverflow
Jeniya Tabassum | Mounica Maddela | Wei Xu | Alan Ritter
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F-1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model. Our code and data are available at: https://github.com/jeniyat/StackOverflowNER/

pdf bib
Neural CRF Model for Sentence Alignment in Text Simplification
Chao Jiang | Mounica Maddela | Wuwei Lan | Yang Zhong | Wei Xu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parallel documents but also utilizes a neural sentence pair model to capture semantic similarity. Experiments demonstrate that our proposed approach outperforms all the previous work on monolingual sentence alignment task by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, Newsela-Auto and Wiki-Auto, which are much larger and of better quality compared to the existing datasets. A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.

pdf bib
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Wei Xu | Alan Ritter | Tim Baldwin | Afshin Rahimi
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

pdf bib
WNUT-2020 Task 1 Overview: Extracting Entities and Relations from Wet Lab Protocols
Jeniya Tabassum | Wei Xu | Alan Ritter
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

This paper presents the results of the wet labinformation extraction task at WNUT 2020.This task consisted of two sub tasks- (1) anamed entity recognition task with 13 partic-ipants; and (2) a relation extraction task with2 participants. We outline the task, data an-notation process, corpus statistics, and providea high-level overview of the participating sys-tems for each sub task.

pdf bib
An Empirical Study of Pre-trained Transformers for Arabic Information Extraction
Wuwei Lan | Yang Chen | Wei Xu | Alan Ritter
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Multilingual pre-trained Transformers, such as mBERT (Devlin et al., 2019) and XLM-RoBERTa (Conneau et al., 2020a), have been shown to enable effective cross-lingual zero-shot transfer. However, their performance on Arabic information extraction (IE) tasks is not very well studied. In this paper, we pre-train a customized bilingual BERT, dubbed GigaBERT, that is designed specifically for Arabic NLP and English-to-Arabic zero-shot transfer learning. We study GigaBERT’s effectiveness on zero-short transfer across four IE tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction. Our best model significantly outperforms mBERT, XLM-RoBERTa, and AraBERT (Antoun et al., 2020) in both the supervised and zero-shot transfer settings. We have made our pre-trained models publicly available at: https://github.com/lanwuwei/GigaBERT.

2019

pdf bib
Doc2EDAG: An End-to-End Document-level Framework for Chinese Financial Event Extraction
Shun Zheng | Wei Cao | Wei Xu | Jiang Bian
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Most existing event extraction (EE) methods merely extract event arguments within the sentence scope. However, such sentence-level EE methods struggle to handle soaring amounts of documents from emerging applications, such as finance, legislation, health, etc., where event arguments always scatter across different sentences, and even multiple such event mentions frequently co-exist in the same document. To address these challenges, we propose a novel end-to-end model, Doc2EDAG, which can generate an entity-based directed acyclic graph to fulfill the document-level EE (DEE) effectively. Moreover, we reformalize a DEE task with the no-trigger-words design to ease the document-level event labeling. To demonstrate the effectiveness of Doc2EDAG, we build a large-scale real-world dataset consisting of Chinese financial announcements with the challenges mentioned above. Extensive experiments with comprehensive analyses illustrate the superiority of Doc2EDAG over state-of-the-art methods. Data and codes can be found at https://github.com/dolphin-zs/Doc2EDAG.

pdf bib
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
Wei Xu | Alan Ritter | Tim Baldwin | Afshin Rahimi
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)

pdf bib
DIAG-NRE: A Neural Pattern Diagnosis Framework for Distantly Supervised Neural Relation Extraction
Shun Zheng | Xu Han | Yankai Lin | Peilin Yu | Lu Chen | Ling Huang | Zhiyuan Liu | Wei Xu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Pattern-based labeling methods have achieved promising results in alleviating the inevitable labeling noises of distantly supervised neural relation extraction. However, these methods require significant expert labor to write relation-specific patterns, which makes them too sophisticated to generalize quickly. To ease the labor-intensive workload of pattern writing and enable the quick generalization to new relation types, we propose a neural pattern diagnosis framework, DIAG-NRE, that can automatically summarize and refine high-quality relational patterns from noise data with human experts in the loop. To demonstrate the effectiveness of DIAG-NRE, we apply it to two real-world datasets and present both significant and interpretable improvements over state-of-the-art methods.

pdf bib
Multi-task Pairwise Neural Ranking for Hashtag Segmentation
Mounica Maddela | Wei Xu | Daniel Preoţiuc-Pietro
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Hashtags are often employed on social media and beyond to add metadata to a textual utterance with the goal of increasing discoverability, aiding search, or providing additional semantics. However, the semantic content of hashtags is not straightforward to infer as these represent ad-hoc conventions which frequently include multiple words joined together and can include abbreviations and unorthodox spellings. We build a dataset of 12,594 hashtags split into individual segments and propose a set of approaches for hashtag segmentation by framing it as a pairwise ranking problem between candidate segmentations. Our novel neural approaches demonstrate 24.6% error reduction in hashtag segmentation accuracy compared to the current state-of-the-art method. Finally, we demonstrate that a deeper understanding of hashtag semantics obtained through segmentation is useful for downstream applications such as sentiment analysis, for which we achieved a 2.6% increase in average recall on the SemEval 2017 sentiment analysis dataset.

2018

pdf bib
Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering
Wuwei Lan | Wei Xu
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we analyze several neural network designs (and their variations) for sentence pair modeling and compare their performance extensively across eight datasets, including paraphrase identification, semantic textual similarity, natural language inference, and question answering tasks. Although most of these models have claimed state-of-the-art performance, the original papers often reported on only one or two selected datasets. We provide a systematic study and show that (i) encoding contextual information by LSTM and inter-sentence interactions are critical, (ii) Tree-LSTM does not help as much as previously claimed but surprisingly improves performance on Twitter datasets, (iii) the Enhanced Sequential Inference Model is the best so far for larger datasets, while the Pairwise Word Interaction Model achieves the best performance when less data is available. We release our implementations as an open-source toolkit.

pdf bib
An Annotated Corpus for Machine Reading of Instructions in Wet Lab Protocols
Chaitanya Kulkarni | Wei Xu | Alan Ritter | Raghu Machiraju
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We describe an effort to annotate a corpus of natural language instructions consisting of 622 wet lab protocols to facilitate automatic or semi-automatic conversion of protocols into a machine-readable format and benefit biological research. Experimental results demonstrate the utility of our corpus for developing machine learning approaches to shallow semantic parsing of instructional texts. We make our annotated Wet Lab Protocol Corpus available to the research community.

pdf bib
Character-Based Neural Networks for Sentence Pair Modeling
Wuwei Lan | Wei Xu
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Sentence pair modeling is critical for many NLP tasks, such as paraphrase identification, semantic textual similarity, and natural language inference. Most state-of-the-art neural models for these tasks rely on pretrained word embedding and compose sentence-level semantics in varied ways; however, few works have attempted to verify whether we really need pretrained embeddings in these tasks. In this paper, we study how effective subword-level (character and character n-gram) representations are in sentence pair modeling. Though it is well-known that subword models are effective in tasks with single sentence input, including language modeling and machine translation, they have not been systematically studied in sentence pair modeling tasks where the semantic and string similarities between texts matter. Our experiments show that subword models without any pretrained word embedding can achieve new state-of-the-art results on two social media datasets and competitive results on news data for paraphrase identification.

pdf bib
Interactive Language Acquisition with One-shot Visual Concept Learning through a Conversational Game
Haichao Zhang | Haonan Yu | Wei Xu
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Building intelligent agents that can communicate with and learn from humans in natural language is of great value. Supervised language learning is limited by the ability of capturing mainly the statistics of training data, and is hardly adaptive to new scenarios or flexible for acquiring new knowledge without inefficient retraining or catastrophic forgetting. We highlight the perspective that conversational interaction serves as a natural interface both for language learning and for novel knowledge acquisition and propose a joint imitation and reinforcement approach for grounded language learning through an interactive conversational game. The agent trained with this approach is able to actively acquire information by asking questions about novel objects and use the just-learned knowledge in subsequent conversations in a one-shot fashion. Results compared with other methods verified the effectiveness of the proposed approach.

pdf bib
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text
Wei Xu | Alan Ritter | Tim Baldwin | Afshin Rahimi
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

pdf bib
A Word-Complexity Lexicon and A Neural Readability Ranking Model for Lexical Simplification
Mounica Maddela | Wei Xu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Current lexical simplification approaches rely heavily on heuristics and corpus level features that do not always align with human judgment. We create a human-rated word-complexity lexicon of 15,000 English words and propose a novel neural readability ranking model with a Gaussian-based feature vectorization layer that utilizes these human ratings to measure the complexity of any given word or phrase. Our model performs better than the state-of-the-art systems for different lexical simplification tasks and evaluation datasets. Additionally, we also produce SimplePPDB++, a lexical resource of over 10 million simplifying paraphrase rules, by applying our model to the Paraphrase Database (PPDB).

2017

pdf bib
Proceedings of the 3rd Workshop on Noisy User-generated Text
Leon Derczynski | Wei Xu | Alan Ritter | Tim Baldwin
Proceedings of the 3rd Workshop on Noisy User-generated Text

pdf bib
From Shakespeare to Twitter: What are Language Styles all about?
Wei Xu
Proceedings of the Workshop on Stylistic Variation

As natural language processing research is growing and largely driven by the availability of data, we expanded research from news and small-scale dialog corpora to web and social media. User-generated data and crowdsourcing opened the door for investigating human language of various styles with more statistical power and real-world applications. In this position/survey paper, I will review and discuss seven language styles that I believe to be important and interesting to study: influential work in the past, challenges at the present, and potential impact for the future.

pdf bib
A Continuously Growing Dataset of Sentential Paraphrases
Wuwei Lan | Siyu Qiu | Hua He | Wei Xu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity, as it gets rid of the classifier or human in the loop needed to select data before annotation and subsequent application of paraphrase identification algorithms in the previous work. We present the largest human-labeled paraphrase corpus to date of 51,524 sentence pairs and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available.

2016

pdf bib
TweeTime : A Minimally Supervised Method for Recognizing and Normalizing Time Expressions in Twitter
Jeniya Tabassum | Alan Ritter | Wei Xu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
CFO: Conditional Focused Neural Question Answering with Large-scale Knowledge Bases
Zihang Dai | Lei Li | Wei Xu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Semi-Supervised Learning for Neural Machine Translation
Yong Cheng | Wei Xu | Zhongjun He | Wei He | Hua Wu | Maosong Sun | Yang Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
Bo Han | Alan Ritter | Leon Derczynski | Wei Xu | Tim Baldwin
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

pdf bib
Results of the WNUT16 Named Entity Recognition Shared Task
Benjamin Strauss | Bethany Toma | Alan Ritter | Marie-Catherine de Marneffe | Wei Xu
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

This paper presents the results of the Twitter Named Entity Recognition shared task associated with W-NUT 2016: a named entity tagging task with 10 teams participating. We outline the shared task, annotation process and dataset statistics, and provide a high-level overview of the participating systems for each shared task.

pdf bib
Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation
Jie Zhou | Ying Cao | Xuguang Wang | Peng Li | Wei Xu
Transactions of the Association for Computational Linguistics, Volume 4

Neural machine translation (NMT) aims at solving machine translation (MT) problems using neural networks and has exhibited promising results in recent years. However, most of the existing NMT models are shallow and there is still a performance gap between a single NMT model and the best conventional MT system. In this work, we introduce a new type of linear connections, named fast-forward connections, based on deep Long Short-Term Memory (LSTM) networks, and an interleaved bi-directional architecture for stacking the LSTM layers. Fast-forward connections play an essential role in propagating the gradients and building a deep topology of depth 16. On the WMT’14 English-to-French task, we achieve BLEU=37.7 with a single attention model, which outperforms the corresponding single shallow model by 6.2 BLEU points. This is the first time that a single NMT model achieves state-of-the-art performance and outperforms the best conventional model by 0.7 BLEU points. We can still achieve BLEU=36.3 even without using an attention mechanism. After special handling of unknown words and model ensembling, we obtain the best score reported to date on this task with BLEU=40.4. Our models are also validated on the more difficult WMT’14 English-to-German task.

pdf bib
Optimizing Statistical Machine Translation for Text Simplification
Wei Xu | Courtney Napoles | Ellie Pavlick | Quanze Chen | Chris Callison-Burch
Transactions of the Association for Computational Linguistics, Volume 4

Most recent sentence simplification systems use basic machine translation models to learn lexical and syntactic paraphrases from a manually simplified parallel corpus. These methods are limited by the quality and quantity of manually simplified corpora, which are expensive to build. In this paper, we conduct an in-depth adaptation of statistical machine translation to perform text simplification, taking advantage of large-scale paraphrases learned from bilingual texts and a small amount of manual simplifications with multiple references. Our work is the first to design automatic metrics that are effective for tuning and evaluating simplification systems, which will facilitate iterative development for this task.

2015

pdf bib
Proceedings of the Workshop on Noisy User-generated Text
Wei Xu | Bo Han | Alan Ritter
Proceedings of the Workshop on Noisy User-generated Text

pdf bib
Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition
Timothy Baldwin | Marie Catherine de Marneffe | Bo Han | Young-Bum Kim | Alan Ritter | Wei Xu
Proceedings of the Workshop on Noisy User-generated Text

pdf bib
SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT)
Wei Xu | Chris Callison-Burch | Bill Dolan
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib
End-to-end learning of semantic role labeling using recurrent neural networks
Jie Zhou | Wei Xu
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Cost Optimization in Crowdsourcing Translation: Low cost translations made even cheaper
Mingkun Gao | Wei Xu | Chris Callison-Burch
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Problems in Current Text Simplification Research: New Data Can Help
Wei Xu | Chris Callison-Burch | Courtney Napoles
Transactions of the Association for Computational Linguistics, Volume 3

Simple Wikipedia has dominated simplification research in the past 5 years. In this opinion paper, we argue that focusing on Wikipedia limits simplification research. We back up our arguments with corpus analysis and by highlighting statements that other researchers have made in the simplification literature. We introduce a new simplification dataset that is a significant improvement over Simple Wikipedia, and present a novel quantitative-comparative approach to study the quality of simplification data resources.

2014

pdf bib
Extracting Lexically Divergent Paraphrases from Twitter
Wei Xu | Alan Ritter | Chris Callison-Burch | William B. Dolan | Yangfeng Ji
Transactions of the Association for Computational Linguistics, Volume 2

We present MultiP (Multi-instance Learning Paraphrase Model), a new model suited to identify paraphrases within the short messages on Twitter. We jointly model paraphrase relations between word and sentence pairs and assume only sentence-level annotations during learning. Using this principled latent variable model alone, we achieve the performance competitive with a state-of-the-art method which combines a latent space model with a feature-based supervised classifier. Our model also captures lexically divergent paraphrases that differ from yet complement previous methods; combining our model with previous work significantly outperforms the state-of-the-art. In addition, we present a novel annotation methodology that has allowed us to crowdsource a paraphrase corpus from Twitter. We make this new dataset available to the research community.

pdf bib
Infusion of Labeled Data into Distant Supervision for Relation Extraction
Maria Pershina | Bonan Min | Wei Xu | Ralph Grishman
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

pdf bib
A Preliminary Study of Tweet Summarization using Information Extraction
Wei Xu | Ralph Grishman | Adam Meyers | Alan Ritter
Proceedings of the Workshop on Language Analysis in Social Media

pdf bib
Gathering and Generating Paraphrases from Twitter with Application to Normalization
Wei Xu | Alan Ritter | Ralph Grishman
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

pdf bib
Filling Knowledge Base Gaps for Distant Supervision of Relation Extraction
Wei Xu | Raphael Hoffmann | Le Zhao | Ralph Grishman
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Paraphrasing for Style
Wei Xu | Alan Ritter | Bill Dolan | Ralph Grishman | Colin Cherry
Proceedings of COLING 2012

2011

pdf bib
Exploiting Syntactic and Distributional Information for Spelling Correction with Web-Scale N-gram Models
Wei Xu | Joel Tetreault | Martin Chodorow | Ralph Grishman | Le Zhao
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Passage Retrieval for Information Extraction using Distant Supervision
Wei Xu | Ralph Grishman | Le Zhao
Proceedings of 5th International Joint Conference on Natural Language Processing

2009

pdf bib
Who, What, When, Where, Why? Comparing Multiple Approaches to the Cross-Lingual 5W Task
Kristen Parton | Kathleen R. McKeown | Bob Coyne | Mona T. Diab | Ralph Grishman | Dilek Hakkani-Tür | Mary Harper | Heng Ji | Wei Yun Ma | Adam Meyers | Sara Stolbach | Ang Sun | Gokhan Tur | Wei Xu | Sibel Yaman
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Automatic Recognition of Logical Relations for English, Chinese and Japanese in the GLARF Framework
Adam Meyers | Michiko Kosaka | Nianwen Xue | Heng Ji | Ang Sun | Shasha Liao | Wei Xu
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

pdf bib
A Parse-and-Trim Approach with Information Significance for Chinese Sentence Compression
Wei Xu | Ralph Grishman
Proceedings of the 2009 Workshop on Language Generation and Summarisation (UCNLG+Sum 2009)

pdf bib
Transducing Logical Relations from Automatic and Manual GLARF
Adam Meyers | Michiko Kosaka | Heng Ji | Nianwen Xue | Mary Harper | Ang Sun | Wei Xu | Shasha Liao
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

2007

pdf bib
Using Non-Local Features to Improve Named Entity Recognition Recall
Xinnian Mao | Wei Xu | Yuan Dong | Saike He | Haila Wang
Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation

2006

pdf bib
Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li | Mingli Wu | Qin Lu | Wei Xu | Chunfa Yuan
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2000

pdf bib
Task-based dialog management using an agenda
Wei Xu | Alexander I. Rudnicky
ANLP-NAACL 2000 Workshop: Conversational Systems

Search
Co-authors