Wenjie Li


2021

Event Graph based Sentence Fusion
Ruifeng Yuan | Zili Wang | Wenjie Li
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence fusion is a conditional generation task that merges several related sentences into a coherent one, which can be deemed a summary sentence. The importance of sentence fusion has long been recognized by communities in natural language generation, especially in text summarization. It remains challenging for a state-of-the-art neural abstractive summarization model to generate a well-integrated summary sentence. In this paper, we explore an effective sentence fusion method in the context of text summarization. We propose to build an event graph from the input sentences to effectively capture and organize related events in a structured way and use the constructed event graph to guide sentence fusion. In addition to making use of the attention over the content of sentences and graph nodes, we further develop a graph flow attention mechanism to control the fusion process via the graph structure. When evaluated on sentence fusion data built from two summarization datasets, CNN/DailyMail and Multi-News, our model achieves state-of-the-art performance in terms of ROUGE and other metrics such as fusion rate and faithfulness.

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Effect Generation Based on Causal Reasoning
Feiteng Mu | Wenjie Li | Zhipeng Xie
Findings of the Association for Computational Linguistics: EMNLP 2021

Causal reasoning aims to predict the future scenarios that may be caused by the observed actions. However, existing causal reasoning methods deal with causalities on the word level. In this paper, we propose a novel event-level causal reasoning method and demonstrate its use in the task of effect generation. In particular, we structuralize the observed cause-effect event pairs into an event causality network, which describes causality dependencies. Given an input cause sentence, a causal subgraph is retrieved from the event causality network and is encoded with the graph attention mechanism, in order to support better reasoning of the potential effects. The most probable effect event is then selected from the causal subgraph and is used as guidance to generate an effect sentence. Experiments show that our method generates more reasonable effect sentences than various well-designed competitors.

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

PolyU CBS-Comp at SemEval-2021 Task 1: Lexical Complexity Prediction (LCP)
Rong Xiang | Jinghang Gu | Emmanuele Chersoni | Wenjie Li | Qin Lu | Chu-Ren Huang
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this contribution, we describe the system presented by the PolyU CBS-Comp Team at Task 1 of SemEval 2021, where the goal was to estimate the complexity of words in a given sentence context. Our top system, based on a combination of lexical, syntactic, word-embedding and Transformer-derived features and on a Gradient Boosting Regressor, achieves a top correlation score of 0.754 on subtask 1 for single words and 0.659 on subtask 2 for multiword expressions.

2020

Fact-level Extractive Summarization with Hierarchical Graph Mask on BERT
Ruifeng Yuan | Zili Wang | Wenjie Li
Proceedings of the 28th International Conference on Computational Linguistics

Most current extractive summarization models generate summaries by selecting salient sentences. However, one of the problems with sentence-level extractive summarization is that there exists a gap between the human-written gold summary and the oracle sentence labels. In this paper, we propose to extract fact-level semantic units for better extractive summarization. We also introduce a hierarchical structure, which incorporates multiple levels of granularity of the textual information into the model. In addition, we combine our model with BERT using a hierarchical graph mask. This allows us to combine BERT’s ability in natural language understanding with the structural information without increasing the scale of the model. Experiments on the CNN/DailyMail dataset show that our model achieves state-of-the-art results.

2019

Jointly Learning Semantic Parser and Natural Language Generator via Dual Information Maximization
Hai Ye | Wenjie Li | Lu Wang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Semantic parsing aims to transform natural language (NL) utterances into formal meaning representations (MRs), whereas an NL generator achieves the reverse: producing an NL description for some given MRs. Despite this intrinsic connection, the two tasks are often studied separately in prior work. In this paper, we model the duality of these two tasks via a joint learning framework, and demonstrate its effectiveness in boosting the performance on both tasks. Concretely, we propose a novel method of dual information maximization (DIM) to regularize the learning process, where DIM empirically maximizes the variational lower bounds of expected joint distributions of NL and MRs. We further extend DIM to a semi-supervision setup (SemiDIM), which leverages unlabeled data of both tasks. Experiments on three datasets of dialogue management and code generation (and summarization) show that performance on both semantic parsing and NL generation can be consistently improved by DIM, in both supervised and semi-supervised setups.

2018

Query and Output: Generating Words by Querying Distributed Word Representations for Paraphrase Generation
Shuming Ma | Xu Sun | Wei Li | Sujian Li | Wenjie Li | Xuancheng Ren
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Most recent approaches use the sequence-to-sequence model for paraphrase generation. The existing sequence-to-sequence model tends to memorize the words and the patterns in the training dataset instead of learning the meaning of the words. Therefore, the generated sentences are often grammatically correct but semantically improper. In this work, we introduce a novel model based on the encoder-decoder framework, called Word Embedding Attention Network (WEAN). Our proposed model generates the words by querying distributed word representations (i.e. neural word embeddings), hoping to capture the meaning of the corresponding words. Following previous work, we evaluate our model on two paraphrase-oriented tasks, namely text simplification and short text abstractive summarization. Experimental results show that our model outperforms the sequence-to-sequence baseline by BLEU scores of 6.3 and 5.5 on two English text simplification datasets, and by a ROUGE-2 F1 score of 5.7 on a Chinese summarization dataset. Moreover, our model achieves state-of-the-art performance on these three benchmark datasets.

Retrieve, Rerank and Rewrite: Soft Template Based Neural Summarization
Ziqiang Cao | Wenjie Li | Sujian Li | Furu Wei
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most previous seq2seq summarization systems purely depend on the source text to generate summaries, which tends to work unstably. Inspired by the traditional template-based summarization approaches, this paper proposes to use existing summaries as soft templates to guide the seq2seq model. To this end, we use a popular IR platform to Retrieve proper summaries as candidate templates. Then, we extend the seq2seq framework to jointly conduct template Reranking and template-aware summary generation (Rewriting). Experiments show that, in terms of informativeness, our model significantly outperforms the state-of-the-art methods, and even soft templates themselves demonstrate high competitiveness. In addition, the import of high-quality external summaries improves the stability and readability of generated summaries.

Unpaired Sentiment-to-Sentiment Translation: A Cycled Reinforcement Learning Approach
Jingjing Xu | Xu Sun | Qi Zeng | Xiaodong Zhang | Xuancheng Ren | Houfeng Wang | Wenjie Li
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The goal of sentiment-to-sentiment “translation” is to change the underlying sentiment of a sentence while keeping its content. The main challenge is the lack of parallel data. To solve this problem, we propose a cycled reinforcement learning method that enables training on unpaired data by collaboration between a neutralization module and an emotionalization module. We evaluate our approach on two review datasets, Yelp and Amazon. Experimental results show that our approach significantly outperforms the state-of-the-art systems. Especially, the proposed method substantially improves the content preservation performance. The BLEU score is improved from 1.64 to 22.46 and from 0.56 to 14.06 on the two datasets, respectively.

Variational Autoregressive Decoder for Neural Response Generation
Jiachen Du | Wenjie Li | Yulan He | Ruifeng Xu | Lidong Bing | Xuan Wang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Combining the virtues of probabilistic graphical models and neural networks, the Conditional Variational Auto-encoder (CVAE) has shown promising performance in applications such as response generation. However, existing CVAE-based models often generate responses from a single latent variable, which may not be sufficient to model high variability in responses. To solve this problem, we propose a novel model that sequentially introduces a series of latent variables to condition the generation of each word in the response sequence. In addition, the approximate posteriors of these latent variables are augmented with a backward Recurrent Neural Network (RNN), which allows the latent variables to capture long-term dependencies of future tokens in generation. To facilitate training, we supplement our model with an auxiliary objective that predicts the subsequent bag of words. Empirical experiments conducted on Opensubtitle and Reddit datasets show that the proposed model leads to significant improvement on both relevance and diversity over state-of-the-art baselines.

NEXUS Network: Connecting the Preceding and the Following in Dialogue Generation
Xiaoyu Shen | Hui Su | Wenjie Li | Dietrich Klakow
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Sequence-to-Sequence (seq2seq) models have become overwhelmingly popular in building end-to-end trainable dialogue systems. Though highly efficient in learning the backbone of human-computer communications, they suffer from the problem of strongly favoring short generic responses. In this paper, we argue that a good response should smoothly connect both the preceding dialogue history and the following conversations. We strengthen this connection by mutual information maximization. To sidestep the non-differentiability of discrete natural language tokens, we introduce an auxiliary continuous code space and map such code space to a learnable prior distribution for generation purpose. Experiments on two dialogue datasets validate the effectiveness of our model, where the generated responses are closely related to the dialogue context and lead to more interactive conversations.

2017

DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset
Yanran Li | Hui Su | Xiaoyu Shen | Wenjie Li | Ziqiang Cao | Shuzi Niu
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We develop a high-quality multi-turn dialog dataset, DailyDialog, which is intriguing in several aspects. The language is human-written and less noisy. The dialogues in the dataset reflect the way we communicate in daily life and cover various everyday topics. We also manually label the developed dataset with communication intention and emotion information. Then, we evaluate existing approaches on the DailyDialog dataset and hope it benefits the research field of dialog systems. The dataset is available at http://yanran.li/dailydialog

Determining Gains Acquired from Word Embedding Quantitatively Using Discrete Distribution Clustering
Jianbo Ye | Yanran Li | Zhaohui Wu | James Z. Wang | Wenjie Li | Jia Li
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word embeddings have become widely used in document analysis. While a large number of models for mapping words to vector spaces have been developed, it remains undetermined how much net gain can be achieved over traditional approaches based on bag-of-words. In this paper, we propose a new document clustering approach by combining any word embedding with a state-of-the-art algorithm for clustering empirical distributions. By using the Wasserstein distance between distributions, the word-to-word semantic relationship is taken into account in a principled way. The new clustering method is easy to use and consistently outperforms other methods on a variety of data sets. More importantly, the method provides an effective framework for determining when and how much word embeddings contribute to document analysis. Experimental results with multiple embedding models are reported.

A Conditional Variational Framework for Dialog Generation
Xiaoyu Shen | Hui Su | Yanran Li | Wenjie Li | Shuzi Niu | Yang Zhao | Akiko Aizawa | Guoping Long
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Deep latent variable models have been shown to facilitate response generation for open-domain dialog systems. However, these latent variables are highly randomized, leading to uncontrollable generated responses. In this paper, we propose a framework allowing conditional response generation based on specific attributes. These attributes can be either manually assigned or automatically detected. Moreover, the dialog states for both speakers are modeled separately in order to reflect personal features. We validate this framework on two different scenarios, where the attribute refers to genericness and sentiment states respectively. The experimental results demonstrate the potential of our model, where meaningful responses can be generated in accordance with the specified attributes.

Improving Semantic Relevance for Sequence-to-Sequence Learning of Chinese Social Media Text Summarization
Shuming Ma | Xu Sun | Jingjing Xu | Houfeng Wang | Wenjie Li | Qi Su
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Current Chinese social media text summarization models are based on an encoder-decoder framework. Although the generated summaries are literally similar to the source texts, they have low semantic relevance. In this work, our goal is to improve semantic relevance between source texts and summaries for Chinese social media summarization. We introduce a Semantic Relevance Based neural model to encourage high semantic similarity between texts and summaries. In our model, the source text is represented by a gated attention encoder, while the summary representation is produced by a decoder. In addition, the similarity score between the representations is maximized during training. Our experiments show that the proposed model outperforms baseline systems on a social media corpus.

2016

PolyU at CL-SciSumm 2016
Ziqiang Cao | Wenjie Li | Dapeng Wu
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

AttSum: Joint Learning of Focusing and Summarization with Neural Attention
Ziqiang Cao | Wenjie Li | Sujian Li | Furu Wei | Yanran Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Query relevance ranking and sentence saliency ranking are the two main tasks in extractive query-focused summarization. Previous supervised summarization systems often perform the two tasks in isolation. However, since reference summaries are the trade-off between relevance and saliency, using them as supervision, neither of the two rankers could be trained well. This paper proposes a novel summarization system called AttSum, which tackles the two tasks jointly. It automatically learns distributed representations for sentences as well as the document cluster. Meanwhile, it applies the attention mechanism to simulate the attentive reading of human behavior when a query is given. Extensive experiments are conducted on DUC query-focused summarization benchmark datasets. Without using any hand-crafted features, AttSum achieves competitive performance. We also observe that the sentences recognized to focus on the query indeed meet the query need.

Content-based Influence Modeling for Opinion Behavior Prediction
Chengyao Chen | Zhitao Wang | Yu Lei | Wenjie Li
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Nowadays, social media has become a popular platform for companies to understand their customers. It provides valuable opportunities to gain new insights into how a person’s opinion about a product is influenced by his or her friends. Though various approaches have been proposed to study the opinion formation problem, they all formulate opinions as derived sentiment values, either discrete or continuous, without considering the semantic information. In this paper, we propose a Content-based Social Influence Model to study the implicit mechanism underlying the change of opinions. We then apply the learned model to predict users’ future opinions. The advantage of the proposed model is its ability to handle the semantic information and to learn two influence components: the opinion influence of the content information and the social relation factors. In the experiments conducted on Twitter datasets, our model significantly outperforms other popular opinion formation models.

Emotion Corpus Construction Based on Selection from Hashtags
Minglei Li | Yunfei Long | Lu Qin | Wenjie Li
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The availability of labelled corpora is of great importance for supervised learning in emotion classification tasks. Because it is time-consuming to manually label text, hashtags have been used as naturally annotated labels to obtain a large amount of labelled training data from microblogs. However, natural hashtags contain too much noise to be used directly in learning algorithms. In this paper, we design a three-stage semi-automatic method to construct an emotion corpus from microblogs. Firstly, a lexicon-based voting approach is used to verify the hashtag automatically. Secondly, an SVM-based classifier is used to select the data whose natural labels are consistent with the predicted labels. Finally, the remaining data are manually examined to filter out the noisy data. Out of about 48K filtered Chinese microblogs, 39K microblogs are selected to form the final corpus, with the Kappa value reaching over 0.92 for the automatic parts and over 0.81 for the manual part. The proportion of automatic selection reaches 54.1%. Thus, the method can reduce about 44.5% of the manual workload for acquiring quality data. An experiment on a classifier trained on this corpus shows that it achieves results comparable to the manually annotated NLP&CC2013 corpus.

2015

Learning to Adapt Credible Knowledge in Cross-lingual Sentiment Analysis
Qiang Chen | Wenjie Li | Yu Lei | Xule Liu | Yanxiang He
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A Hierarchical Knowledge Representation for Expert Finding on Social Media
Yanran Li | Wenjie Li | Sujian Li
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Learning Summary Prior Representation for Extractive Summarization
Ziqiang Cao | Furu Wei | Sujian Li | Wenjie Li | Ming Zhou | Houfeng Wang
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Component-Enhanced Chinese Character Embeddings
Yanran Li | Wenjie Li | Fei Sun | Sujian Li
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Reinforcing the Topic of Embeddings with Theta Pure Dependence for Text Classification
Ning Xing | Yuexian Hou | Peng Zhang | Wenjie Li | Dawei Song
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Cross-lingual Sentiment Lexicon Learning With Bilingual Word Graph Label Propagation
Dehong Gao | Furu Wei | Wenjie Li | Xiaohua Liu | Ming Zhou
Computational Linguistics, Volume 41, Issue 1 - March 2015

2014

Feature-Frequency–Adaptive On-line Training for Fast and Accurate Natural Language Processing
Xu Sun | Wenjie Li | Houfeng Wang | Qin Lu
Computational Linguistics, Volume 40, Issue 3 - September 2014

Text-level Discourse Dependency Parsing
Sujian Li | Liang Wang | Ziqiang Cao | Wenjie Li
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

Sequential Summarization: A New Application for Timely Updated Twitter Trending Topics
Dehong Gao | Wenjie Li | Renxian Zhang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Generalized Abbreviation Prediction with Negative Full Forms and Its Application on Improving Chinese Web Search
Xu Sun | Wenjie Li | Fanqi Meng | Houfeng Wang
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

Towards Scalable Speech Act Recognition in Twitter: Tackling Insufficient Training Data
Renxian Zhang | Dehong Gao | Wenjie Li
Proceedings of the Workshop on Semantic Analysis in Social Media

The CIPS-SIGHAN CLP 2012 Chinese Word Segmentation on MicroBlog Corpora Bakeoff
Huiming Duan | Zhifang Sui | Ye Tian | Wenjie Li
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

Implicit Discourse Relation Recognition by Selecting Typical Training Examples
Xun Wang | Sujian Li | Jiwei Li | Wenjie Li
Proceedings of COLING 2012

Fine-Grained Classification of Named Entities by Fusing Multi-Features
Wenjie Li | Jiwei Li | Ye Tian | Zhifang Sui
Proceedings of COLING 2012: Posters

Efficient Feedback-based Feature Learning for Blog Distillation as a Terabyte Challenge
Dehong Gao | Wenjie Li | Renxian Zhang
Proceedings of COLING 2012: Demonstration Papers

Beyond Twitter Text: A Preliminary Study on Twitter Hyperlink and its Application
Dehong Gao | Wenjie Li | Renxian Zhang
Proceedings of COLING 2012: Demonstration Papers

Fast Online Training with Frequency-Adaptive Learning Rates for Chinese Word Segmentation and New Word Detection
Xu Sun | Houfeng Wang | Wenjie Li
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2011

Simultaneous Clustering and Noise Detection for Theme-based Summarization
Xiaoyan Cai | Renxian Zhang | Dehong Gao | Wenjie Li
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

273. Task 5. Keyphrase Extraction Based on Core Word Identification and Word Expansion
You Ouyang | Wenjie Li | Renxian Zhang
Proceedings of the 5th International Workshop on Semantic Evaluation

A Semi-Supervised Key Phrase Extraction Approach: Learning from Title Phrases through a Document Semantic Network
Decong Li | Sujian Li | Wenjie Li | Wei Wang | Weiguang Qu
Proceedings of the ACL 2010 Conference Short Papers

Simultaneous Ranking and Clustering of Sentences: A Reinforcement Approach to Multi-Document Summarization
Xiaoyan Cai | Wenjie Li | You Ouyang | Hong Yan
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

A Study on Position Information in Document Summarization
You Ouyang | Wenjie Li | Qin Lu | Renxian Zhang
Coling 2010: Posters

Sentence Ordering with Event-Enriched Semantics and Two-Layered Clustering for Multi-Document News Summarization
Renxian Zhang | Wenjie Li | Qin Lu
Coling 2010: Posters

Using Deep Belief Nets for Chinese Named Entity Categorization
Yu Chen | You Ouyang | Wenjie Li | Dequan Zheng | Tiejun Zhao
Proceedings of the 2010 Named Entities Workshop

Exploring Deep Belief Network for Chinese Relation Extraction
Yu Chen | Wenjie Li | Yan Liu | Dequan Zheng | Tiejun Zhao
CIPS-SIGHAN Joint Conference on Chinese Language Processing

The Chinese Persons Name Disambiguation Evaluation: Exploration of Personal Name Disambiguation in Chinese News
Ying Chen | Peng Jin | Wenjie Li | Chu-Ren Huang
CIPS-SIGHAN Joint Conference on Chinese Language Processing

2009

An Integrated Multi-document Summarization Approach based on Word Hierarchical Representation
You Ouyang | Wenjie Li | Qin Lu
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Co-Feedback Ranking for Query-Focused Summarization
Furu Wei | Wenjie Li | Yanxiang He
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

Chinese Core Ontology Construction from a Bilingual Term Bank
Yirong Chen | Qin Lu | Wenjie Li | Gaoying Cui
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

A core ontology is a mid-level ontology which bridges the gap between an upper ontology and a domain ontology. Automatic Chinese core ontology construction can help quickly model domain knowledge. A graph based core ontology construction algorithm (COCA) is proposed to automatically construct a core ontology from an English-Chinese bilingual term bank. This algorithm computes the mapping strength from a selected Chinese term to WordNet synset with association to an upper-level SUMO concept. The strength is measured using a graph model integrated with several mapping features from multiple information sources. The features include multiple translation feature between Chinese core term and WordNet, extended string feature and Part-of-Speech feature. Evaluation of COCA repeated on an English-Chinese bilingual Term bank with more than 130K entries shows that the algorithm is improved in performance compared with our previous research and can better serve the semi-automatic construction of mid-level ontology.

Corpus Exploitation from Wikipedia for Ontology Construction
Gaoying Cui | Qin Lu | Wenjie Li | Yirong Chen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Ontology construction usually requires a domain-specific corpus for building the corresponding concept hierarchy. The domain corpus must have good coverage of domain knowledge. Wikipedia (Wiki), the world’s largest online encyclopaedic knowledge source, is open-content, collaboratively edited, and free of charge. It covers millions of articles and continues to expand. These characteristics make Wiki a good candidate as a domain corpus resource in ontology construction. However, the selected article collection must have considerable quality and quantity. In this paper, a novel approach is proposed to identify articles in Wiki as a domain-specific corpus by using available classification information in Wiki pages. The main idea is to generate a domain hierarchy from the hyperlinked pages of Wiki. Only articles strongly linked to this hierarchy are selected as the domain corpus. The proposed approach makes use of linked category information in Wiki pages to produce the hierarchy as a directed graph for obtaining a set of pages in the same connected branch. Ranking and filtering are then done on these pages based on the classification tree generated by the traversal algorithm. The experiment and evaluation results show that Wiki is a good resource for acquiring a relatively high-quality domain-specific corpus for ontology construction.

Exploiting the Role of Position Feature in Chinese Relation Extraction
Peng Zhang | Wenjie Li | Furu Wei | Qin Lu | Yuexian Hou
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Relation extraction is the task of finding pre-defined semantic relations between two entities or entity mentions from text. Many methods, such as feature-based and kernel-based methods, have been proposed in the literature. Among them, feature-based methods draw much attention from researchers. However, to the best of our knowledge, existing feature-based methods did not explicitly incorporate the position feature and no in-depth analysis was conducted in this regard. In this paper, we define and exploit nine types of position information between two named entity mentions and then use them along with other features in a multi-class classification framework for Chinese relation extraction. Experiments on the ACE 2005 data set show that the position feature is more effective than other recognized features such as entity type/subtype and character-based N-gram context. Most importantly, it can be easily captured and does not require as much effort as applying deep natural language processing.

Opinion Annotation in On-line Chinese Product Reviews
Ruifeng Xu | Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the design and construction of a Chinese opinion corpus based on online product reviews. Based on observation of the characteristics of opinion expression in Chinese online product reviews, which are quite different from those in formal texts such as news, an annotation framework is proposed to guide the construction of the first Chinese opinion corpus based on online product reviews. The opinionated sentences are manually identified from the review text. Furthermore, for each comment in an opinionated sentence, its 13 describing elements are annotated, including the expressions related to the product attributes of interest and user opinions as well as the polarity and degree of the opinions. Currently, 12,724 comments are annotated in 10,935 sentences from review text. Through statistical analysis of the opinion corpus, some interesting characteristics of Chinese opinion expression are presented. This corpus is shown to be helpful in supporting systematic research on Chinese opinion analysis.

pdf bib
Preliminary Chinese Term Classification for Ontology Construction
Gaoying Cui | Qin Lu | Wenjie Li
Proceedings of the 6th Workshop on Asian Language Resources

pdf bib
PNR2: Ranking Sentences with Positive and Negative Reinforcement for Query-Oriented Update Summarization
Wenjie Li | Furu Wei | Qin Lu | Yanxiang He
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Extractive Summarization Using Supervised and Semi-Supervised Learning
Kam-Fai Wong | Mingli Wu | Wenjie Li
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
A Novel Feature-based Approach to Chinese Entity Relation Extraction
Wenjie Li | Peng Zhang | Furu Wei | Yuexian Hou | Qin Lu
Proceedings of ACL-08: HLT, Short Papers

2007

pdf bib
Extractive Summarization Based on Event Term Clustering
Maofu Liu | Wenjie Li | Mingli Wu | Qin Lu
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf bib
Annotating Chinese Collocations with Multi Information
Ruifeng Xu | Qin Lu | Kam-Fai Wong | Wenjie Li
Proceedings of the Linguistic Annotation Workshop

2006

pdf bib
Constructing A Chinese Chat Language Corpus with A Two-Stage Incremental Annotation Approach
Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Chat language refers to the special human language widely used in the community of digital network chat. As chat language holds anomalous characteristics in forming words, phrases, and non-alphabetical characters, conventional natural language processing tools are ineffective in handling chat language text. Previous research shows that knowledge-based methods perform less effectively in processing unseen chat terms. This motivates us to construct a chat language corpus so that corpus-based techniques of chat language text processing can be developed and evaluated. However, creating the corpus merely by hand is difficult. First, this work is manpower consuming. Second, annotation inconsistency is serious. To minimize manpower and annotation inconsistency, a two-stage incremental annotation approach is proposed in this paper for constructing a chat language corpus. Experiments conducted in this paper show that the performance of corpus annotation can be improved greatly with this approach.

pdf bib
Mining Implicit Entities in Queries
Wei Li | Wenjie Li | Qin Lu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Entities are pivotal in describing events and objects, and are also very important in Document Summarization. In general, only explicit entities that can be extracted by a Named Entity Recognizer are used in real applications. However, implicit entities hidden behind phrases or words, e.g., the entity referred to by the phrase “cross border”, have proved to be helpful in Document Summarization. In our experiment, we extract the implicit entities from web resources.

pdf bib
A Study on Terminology Extraction Based on Classified Corpora
Yirong Chen | Qin Lu | Wenjie Li | Zhifang Sui | Luning Ji
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Algorithms for automatic term extraction in a specific domain should consider at least two issues, namely Unithood and Termhood (Kageura, 1996). Unithood refers to the degree to which a string occurs as a word or a phrase. Termhood (Chen Yirong, 2005) refers to the degree to which a word or a phrase occurs as a domain-specific concept. Unlike unithood, studies on termhood are not yet widely reported. In classified corpora, the class information provides a cue to the nature of the data and can be used in termhood calculation. Three algorithms are provided and evaluated to investigate termhood based on classified corpora. The three algorithms are based on lexicon set computing, term frequency and document frequency, and the strength of the relation between a term and its document class, respectively. Our objective is to investigate the effects of these different termhood measurement features. After evaluation, we can find which features are more effective and how we can improve these different features to achieve the best performance. Preliminary results show that the first measure can effectively filter out independent terms or terms of general use.

pdf bib
Interaction between Lexical Base and Ontology with Formal Concept Analysis
Sujian Li | Qin Lu | Wenjie Li | Ruifeng Xu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

An ontology describes conceptual knowledge in a specific domain. A lexical base collects a repository of words and gives independent definitions of concepts. In this paper, we propose to use Formal Concept Analysis (FCA) as a tool to help construct an ontology from an existing lexical base. We mainly address two issues. The first issue is how to select attributes to visualize the relations between lexical terms. The second issue is how to revise lexical definitions through analysing the relations in the ontology. Thus the focus is on the effect of interaction between a lexical base and an ontology for the purpose of good ontology construction. Finally, experiments have been conducted to verify our ideas.

pdf bib
Extractive Summarization using Inter- and Intra- Event Relevance
Wenjie Li | Mingli Wu | Qin Lu | Wei Xu | Chunfa Yuan
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
A Phonetic-Based Approach to Chinese Chat Text Normalization
Yunqing Xia | Kam-Fai Wong | Wenjie Li
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
A Comparative Study of the Effect of Word Segmentation On Chinese Terminology Extraction
Luning Ji | Qin Lu | Wenjie Li | YiRong Chen
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

2005

pdf bib
A Preliminary Work on Classifying Time Granularities of Temporal Questions
Wei Li | Wenjie Li | Qin Lu | Kam-Fai Wong
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
CTEMP: A Chinese Temporal Parser for Extracting and Normalizing Temporal Information
Mingli Wu | Wenjie Li | Qin Lu | Baoli Li
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Integrating Collocation Features in Chinese Word Sense Disambiguation
Wanyin Li | Qin Lu | Wenjie Li
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

pdf bib
Experiments of Ontology Construction with Formal Concept Analysis
Sujian Li | Qin Lu | Wenjie Li
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources

2004

pdf bib
Applying Machine Learning to Chinese Temporal Relation Resolution
Wenjie Li | Kam-Fai Wong | Guihong Cao | Chunfa Yuan
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

pdf bib
Combining Linguistic Features with Weighted Bayesian Classifier for Temporal Reference Processing
Guihong Cao | Wenjie Li | Kam-Fai Wong | Chunfa Yuan
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2002

pdf bib
An Indexing Method Based on Sentences
Li Li | Chunfa Yuan | K.F. Wong | Wenjie Li
COLING-02: The First SIGHAN Workshop on Chinese Language Processing

2001

pdf bib
A Model For Processing Temporal References In Chinese
Wenjie Li | Kam-Fai Wong | Chunfa Yuan
Proceedings of the ACL 2001 Workshop on Temporal and Spatial Information Processing

2000

pdf bib
An Algorithm for Situation Classification of Chinese Verbs
Xiaodan Zhu | Chunfa Yuan | K.F. Wong | Wenjie Li
Second Chinese Language Processing Workshop

1995

pdf bib
Are Statistics-Based Approaches Good Enough For NLP? A Case Study Of Maximal-Length NP Extraction In Mandarin Chinese
Wenjie Li | Haihua Pan | Ming Zhou | Kam-Fai Wong | Vincent Lum
Proceedings of Rocling VIII Computational Linguistics Conference VIII