Wei Liu


2021

pdf bib
Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter
Wei Liu | Xiyan Fu | Yue Zhang | Wenming Xiao
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Lexicon information and pre-trained models, such as BERT, have been combined to explore Chinese sequence labeling tasks due to their respective strengths. However, existing methods solely fuse lexicon features via a shallow and random initialized sequence layer and do not integrate them into the bottom layers of BERT. In this paper, we propose Lexicon Enhanced BERT (LEBERT) for Chinese sequence labeling, which integrates external lexicon knowledge into BERT layers directly by a Lexicon Adapter layer. Compared with existing methods, our model facilitates deep lexicon knowledge fusion at the lower layers of BERT. Experiments on ten Chinese datasets of three tasks including Named Entity Recognition, Word Segmentation, and Part-of-Speech Tagging, show that LEBERT achieves state-of-the-art results.

pdf bib
UniKeyphrase: A Unified Extraction and Generation Framework for Keyphrase Prediction
Huanqin Wu | Wei Liu | Lei Li | Dan Nie | Tao Chen | Feng Zhang | Di Wang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Low-Resource Generation of Multi-hop Reasoning Questions
Jianxing Yu | Wei Liu | Shuang Qiu | Qinliang Su | Kai Wang | Xiaojun Quan | Jian Yin
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper focuses on generating multi-hop reasoning questions from the raw text in a low resource circumstance. Such questions have to be syntactically valid and need to logically correlate with the answers by deducing over multiple relations on several sentences in the text. Specifically, we first build a multi-hop generation model and guide it to satisfy the logical rationality by the reasoning chain extracted from a given text. Since the labeled data is limited and insufficient for training, we propose to learn the model with the help of a large scale of unlabeled data that is much easier to obtain. Such data contains rich expressive forms of the questions with structural patterns on syntax and semantics. These patterns can be estimated by the neural hidden semi-Markov model using latent variables. With latent patterns as a prior, we can regularize the generation model and produce the optimal results. Experimental results on the HotpotQA data set demonstrate the effectiveness of our model. Moreover, we apply the generated results to the task of machine reading comprehension and achieve significant performance improvements.

pdf bib
CIST@CL-SciSumm 2020, LongSumm 2020: Automatic Scientific Document Summarization
Lei Li | Yang Xie | Wei Liu | Yinan Liu | Yafei Jiang | Siya Qi | Xingyuan Li
Proceedings of the First Workshop on Scholarly Document Processing

Our system participates in two shared tasks, CL-SciSumm 2020 and LongSumm 2020. In the CL-SciSumm shared task, based on our previous work, we apply more machine learning methods on position features and content features for facet classification in Task1B. And GCN is introduced in Task2 to perform extractive summarization. In the LongSumm shared task, we integrate both the extractive and abstractive summarization ways. Three methods were tested which are T5 Fine-tuning, DPPs Sampling, and GRU-GCN/GAT.

2019

pdf bib
Redcoat: A Collaborative Annotation Tool for Hierarchical Entity Typing
Michael Stewart | Wei Liu | Rachel Cardell-Oliver
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

We introduce Redcoat, a web-based annotation tool that supports collaborative hierarchical entity typing. As an annotation tool, Redcoat also facilitates knowledge elicitation by allowing the creation and continuous refinement of concept hierarchies during annotation. It aims to minimise not only annotation time but the time it takes for project creators to set up and distribute projects to annotators. Projects created using the web-based interface can be rapidly distributed to a list of email addresses. Redcoat handles the propagation of documents amongst annotators and automatically scales the annotation workload depending on the number of active annotators. In this paper we discuss these key features and outline Redcoat’s system architecture. We also highlight Redcoat’s unique benefits over existing annotation tools via a qualitative comparison.

pdf bib
An Encoding Strategy Based Word-Character LSTM for Chinese NER
Wei Liu | Tongge Xu | Qinghua Xu | Jiayu Song | Yueran Zu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

A recently proposed lattice model has demonstrated that words in character sequence can provide rich word boundary information for character-based Chinese NER model. In this model, word information is integrated into a shortcut path between the start and the end characters of the word. However, the existence of shortcut path may cause the model to degenerate into a partial word-based model, which will suffer from word segmentation errors. Furthermore, the lattice model can not be trained in batches due to its DAG structure. In this paper, we propose a novel word-character LSTM(WC-LSTM) model to add word information into the start or the end character of the word, alleviating the influence of word segmentation errors while obtaining the word boundary information. Four different strategies are explored in our model to encode word information into a fixed-sized representation for efficient batch training. Experiments on benchmark datasets show that our proposed model outperforms other state-of-the-arts models.

pdf bib
In Conclusion Not Repetition: Comprehensive Abstractive Summarization with Diversified Attention Based on Determinantal Point Processes
Lei Li | Wei Liu | Marina Litvak | Natalia Vanetik | Zuying Huang
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Various Seq2Seq learning models designed for machine translation were applied for abstractive summarization task recently. Despite these models provide high ROUGE scores, they are limited to generate comprehensive summaries with a high level of abstraction due to its degenerated attention distribution. We introduce Diverse Convolutional Seq2Seq Model(DivCNN Seq2Seq) using Determinantal Point Processes methods(Micro DPPs and Macro DPPs) to produce attention distribution considering both quality and diversity. Without breaking the end to end architecture, DivCNN Seq2Seq achieves a higher level of comprehensiveness compared to vanilla models and strong baselines. All the reproducible codes and datasets are available online.

pdf bib
Multi-lingual Wikipedia Summarization and Title Generation On Low Resource Corpus
Wei Liu | Lei Li | Zuying Huang | Yinan Liu
Proceedings of the Workshop MultiLing 2019: Summarization Across Languages, Genres and Sources

MultiLing 2019 Headline Generation Task on Wikipedia Corpus raised a critical and practical problem: multilingual task on low resource corpus. In this paper we proposed QDAS extractive summarization model enhanced by sentence2vec and try to apply transfer learning based on large multilingual pre-trained language model for Wikipedia Headline Generation task. We treat it as sequence labeling task and develop two schemes to handle with it. Experimental results have shown that large pre-trained model can effectively utilize learned knowledge to extract certain phrase using low resource supervised data.

2018

pdf bib
NovelPerspective: Identifying Point of View Characters
Lyndon White | Roberto Togneri | Wei Liu | Mohammed Bennamoun
Proceedings of ACL 2018, System Demonstrations

We present NovelPerspective: a tool to allow consumers to subset their digital literature, based on point of view (POV) character. Many novels have multiple main characters each with their own storyline running in parallel. A well-known example is George R. R. Martin’s novel: “A Game of Thrones”, and others from that series. Our tool detects the main character that each section is from the POV of, and allows the user to generate a new ebook with only those sections. This gives consumers new options in how they consume their media; allowing them to pursue the storylines sequentially, or skip chapters about characters they find boring. We present two heuristic-based baselines, and two machine learning based methods for the detection of the main character.

2016

pdf bib
Featureless Domain-Specific Term Extraction with Minimal Labelled Data
Rui Wang | Wei Liu | Chris McDonald
Proceedings of the Australasian Language Technology Association Workshop 2016

2014

pdf bib
Fast Tweet Retrieval with Compact Binary Codes
Weiwei Guo | Wei Liu | Mona Diab
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Two-Stage Hashing for Fast Document Retrieval
Hao Li | Wei Liu | Heng Ji
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2010

pdf bib
Detecting Word Misuse in Chinese
Wei Liu
Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics in a World of Social Media

pdf bib
Efficient Minimal Perfect Hash Language Models
David Guthrie | Mark Hepple | Wei Liu
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The availability of large collections of text have made it possible to build language models that incorporate counts of billions of n-grams. This paper proposes two new methods of efficiently storing large language models that allow O(1) random access and use significantly less space than all known approaches. We introduce two novel data structures that take advantage of the distribution of n-grams in corpora and make use of various numbers of minimal perfect hashes to compactly store language models containing full frequency counts of billions of n-grams using 2.5 Bytes per n-gram and language models of quantized probabilities using 2.26 Bytes per n-gram. These methods allow language processing applications to take advantage of much larger language models than previously was possible using the same hardware and we additionally describe how they can be used in a distributed environment to store even larger models. We show that our approaches are simple to implement and can easily be combined with pruning and quantization to achieve additional reductions in the size of the language model.

2008

pdf bib
Determining the Unithood of Word Sequences Using a Probabilistic Approach
Wilson Wong | Wei Liu | Mohammed Bennamoun
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Professor or Screaming Beast? Detecting Anomalous Words in Chinese
Wei Liu | Ben Allison | Louise Guthrie
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The Internet has become the most popular platform for communication. However because most of the modern computer keyboard is Latin-based, Asian languages such as Chinese cannot input its characters (Hanzi) directly with these keyboards. As a result, methods for representing Chinese characters using Latin alphabets were introduced. The most popular method among these is the Pinyin input system. Pinyin is also called “Romanised” Chinese in that it phonetically resembles a Chinese character. Due to the highly ambiguous mapping from Pinyin to Chinese characters, word misuses can occur using standard computer keyboard, and more commonly so in internet chat-rooms or instant messengers where the language used is less formal. In this paper we aim to develop a system that can automatically identify such anomalies, whether they are simple typos or whether they are intentional. After identifying them, the system should suggest the correct word to be used.

2006

pdf bib
A Closer Look at Skip-gram Modelling
David Guthrie | Ben Allison | Wei Liu | Louise Guthrie | Yorick Wilks
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Data sparsity is a large problem in natural language processing that refers to the fact that language is a system of rare events, so varied and complex, that even using an extremely large corpus, we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique where by n-grams are still stored to model language, but they allow for tokens to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measure how many adjacent (standard) n-grams these cover in test documents. We examine skip-gram modelling using one to four skips with various amount of training data and test against similar documents as well as documents generated from a machine translation system. In this paper we also determine the amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams.