Wei-Yun Ma

Also published as: Wei Yun Ma

2024

pdf bib abs
Plug-in Language Model: Controlling Text Generation with a Simple Regression Model
Nai-Chi Yang | Wei-Yun Ma | Pu-Jen Cheng
Findings of the Association for Computational Linguistics: NAACL 2024

Large-scale pre-trained language models have displayed unrivaled capacity in generating text that closely resembles human-written text. Nevertheless, generating texts adhering to specific conditions without fine-tuning or adding new parameters can be challenging. Contemporary approaches commonly rely on either prompts or auxiliary models to avoid modifying the language models. These auxiliary models are designed to assess whether a generated token contributes to meeting the desired requirements. These approaches adjust the distribution of the next token during the inference phase by leveraging the prediction score of the desired attribute to calculate gradients. However, these auxiliary models typically require the language model’s latent states. This prerequisite challenges integrating various existing black box attribute models or tools. We present the Plug-in Language Model (PiLM) as a solution to address the limitations. PiLM leverages reinforcement learning to utilize black box tools directly, adjusting the latent state to control text generation. However, performing backpropagation during the inference phase is time-consuming for PiLM. By replacing backpropagation with a simple regression model, PiLM can achieve an inference time comparable to that of the original LLM. Experiment results show that our approaches in this paper outperform existing state-of-the-art methods that rely on gradient-based, weighted decoding, or prompt-based methodologies.

pdf bib abs
Automatic Construction of a Chinese Review Dataset for Aspect Sentiment Triplet Extraction via Iterative Weak Supervision
Chia-Wen Lu | Ching-Wen Yang | Wei-Yun Ma
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Aspect Sentiment Triplet Extraction (ASTE), introduced in 2020, is a task that involves the extraction of three key elements: target aspects, descriptive opinion spans, and their corresponding sentiment polarity. This process, however, faces a significant hurdle, particularly when applied to Chinese languages, due to the lack of sufficient datasets for model training, largely attributable to the arduous manual labeling process. To address this issue, we present an innovative framework that facilitates the automatic construction of ASTE via Iterative Weak Supervision, negating the need for manual labeling, aided by a discriminator to weed out subpar samples. The objective is to successively improve the quality of this raw data and generate supplementary data. The effectiveness of our approach is underscored by our results, which include the creation of a substantial Chinese review dataset. This dataset encompasses over 60,000 Google restaurant reviews in Chinese and features more than 200,000 extracted triplets. Moreover, we have also established a robust baseline model by leveraging a novel method of weak supervision. Both our dataset and model are openly accessible to the public.

pdf bib abs
Generating Attractive and Authentic Copywriting from Customer Reviews
Yu-Xiang Lin | Wei-Yun Ma
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The goal of product copywriting is to capture the interest of potential buyers by emphasizing the features of products through text descriptions. As e-commerce platforms offer a wide range of services, it’s becoming essential to dynamically adjust the styles of these auto-generated descriptions. Typical approaches to copywriting generation often rely solely on specified product attributes, which may result in dull and repetitive content. To tackle this issue, we propose to generate copywriting based on customer reviews, as they provide firsthand practical experiences with products, offering a richer source of information than just product attributes. We have developed a sequence-to-sequence framework, enhanced with reinforcement learning, to produce copywriting that is attractive, authentic, and rich in information. Our framework outperforms all existing baseline and zero-shot large language models, including LLaMA-2-chat-7B and GPT-3.5, in terms of both attractiveness and faithfulness. Furthermore, this work features the use of LLMs for aspect-based summaries collection and argument allure assessment. Experiments demonstrate the effectiveness of using LLMs for marketing domain corpus construction. The code and the dataset is publicly available at: https://github.com/YuXiangLin1234/Copywriting-Generation.

2022

pdf bib abs
Converting the Sinica Treebank of Mandarin Chinese to Universal Dependencies
Yu-Ming Hsieh | Yueh-Yin Shih | Wei-Yun Ma
Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022

This paper describes the conversion of the Sinica Treebank, one of the major Mandarin Chinese treebanks, to Universal Dependencies. The conversion is rule-based and the process involves POS tag mapping, head adjusting in line with the UD scheme and the dependency conversion. Linguistic insights into Mandarin Chinese alongwith the conversion are also discussed. The resulting corpus is the UD Chinese Sinica Treebank which contains more than fifty thousand tree structures according to the UD scheme. The dataset can be downloaded at https://github.com/ckiplab/ud.

pdf bib abs
HanTrans: An Empirical Study on Cross-Era Transferability of Chinese Pre-trained Language Model
Chin-Tung Lin | Wei-Yun Ma
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

The pre-trained language model has recently dominated most downstream tasks in the NLP area. Particularly, bidirectional Encoder Representations from Transformers (BERT) is the most iconic pre-trained language model among the NLP tasks. Their proposed masked-language modeling (MLM) is an indispensable part of the existing pre-trained language models. Those outperformed models for downstream tasks benefited directly from the large training corpus in the pre-training stage. However, their training corpus for modern traditional Chinese was light. Most of all, the ancient Chinese corpus is still disappearance in the pre-training stage. Therefore, we aim to address this problem by transforming the annotation data of ancient Chinese into BERT style training corpus. Then we propose a pre-trained Oldhan Chinese BERT model for the NLP community. Our proposed model outperforms the original BERT model by significantly reducing perplexity scores in masked-language modeling (MLM). Also, our fine-tuning models improve F1 scores on word segmentation and part-of-speech tasks. Then we comprehensively study zero-shot cross-eras ability in the BERT model. Finally, we visualize and investigate personal pronouns in the embedding space of ancient Chinese records from four eras. We have released our code at https://github.com/ckiplab/han-transformers.

2021

pdf bib
H-FND: Hierarchical False-Negative Denoising for Distant Supervision Relation Extraction
Jhih-wei Chen | Tsu-Jui Fu | Chen-Kang Lee | Wei-Yun Ma
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

In this paper, we design headword-oriented entity linking (HEL), a specialized entity linking problem in which only the headwords of the entities are to be linked to knowledge bases; mention scopes of the entities do not need to be identified in the problem setting. This special task is motivated by the fact that in many articles referring to specific products, the complete full product names are rarely written; instead, they are often abbreviated to shorter, irregular versions or even just to their headwords, which are usually their product types, such as “stick” or “mask” in a cosmetic context. To fully design the special task, we construct a labeled cosmetic corpus as a public benchmark for this problem, and propose a product embedding model to address the task, where each product corresponds to a dense representation to encode the different information on products and their context jointly. Besides, to increase training data, we propose a special transfer learning framework in which distant supervision with heuristic patterns is first utilized, followed by supervised learning using a small amount of manually labeled data. The experimental results show that our model provides a strong benchmark performance on the special task.

pdf bib abs
CA-EHN: Commonsense Analogy from E-HowNet
Peng-Hsuan Li | Tsan-Yu Yang | Wei-Yun Ma
Proceedings of the Twelfth Language Resources and Evaluation Conference

Embedding commonsense knowledge is crucial for end-to-end models to generalize inference beyond training corpora. However, existing word analogy datasets have tended to be handcrafted, involving permutations of hundreds of words with only dozens of pre-defined relations, mostly morphological relations and named entities. In this work, we model commonsense knowledge down to word-level analogical reasoning by leveraging E-HowNet, an ontology that annotates 88K Chinese words with their structured sense definitions and English translations. We present CA-EHN, the first commonsense word analogy dataset containing 90,505 analogies covering 5,656 words and 763 relations. Experiments show that CA-EHN stands out as a great indicator of how well word representations embed commonsense knowledge. The dataset is publicly available at https://github.com/ckiplab/CA-EHN.

pdf bib abs
Semantic Guidance of Dialogue Generation with Reinforcement Learning
Cheng-Hsun Hsueh | Wei-Yun Ma
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Neural encoder-decoder models have shown promising performance for human-computer dialogue systems over the past few years. However, due to the maximum-likelihood objective for the decoder, the generated responses are often universal and safe to the point that they lack meaningful information and are no longer relevant to the post. To address this, in this paper, we propose semantic guidance using reinforcement learning to ensure that the generated responses indeed include the given or predicted semantics and that these semantics do not appear repeatedly in the response. Synsets, which comprise sets of manually defined synonyms, are used as the form of assigned semantics. For a given/assigned/predicted synset, only one of its synonyms should appear in the generated response; this constitutes a simple but effective semantic-control mechanism. We conduct both quantitative and qualitative evaluations, which show that the generated responses are not only higher-quality but also reflect the assigned semantic controls.

2019

pdf bib abs
iComposer: An Automatic Songwriting System for Chinese Popular Music
Hsin-Pei Lee | Jhih-Sheng Fang | Wei-Yun Ma
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

In this paper, we introduce iComposer, an interactive web-based songwriting system designed to assist human creators by greatly simplifying music production. iComposer automatically creates melodies to accompany any given text. It also enables users to generate a set of lyrics given arbitrary melodies. iComposer is based on three sequence-to-sequence models, which are used to predict melody, rhythm, and lyrics, respectively. Songs generated by iComposer are compared with human-composed and randomly-generated ones in a subjective test, the experimental results of which demonstrate the capability of the proposed system to write pleasing melodies and meaningful lyrics at a level similar to that of humans.

pdf bib abs
GraphRel: Modeling Text as Relational Graphs for Joint Entity and Relation Extraction
Tsu-Jui Fu | Peng-Hsuan Li | Wei-Yun Ma
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we present GraphRel, an end-to-end relation extraction model which uses graph convolutional networks (GCNs) to jointly learn named entities and relations. In contrast to previous baselines, we consider the interaction between named entities and relations via a 2nd-phase relation-weighted GCN to better extract relations. Linear and dependency structures are both used to extract both sequential and regional features of the text, and a complete word graph is further utilized to extract implicit features among all word pairs of the text. With the graph-based approach, the prediction for overlapping relations is substantially improved over previous sequential approaches. We evaluate GraphRel on two public datasets: NYT and WebNLG. Results show that GraphRel maintains high precision while increasing recall substantially. Also, GraphRel outperforms previous work by 3.2% and 5.8% (F1 score), achieving a new state-of-the-art for relation extraction.

2018

pdf bib abs
Speed Reading: Learning to Read ForBackward via Shuttle
Tsu-Jui Fu | Wei-Yun Ma
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present LSTM-Shuttle, which applies human speed reading techniques to natural language processing tasks for accurate and efficient comprehension. In contrast to previous work, LSTM-Shuttle not only reads shuttling forward but also goes back. Shuttling forward enables high efficiency, and going backward gives the model a chance to recover lost information, ensuring better prediction. We evaluate LSTM-Shuttle on sentiment analysis, news classification, and cloze on IMDB, Rotten Tomatoes, AG, and Children’s Book Test datasets. We show that LSTM-Shuttle predicts both better and more quickly. To demonstrate how LSTM-Shuttle actually behaves, we also analyze the shuttling operation and present a case study.

pdf bib
Word Embedding Evaluation Datasets and Wikipedia Title Embedding for Chinese
Chi-Yen Chen | Wei-Yun Ma
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Extended HowNet 2.0 – An Entity-Relation Common-Sense Representation Model
Wei-Yun Ma | Yueh-Yin Shih
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib abs
Leveraging Linguistic Structures for Named Entity Recognition with Bidirectional Recursive Neural Networks
Peng-Hsuan Li | Ruo-Ping Dong | Yu-Siang Wang | Ju-Chieh Chou | Wei-Yun Ma
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we utilize the linguistic structures of texts to improve named entity recognition by BRNN-CNN, a special bidirectional recursive network attached with a convolutional network. Motivated by the observation that named entities are highly related to linguistic constituents, we propose a constituent-based BRNN-CNN for named entity recognition. In contrast to classical sequential labeling methods, the system first identifies which text chunks are possible named entities by whether they are linguistic constituents. Then it classifies these chunks with a constituency tree structure by recursively propagating syntactic and semantic information to each constituent node. This method surpasses current state-of-the-art on OntoNotes 5.0 with automatically generated parses.

pdf bib abs
Integrating Semantic Knowledge into Lexical Embeddings Based on Information Content Measurement
Hsin-Yang Wang | Wei-Yun Ma
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Distributional word representations are widely used in NLP tasks. These representations are based on an assumption that words with a similar context tend to have a similar meaning. To improve the quality of the context-based embeddings, many researches have explored how to make full use of existing lexical resources. In this paper, we argue that while we incorporate the prior knowledge with context-based embeddings, words with different occurrences should be treated differently. Therefore, we propose to rely on the measurement of information content to control the degree of applying prior knowledge into context-based embeddings - different words would have different learning rates when adjusting their embeddings. In the result, we demonstrate that our embeddings get significant improvements on two different tasks: Word Similarity and Analogical Reasoning.

pdf bib abs
Guess What: A Question Answering Game via On-demand Knowledge Validation
Yu-Sheng Li | Chien-Hui Tseng | Chian-Yun Huang | Wei-Yun Ma
Proceedings of the IJCNLP 2017, System Demonstrations

In this paper, we propose an idea of ondemand knowledge validation and fulfill the idea through an interactive Question-Answering (QA) game system, which is named Guess What. An object (e.g. dog) is first randomly chosen by the system, and then a user can repeatedly ask the system questions in natural language to guess what the object is. The system would respond with yes/no along with a confidence score. Some useful hints can also be given if needed. The proposed framework provides a pioneering example of on-demand knowledge validation in dialog environment to address such needs in AI agents/chatbots. Moreover, the released log data that the system gathered can be used to identify the most critical concepts/attributes of an existing knowledge base, which reflects human’s cognition about the world.

pdf bib abs
CKIP at IJCNLP-2017 Task 2: Neural Valence-Arousal Prediction for Phrases
Peng-Hsuan Li | Wei-Yun Ma | Hsin-Yang Wang
Proceedings of the IJCNLP 2017, Shared Tasks

CKIP takes part in solving the Dimensional Sentiment Analysis for Chinese Phrases (DSAP) share task of IJCNLP 2017. This task calls for systems that can predict the valence and the arousal of Chinese phrases, which are real values between 1 and 9. To achieve this, functions mapping Chinese character sequences to real numbers are built by regression techniques. In addition, the CKIP phrase Valence-Arousal (VA) predictor depends on knowledge of modifier words and head words. This includes the types of known modifier words, VA of head words, and distributional semantics of both these words. The predictor took the second place out of 13 teams on phrase VA prediction, with 0.444 MAE and 0.935 PCC on valence, and 0.395 MAE and 0.904 PCC on arousal.

In this paper, we propose a novel lattice-based MT combination methodology that we call Target-to-Target Decoding (TTD). The combination process is carried out as a “translation” from backbone to the combination result. This perspective suggests the use of existing phrase-based MT techniques in the combination framework. We show how phrase extraction rules and confidence estimations inspired from machine translation improve results. We also propose system-specific LMs for estimating N-gram consensus. Our results show that our approach yields a strong improvement over the best single MT system and competes with other state-of-the-art combination systems.

pdf bib
Detecting and Correcting Syntactic Errors in Machine Translation Using Feature-Based Lexicalized Tree Adjoining Grammars
Wei-Yun Ma | Kathleen McKeown
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

pdf bib
Detecting and Correcting Syntactic Errors in Machine Translation Using Feature-Based Lexicalized Tree Adjoining Grammars
Wei-Yun Ma | Kathleen McKeown
International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 4, December 2012-Special Issue on Selected Papers from ROCLING XXIV

2011

pdf bib
System Combination for Machine Translation Based on Text-to-Text Generation
Wei-Yun Ma | Kathleen Mckeown
Proceedings of Machine Translation Summit XIII: Papers

2009

pdf bib
Where’s the Verb? Correcting Machine Translation During Question Answering
Wei-Yun Ma | Kathy McKeown
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2006

pdf bib abs
Uniform and Effective Tagging of a Heterogeneous Giga-word Corpus
Wei-Yun Ma | Chu-Ren Huang
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Tagging as the most crucial annotation of language resources can still be challenging when the corpus size is big and when the corpus data is not homogeneous. The Chinese Gigaword Corpus is confounded by both challenges. The corpus containsroughly 1.12 billion Chinese characters from two heterogeneous sources: respective news in Taiwan and in Mainland China. In other words, in addition to its size, the data also contains two variants of Chinese that are known to exhibit substantial linguistic differences. We utilize Chinese Sketch Engine as the corpus query tool, by which grammar behaviours of the two heterogeneous resources could be captured and displayed in a unified web interface. In this paper, we report our answer to the two challenges to effectively tag this large-scale corpus. The evaluation result shows our mechanism of tagging maintains high annotation quality.

pdf bib
中文動詞名物化判斷的統計式模型設計 (A Stochastic Model for Prediction of Deverbal Nouns in Mandarin Chinese) [In Chinese]
Wei-Yun Ma | Chu-Ren Huang
Proceedings of the 18th Conference on Computational Linguistics and Speech Processing

pdf bib
Knowledge-Rich Approach to Automatic Grammatical Information Acquisition: Enriching Chinese Sketch Engine with a Lexical Grammar
Chu-Ren Huang | Wei-Yun Ma | Yi-Ching Wu | Chih-Ming Chiu
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation