Min-Yen Kan


2021

Velocidapter: Task-oriented Dialogue Comprehension Modeling Pairing Synthetic Text Generation with Domain Adaptation
Ibrahim Taha Aksu | Zhengyuan Liu | Min-Yen Kan | Nancy Chen
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

We introduce a synthetic dialogue generation framework, Velocidapter, which addresses the corpus availability problem for dialogue comprehension. Velocidapter augments datasets by simulating synthetic conversations for a task-oriented dialogue domain, requiring a small amount of bootstrapping work for each new domain. We evaluate the efficacy of our framework on a task-oriented dialogue comprehension dataset, MRCWOZ, which we curate by annotating questions for slots in the restaurant, taxi, and hotel domains of the MultiWOZ 2.2 dataset (Zang et al., 2020). We run experiments in a low-resource setting, where we pretrain a model on SQuAD and then fine-tune it on either a small amount of original data or the synthetic data generated by our framework. Velocidapter shows significant improvements using both the transformer-based BERT-Base and BiDAF as base models. We further show that the framework is easy for novice users to use and conclude that Velocidapter can greatly help training over task-oriented dialogues, especially for low-resourced emerging domains.

Domain Divergences: A Survey and Empirical Analysis
Abhinav Ramesh Kashyap | Devamanyu Hazarika | Min-Yen Kan | Roger Zimmermann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Domain divergence plays a significant role in estimating the performance of a model in new domains. While there is a significant literature on divergence measures, researchers find it hard to choose an appropriate divergence for a given NLP application. We address this shortcoming by both surveying the literature and conducting an empirical study. We develop a taxonomy of divergence measures consisting of three classes (Information-theoretic, Geometric, and Higher-order measures) and identify the relationships between them. Further, to understand the common use-cases of these measures, we recognise three novel applications, 1) Data Selection, 2) Learning Representation, and 3) Decisions in the Wild, and use them to organise the literature. From this, we identify that Information-theoretic measures are prevalent for 1) and 3), and Higher-order measures are more common for 2). To further help researchers choose appropriate measures to predict drop in performance, an important aspect of Decisions in the Wild, we perform correlation analysis spanning 130 domain adaptation scenarios, 3 varied NLP tasks and 12 divergence measures identified from our survey. To calculate these divergences, we consider the current contextual word representations (CWR) and contrast them with the older distributed representations. We find that traditional measures over word distributions still serve as strong baselines, while higher-order measures with CWR are effective.
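As a rough illustration of what an information-theoretic measure over word distributions can look like, the sketch below computes the Jensen-Shannon divergence between the unigram distributions of two toy "domains". The abstract does not say which measures the paper uses or how they are implemented, so the choice of Jensen-Shannon, the example sentences, and the function names (`unigram_dist`, `kl`, `js_divergence`) are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_dist(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy "domains" (illustrative data, not from the paper).
source = "the movie was great and the cast was great".split()
target = "the patient was given the drug and recovered".split()

vocab = sorted(set(source) | set(target))
p = unigram_dist(source, vocab)
q = unigram_dist(target, vocab)
divergence = js_divergence(p, q)
print(f"JS divergence between toy domains: {divergence:.3f}")
```

Averaging the two distributions before taking the KL terms keeps the measure finite even when a word appears in only one domain, which is the usual reason to prefer it over plain KL divergence on sparse word counts.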

Unsupervised Multi-hop Question Answering by Question Generation
Liangming Pan | Wenhu Chen | Wenhan Xiong | Min-Yen Kan | William Yang Wang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Obtaining training data for multi-hop question answering (QA) is time-consuming and resource-intensive. We explore the possibility of training a well-performing multi-hop QA model without referencing any human-labeled multi-hop question-answer pairs, i.e., unsupervised multi-hop QA. We propose MQA-QG, an unsupervised framework that can generate human-like multi-hop training data from both homogeneous and heterogeneous data sources. MQA-QG generates questions by first selecting or generating relevant information from each data source and then integrating the multiple pieces of information to form a multi-hop question. Using only generated training data, we can train a competent multi-hop QA model that achieves 61% and 83% of the supervised learning performance for the HybridQA and the HotpotQA datasets, respectively. We also show that pretraining the QA system with the generated data greatly reduces the demand for human-annotated training data. Our code is publicly available at https://github.com/teacherpeterpan/Unsupervised-Multi-hop-QA.

Reliability Testing for Natural Language Processing Systems
Samson Tan | Shafiq Joty | Kathy Baxter | Araz Taeihagh | Gregory A. Bennett | Min-Yen Kan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Questions of fairness, robustness, and transparency are paramount to address before deploying NLP systems. Central to these concerns is the question of reliability: Can NLP systems reliably treat different demographics fairly and function correctly in diverse and noisy environments? To address this, we argue for the need for reliability testing and contextualize it among existing work on improving accountability. We show how adversarial attacks can be reframed for this goal, via a framework for developing reliability tests. We argue that reliability testing — with an emphasis on interdisciplinary collaboration — will enable rigorous and targeted testing, and aid in the enactment and enforcement of industry standards.

Zero-shot Fact Verification by Claim Generation
Liangming Pan | Wenhu Chen | Wenhan Xiong | Min-Yen Kan | William Yang Wang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Neural models for automated fact verification have achieved promising results thanks to the availability of large, human-annotated datasets. However, for each new domain that requires fact verification, creating a dataset by manually writing claims and linking them to their supporting evidence is expensive. We develop QACG, a framework for training a robust fact verification model by using automatically generated claims that can be supported, refuted, or unverifiable from evidence from Wikipedia. QACG generates question-answer pairs from the evidence and then converts them into different types of claims. Experiments on the FEVER dataset show that our QACG framework significantly reduces the demand for human-annotated training data. In a zero-shot scenario, QACG improves a RoBERTa model’s F1 from 50% to 77%, equivalent in performance to 2K+ manually-curated examples. Our QACG code is publicly available.

Analyzing the Domain Robustness of Pretrained Language Models, Layer by Layer
Abhinav Ramesh Kashyap | Laiba Mehnaz | Bhavitvya Malik | Abdul Waheed | Devamanyu Hazarika | Min-Yen Kan | Rajiv Ratn Shah
Proceedings of the Second Workshop on Domain Adaptation for NLP

The robustness of pretrained language models (PLMs) is generally measured using performance drops on two or more domains. However, we do not yet understand the inherent robustness achieved by contributions from different layers of a PLM. We systematically analyze the robustness of these representations layer by layer from two perspectives. First, we measure the robustness of representations by using domain divergence between two domains. We find that i) domain variance increases from the lower to the upper layers for vanilla PLMs; ii) models continuously pretrained on domain-specific data (DAPT; Gururangan et al., 2020) exhibit more variance than their pretrained PLM counterparts; and iii) distilled models (e.g., DistilBERT) also show greater domain variance. Second, we investigate the robustness of representations by analyzing the encoded syntactic and semantic information using diagnostic probes. We find that similar layers have similar amounts of linguistic information for data from an unseen domain.

2020

Exploring Question-Specific Rewards for Generating Deep Questions
Yuxi Xie | Liangming Pan | Dongzhe Wang | Min-Yen Kan | Yansong Feng
Proceedings of the 28th International Conference on Computational Linguistics

Recent question generation (QG) approaches often utilize the sequence-to-sequence framework (Seq2Seq) to optimize the log likelihood of ground-truth questions using teacher forcing. However, this training objective is inconsistent with actual question quality, which is often reflected by certain global properties such as whether the question can be answered by the document. As such, we directly optimize for QG-specific objectives via reinforcement learning to improve question quality. We design three different rewards that aim to improve the fluency, relevance, and answerability of generated questions. We conduct both automatic and human evaluations in addition to thorough analysis to explore the effect of each QG-specific reward. We find that optimizing on question-specific rewards generally leads to better performance in automatic evaluation metrics. However, only the rewards that correlate well with human judgement (e.g., relevance) lead to real improvement in question quality. Optimizing for the others, especially answerability, introduces incorrect bias to the model, resulting in poorer question quality. The code is publicly available at https://github.com/YuxiXie/RL-for-Question-Generation.

Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure
Jiaqi Li | Ming Liu | Min-Yen Kan | Zihao Zheng | Zekun Wang | Wenqiang Lei | Ting Liu | Bing Qin
Proceedings of the 28th International Conference on Computational Linguistics

Research into the area of multiparty dialog has grown considerably over recent years. We present the Molweni dataset, a machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni’s source is sampled from the Ubuntu Chat Corpus and includes 10,000 dialogs comprising 88,303 utterances. We annotate 30,066 questions on this corpus, including both answerable and unanswerable questions. Molweni also uniquely contributes discourse dependency annotations in a modified Segmented Discourse Representation Theory (SDRT; Asher et al., 2016) style for all of its multiparty dialogs, contributing large-scale (78,245 annotated discourse relations) data to bear on the task of multiparty dialog discourse parsing. Our experiments show that Molweni is a challenging dataset for current MRC models: BERT-wwm, a current, strong SQuAD 2.0 performer, achieves only 67.7% F1 on Molweni’s questions, a significant drop of more than 20% compared with its SQuAD 2.0 performance.

Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework
Akshay Bhola | Kishaloy Halder | Animesh Prasad | Min-Yen Kan
Proceedings of the 28th International Conference on Computational Linguistics

We introduce a deep learning model to learn the set of enumerated job skills associated with a job description. In our analysis of a large-scale government job portal, mycareersfuture.sg, we observe that as many as 65% of job descriptions miss describing a significant number of relevant skills. Our model addresses this task from the perspective of an extreme multi-label classification (XMLC) problem, where descriptions are the evidence for the binary relevance of thousands of individual skills. Building upon the current state-of-the-art language modeling approaches such as BERT, we show our XMLC method improves on an existing baseline solution by over 9% and 7% absolute in terms of recall and normalized discounted cumulative gain, respectively. We further show that our approach effectively addresses the missing skills problem and helps recover relevant skills that were missed in the job postings by taking into account the structured semantic representation of skills and their co-occurrences through a Correlation Aware Bootstrapping process. To facilitate future research and replication of our work, we have made the dataset and the implementation of our model publicly available.
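For readers unfamiliar with the two metrics named in this abstract, the following minimal sketch shows how recall at k and normalized discounted cumulative gain at k are commonly computed for a ranked list of predicted skills with binary relevance. The abstract does not give the paper's exact evaluation protocol (cut-off, gain function, averaging), so the function names, the cut-off `k`, and the skill lists are illustrative assumptions.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant skills that appear in the top-k predictions."""
    if not relevant:
        return 0.0
    hits = sum(1 for skill in ranked[:k] if skill in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the best possible DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, skill in enumerate(ranked[:k])
        if skill in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical model output and gold labels for one job description.
ranked_skills = ["python", "sql", "excel", "java"]
gold_skills = {"python", "java"}

print(recall_at_k(ranked_skills, gold_skills, k=2))
print(round(ndcg_at_k(ranked_skills, gold_skills, k=2), 4))
```

Recall rewards finding the relevant skills at all, while nDCG additionally rewards placing them near the top of the list, which is why the two are often reported together for ranking-style multi-label tasks.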

Expertise Style Transfer: A New Task Towards Better Communication between Experts and Laymen
Yixin Cao | Ruihao Shui | Liangming Pan | Min-Yen Kan | Zhiyuan Liu | Tat-Seng Chua
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The curse of knowledge can impede communication between experts and laymen. We propose a new task of expertise style transfer and contribute a manually annotated dataset with the goal of alleviating such cognitive biases. Solving this task not only simplifies the professional language, but also improves the accuracy and expertise level of laymen descriptions using simple words. This is a challenging task, unaddressed in previous work, as it requires the models to have expert intelligence in order to modify text with a deep understanding of domain knowledge and structures. We establish the benchmark performance of five state-of-the-art models for style transfer and text simplification. The results demonstrate a significant gap between machine and human performance. We also discuss the challenges of automatic evaluation, to provide insights into future research directions. The dataset is publicly available at https://srhthu.github.io/expertise-style-transfer/.

Semantic Graphs for Generating Deep Questions
Liangming Pan | Yuxi Xie | Yansong Feng | Tat-Seng Chua | Min-Yen Kan
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper proposes the problem of Deep Question Generation (DQG), which aims to generate complex questions that require reasoning over multiple pieces of information about the input passage. In order to capture the global structure of the document and facilitate reasoning, we propose a novel framework that first constructs a semantic-level graph for the input document and then encodes the semantic graph by introducing an attention-based GGNN (Att-GGNN). Afterward, we fuse the document-level and graph-level representations to perform joint training of content selection and question decoding. On the HotpotQA deep-question centric dataset, our model greatly improves performance over questions requiring reasoning over multiple facts, leading to state-of-the-art performance. The code is publicly available at https://github.com/WING-NUS/SG-Deep-Question-Generation.

It’s Morphin’ Time! Combating Linguistic Discrimination with Inflectional Perturbations
Samson Tan | Shafiq Joty | Min-Yen Kan | Richard Socher
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Training on only perfect Standard English corpora predisposes pre-trained neural networks to discriminate against minorities from non-standard linguistic backgrounds (e.g., African American Vernacular English, Colloquial Singapore English, etc.). We perturb the inflectional morphology of words to craft plausible and semantically similar adversarial examples that expose these biases in popular NLP models, e.g., BERT and Transformer, and show that adversarially fine-tuning them for a single epoch significantly improves robustness without sacrificing performance on clean data.

Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding
Samson Tan | Shafiq Joty | Lav Varshney | Min-Yen Kan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training, and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
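The encoding idea described in this abstract (base form first, grammatical information reinjected as a special symbol) can be pictured with a toy sketch. This is not the authors' BITE implementation: the lookup table `INFLECTIONS`, the bracketed symbol format, and the function name `base_inflection_encode` are hypothetical, and a real system would presumably rely on a lemmatizer and part-of-speech tagger rather than a hand-written table.

```python
# Hypothetical table: inflected form -> (base form, inflection symbol).
# The symbols borrow Penn Treebank tag names purely for illustration.
INFLECTIONS = {
    "went": ("go", "[VBD]"),
    "goes": ("go", "[VBZ]"),
    "ran": ("run", "[VBD]"),
    "cats": ("cat", "[NNS]"),
}


def base_inflection_encode(tokens):
    """Replace each known inflected token with its base form plus a symbol."""
    encoded = []
    for token in tokens:
        base, symbol = INFLECTIONS.get(token.lower(), (token, None))
        encoded.append(base)
        if symbol is not None:
            encoded.append(symbol)  # grammatical information kept as its own token
    return encoded


encoded = base_inflection_encode("the cats went home".split())
print(encoded)  # ['the', 'cat', '[NNS]', 'go', '[VBD]', 'home']
```

Under this scheme, "went" and "goes" share the base token "go", so a model sees the same content word regardless of inflection, while the separate symbol still carries the tense or number.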

Re-examining the Role of Schema Linking in Text-to-SQL
Wenqiang Lei | Weixin Wang | Zhixin Ma | Tian Gan | Wei Lu | Min-Yen Kan | Tat-Seng Chua
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In existing sophisticated text-to-SQL models, schema linking is often considered as a simple, minor component, belying its importance. By providing a schema linking corpus based on the Spider text-to-SQL dataset, we systematically study the role of schema linking. We also build a simple BERT-based baseline, called Schema-Linking SQL (SLSQL) to perform a data-driven study. We find when schema linking is done well, SLSQL demonstrates good performance on Spider despite its structural simplicity. Many remaining errors are attributable to corpus noise. This suggests schema linking is the crux for the current text-to-SQL task. Our analytic studies provide insights on the characteristics of schema linking for future developments of text-to-SQL tasks.

SciWING– A Software Toolkit for Scientific Document Processing
Abhinav Ramesh Kashyap | Min-Yen Kan
Proceedings of the First Workshop on Scholarly Document Processing

We introduce SciWING, an open-source software toolkit which provides access to state-of-the-art pre-trained models for scientific document processing (SDP) tasks, such as citation string parsing, logical structure recovery and citation intent classification. Compared to other toolkits, SciWING follows a full neural pipeline and provides a Python interface for SDP. When needed, SciWING provides fine-grained control for rapid experimentation with different models by swapping and stacking different modules. Transfer learning from pre-trained transformers specific to general and scientific documents (e.g., BERT, SciBERT) can be performed. SciWING incorporates ready-to-use web and terminal-based applications and demonstrations to aid adoption and development. The toolkit is available from http://sciwing.io and the demos are available at http://rebrand.ly/sciwing-demo.

2019

Sentiment Aware Neural Machine Translation
Chenglei Si | Kui Wu | Ai Ti Aw | Min-Yen Kan
Proceedings of the 6th Workshop on Asian Translation

Sentiment ambiguous lexicons refer to words whose polarity depends strongly on context. As such, when the context is absent, their translations or their embedded sentence ends up (incorrectly) being dependent on the training data. While neural machine translation (NMT) has achieved great progress in recent years, most systems aim to produce one single correct translation for a given source sentence. We investigate the translation variation in two sentiment scenarios. We perform experiments to study the preservation of sentiment during translation with three different methods that we propose. We conducted tests with both sentiment and non-sentiment bearing contexts to examine the effectiveness of our methods. We show that NMT can generate both positive- and negative-valent translations of a source sentence, based on a given input sentiment label. Empirical evaluations show that our valence-sensitive embedding (VSE) method significantly outperforms a sequence-to-sequence (seq2seq) baseline, both in terms of BLEU score and ambiguous word translation accuracy in test, given non-sentiment bearing contexts.

Glocal: Incorporating Global Information in Local Convolution for Keyphrase Extraction
Animesh Prasad | Min-Yen Kan
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Graph Convolutional Networks (GCNs) are a class of spectral clustering techniques that leverage localized convolution filters to perform supervised classification directly on graphical structures. While such methods model nodes’ local pairwise importance, they lack the capability to model global importance relative to other nodes of the graph. This causes such models to miss critical information in tasks where global ranking is a key component for the task, such as in keyphrase extraction. We address this shortcoming by allowing the proper incorporation of global information into the GCN family of models through the use of scaled node weights. In the context of keyphrase extraction, incorporating global random walk scores obtained from TextRank boosts performance significantly. With our proposed method, we achieve state-of-the-art results, bettering a strong baseline by an absolute 2% increase in F1 score.

Predicting Helpful Posts in Open-Ended Discussion Forums: A Neural Architecture
Kishaloy Halder | Min-Yen Kan | Kazunari Sugiyama
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Users participate in online discussion forums to learn from others and share their knowledge with the community. They often start a thread with a question or by sharing their new findings on a certain topic. We find that, unlike Community Question Answering, where questions are mostly factoid based, the threads in a forum are often open-ended (e.g., asking for recommendations from others) without a single correct answer. In this paper, we address the task of identifying helpful posts in a forum thread to help users comprehend long running discussion threads, which often contain repetitive or irrelevant posts. We propose a recurrent neural network based architecture to model (i) the relevance of a post regarding the original post starting the thread and (ii) the novelty it brings to the discussion, compared to the previous posts in the thread. Experimental results on different types of online forum datasets show that our model significantly outperforms the state-of-the-art neural network models for text classification.

Dataset Mention Extraction and Classification
Animesh Prasad | Chenglei Si | Min-Yen Kan
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

Datasets are integral artifacts of empirical scientific research. However, due to natural language variation, their recognition can be difficult, and even when identified, they are often referred to inconsistently across and within publications. We report our approach to the Coleridge Initiative’s Rich Context Competition, which tasks participants with identifying dataset surface forms (dataset mention extraction) and associating the extracted mention to its referred dataset (dataset classification). In this work, we propose various neural baselines and evaluate these models on one-plus and zero-shot classification scenarios. We further explore various joint learning approaches, exploring the synergy between the tasks, and report the issues with such techniques.

2018

Identifying Emergent Research Trends by Key Authors and Phrases
Shenhao Jiang | Animesh Prasad | Min-Yen Kan | Kazunari Sugiyama
Proceedings of the 27th International Conference on Computational Linguistics

Identifying emergent research trends is a key issue for both primary researchers as well as secondary research managers. Such processes can uncover the historical development of an area, and yield insight on developing topics. We propose an embedded trend detection framework for this task which incorporates our bijunctive hypothesis that important phrases are written by important authors within a field and vice versa. By ranking both author and phrase information in a multigraph, our method jointly determines key phrases and authoritative authors. We represent this intermediate output as phrasal embeddings, and feed this to a recurrent neural network (RNN) to compute trend scores that identify research trends. Over two large datasets of scientific articles, we demonstrate that our approach successfully detects past trends from the field, outperforming baselines based solely on text centrality or citation.

Sequicity: Simplifying Task-oriented Dialogue Systems with Single Sequence-to-Sequence Architectures
Wenqiang Lei | Xisen Jin | Min-Yen Kan | Zhaochun Ren | Xiangnan He | Dawei Yin
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing solutions to task-oriented dialogue systems follow pipeline designs which introduce architectural complexity and fragility. We propose a novel, holistic, extendable framework based on a single sequence-to-sequence (seq2seq) model which can be optimized with supervised or reinforcement learning. A key contribution is that we design text spans named belief spans to track dialogue beliefs, allowing task-oriented dialogue systems to be modeled in a seq2seq way. Based on this, we propose a simplistic Two Stage CopyNet instantiation which demonstrates good scalability: significantly reducing model complexity in terms of number of parameters and training time by an order of magnitude. It significantly outperforms state-of-the-art pipeline-based methods on large datasets and retains a satisfactory entity match rate on out-of-vocabulary (OOV) cases where pipeline-designed competitors totally fail.

The ACL Anthology: Current State and Future Directions
Daniel Gildea | Min-Yen Kan | Nitin Madnani | Christoph Teichmann | Martín Villalba
Proceedings of Workshop for NLP Open Source Software (NLP-OSS)

The Association for Computational Linguistics’ Anthology is the open source archive, and the main source for computational linguistics and natural language processing’s scientific literature. The ACL Anthology is currently maintained exclusively by community volunteers and has to be available and up-to-date at all times. We first discuss the current, open source approach used to achieve this, and then discuss how the planned use of Docker images will improve the Anthology’s long-term stability. This change will make it easier for researchers to utilize Anthology data for experimentation. We believe the ACL community can directly benefit from the extension-friendly architecture of the Anthology. We end by issuing an open challenge of reviewer matching that we encourage the community to rally towards.

Countering Position Bias in Instructor Interventions in MOOC Discussion Forums
Muthu Kumar Chandrasekaran | Min-Yen Kan
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

We systematically confirm that instructors are strongly influenced by the user interface presentation of Massive Open Online Course (MOOC) discussion forums. In a large scale dataset, we conclusively show that instructor interventions exhibit strong position bias, as measured by the position where the thread appeared on the user interface at the time of intervention. We measure and remove this bias, enabling unbiased statistical modelling and evaluation. We show that our de-biased classifier improves predicting interventions over the state-of-the-art on courses with a sufficient number of interventions by 8.2% in F1 and 24.4% in recall on average.

Treatment Side Effect Prediction from Online User-generated Content
Van Hoang Nguyen | Kazunari Sugiyama | Min-Yen Kan | Kishaloy Halder
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

With Health 2.0, patients and caregivers increasingly seek information regarding possible drug side effects during their medical treatments in online health communities. These are helpful platforms for non-professional medical opinions, yet pose the risk of being unreliable in quality and insufficient in quantity to cover the wide range of potential drug reactions. Existing approaches which analyze such user-generated content in online forums rely heavily on feature engineering of both documents and users, and often overlook the relationships between posts within a common discussion thread. Inspired by recent advancements, we propose a neural architecture that models the textual content of user-generated documents and user experiences in online communities to predict side effects during treatment. Experimental results show that our proposed architecture outperforms baseline models.

2017

Modeling Temporal Progression of Emotional Status in Mental Health Forum: A Recurrent Neural Net Approach
Kishaloy Halder | Lahari Poddar | Min-Yen Kan
Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Patients turn to Online Health Communities not only for information on specific conditions but also for emotional support. Previous research has indicated that the progression of emotional status can be studied through the linguistic patterns of an individual’s posts. We analyze a real-world dataset from the Mental Health section of HealthBoards.com. Estimated from the word usages in their posts, we find that the emotional progress across patients varies widely. We study the problem of predicting a patient’s emotional status in the future from her past posts and we propose a Recurrent Neural Network (RNN) based architecture to address it. We find that the future emotional status can be predicted with reasonable accuracy given her historical posts and participation features. Our evaluation results demonstrate the efficacy of our proposed architecture, by outperforming state-of-the-art approaches with over 0.13 reduction in Mean Absolute Error.

WING-NUS at SemEval-2017 Task 10: Keyphrase Extraction and Classification as Joint Sequence Labeling
Animesh Prasad | Min-Yen Kan
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We describe an end-to-end pipeline processing approach for SemEval 2017’s Task 10 to extract keyphrases and their relations from scientific publications. We jointly identify and classify keyphrases by modeling the subtasks as sequential labeling. Our system utilizes standard, surface-level features along with the adjacent word features, and performs conditional decoding on whole text to extract keyphrases. We focus only on the identification and typing of keyphrases (Subtasks A and B, together referred to as extraction), but provide an end-to-end system inclusive of keyphrase relation identification (Subtask C) for completeness. Our top performing configuration achieves an F1 of 0.27 for the end-to-end keyphrase extraction and relation identification scenario on the final test data, and performs on par with other top ranked systems for keyphrase extraction. Our system outperforms other techniques that do not employ global decoding and hence do not account for dependencies between keyphrases. We believe this is crucial for keyphrase classification in the given context of scientific document mining.

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Regina Barzilay | Min-Yen Kan
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Regina Barzilay | Min-Yen Kan
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2016

pdf bib
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)
Guillaume Cabanac | Muthu Kumar Chandrasekaran | Ingo Frommholz | Kokil Jaidka | Min-Yen Kan | Philipp Mayr | Dietmar Wolfram
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

pdf bib
Overview of the CL-SciSumm 2016 Shared Task
Kokil Jaidka | Muthu Kumar Chandrasekaran | Sajal Rustagi | Min-Yen Kan
Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL)

pdf bib
A Comparison of Word Embeddings for English and Cross-Lingual Chinese Word Sense Disambiguation
Hong Jin Kang | Tao Chen | Muthu Kumar Chandrasekaran | Min-Yen Kan
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

Word embeddings are now ubiquitous forms of word representation in natural language processing. There have been applications of word embeddings for monolingual word sense disambiguation (WSD) in English, but few comparisons have been done. This paper attempts to bridge that gap by examining popular embeddings for the task of monolingual English WSD. Our simplified method achieves performance comparable to the state of the art without expensive retraining. Cross-Lingual WSD – where the word senses of a word in a source language come from a separate target translation language – can also assist in language learning; for example, when providing translations of target vocabulary for learners. Thus we have also applied word embeddings to the novel task of cross-lingual WSD for Chinese and provide a public dataset for further benchmarking. We have also experimented with using word embeddings for LSTM networks and found, surprisingly, that a basic LSTM network does not work well. We discuss the ramifications of this outcome.

2015

pdf bib
Keywords, phrases, clauses and sentences: topicality, indicativeness and informativeness at scales
Min-Yen Kan
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

pdf bib
Interactive Second Language Learning from News Websites
Tao Chen | Naijia Zheng | Yue Zhao | Muthu Kumar Chandrasekaran | Min-Yen Kan
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

2014

pdf bib
Exploiting Timelines to Enhance Multi-document Summarization
Jun-Ping Ng | Yan Chen | Min-Yen Kan | Zhoujun Li
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf bib
Exploiting Discourse Analysis for Article-Wide Temporal Classification
Jun-Ping Ng | Min-Yen Kan | Ziheng Lin | Wei Feng | Bin Chen | Jian Su | Chew-Lim Tan
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Mining Scientific Terms and their Definitions: A Study of the ACL Anthology
Yiping Jin | Min-Yen Kan | Jun-Ping Ng | Xiangnan He
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Chinese Informal Word Normalization: an Experimental Study
Aobo Wang | Min-Yen Kan | Daniel Andrade | Takashi Onishi | Kai Ishikawa
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Mining Informal Language from Chinese Microtext: Joint Word Recognition and Segmentation
Aobo Wang | Min-Yen Kan
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

pdf bib
Re-tweeting from a linguistic perspective
Aobo Wang | Tao Chen | Min-Yen Kan
Proceedings of the Second Workshop on Language in Social Media

pdf bib
Integrating User-Generated Content in the ACL Anthology
Praveen Bysani | Min-Yen Kan
Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

pdf bib
Exploiting Category-Specific Information for Multi-Document Summarization
Jun-Ping Ng | Praveen Bysani | Ziheng Lin | Min-Yen Kan | Chew-Lim Tan
Proceedings of COLING 2012

pdf bib
Improved Temporal Relation Classification using Dependency Parses and Selective Crowdsourced Annotations
Jun-Ping Ng | Min-Yen Kan
Proceedings of COLING 2012

pdf bib
Combining Coherence Models and Machine Translation Evaluation Metrics for Summarization Evaluation
Ziheng Lin | Chang Liu | Hwee Tou Ng | Min-Yen Kan
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2011

pdf bib
Automatically Evaluating Text Coherence Using Discourse Relations
Ziheng Lin | Hwee Tou Ng | Min-Yen Kan
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
SemEval-2010 Task 5: Automatic Keyphrase Extraction from Scientific Articles
Su Nam Kim | Olena Medelyan | Min-Yen Kan | Timothy Baldwin
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Extracting Formulaic and Free Text Clinical Research Articles Metadata using Conditional Random Fields
Sein Lin | Jun-Ping Ng | Shreyasee Pradhan | Jatin Shah | Ricardo Pietrobon | Min-Yen Kan
Proceedings of the NAACL HLT 2010 Second Louhi Workshop on Text and Data Mining of Health Documents

pdf bib
A Hybrid Morpheme-Word Representation for Machine Translation of Morphologically Rich Languages
Minh-Thang Luong | Preslav Nakov | Min-Yen Kan
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Evaluating N-gram based Evaluation Metrics for Automatic Keyphrase Extraction
Su Nam Kim | Timothy Baldwin | Min-Yen Kan
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Enhancing Morphological Alignment for Translating Highly Inflected Languages
Minh-Thang Luong | Min-Yen Kan
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Towards Automated Related Work Summarization
Cong Duy Vu Hoang | Min-Yen Kan
Coling 2010: Posters

2009

pdf bib
Topological Ordering of Function Words in Hierarchical Phrase-based Translation
Hendra Setiawan | Min-Yen Kan | Haizhou Li | Philip Resnik
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Recognizing Implicit Discourse Relations in the Penn Discourse Treebank
Ziheng Lin | Min-Yen Kan | Hwee Tou Ng
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Extracting Domain-Specific Words - A Statistical Approach
Su Nam Kim | Timothy Baldwin | Min-Yen Kan
Proceedings of the Australasian Language Technology Association Workshop 2009

pdf bib
Re-examining Automatic Keyphrase Extraction Approaches in Scientific Articles
Su Nam Kim | Min-Yen Kan
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

pdf bib
A re-examination of lexical association measures
Hung Huu Hoang | Su Nam Kim | Min-Yen Kan
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

pdf bib
Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL)
Min-Yen Kan | Simone Teufel
Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL)

pdf bib
FireCite: Lightweight real-time reference string extraction from webpages
Ching Hoi Andy Hong | Jesse Prabawa Gozali | Min-Yen Kan
Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries (NLPIR4DL)

2008

pdf bib
Modeling Context in Scenario Template Creation
Long Qiu | Min-Yen Kan | Tat-Seng Chua
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics
Steven Bird | Robert Dale | Bonnie Dorr | Bryan Gibson | Mark Joseph | Min-Yen Kan | Dongwon Lee | Brett Powley | Dragomir Radev | Yee Fan Tan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The ACL Anthology is a digital archive of conference and journal papers in natural language processing and computational linguistics. Its primary purpose is to serve as a reference repository of research results, but we believe that it can also be an object of study and a platform for research in its own right. We describe an enriched and standardized reference corpus derived from the ACL Anthology that can be used for research in scholarly document processing. This corpus, which we call the ACL Anthology Reference Corpus (ACL ARC), brings together the recent activities of a number of research groups around the world. Our goal is to make the corpus widely available, and to encourage other researchers to use it as a standard testbed for experiments in both bibliographic and bibliometric research.

pdf bib
ParsCit: an Open-source CRF Reference String Parsing Package
Isaac Councill | C. Lee Giles | Min-Yen Kan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We describe ParsCit, a freely available, open-source implementation of a reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model used to label the token sequences in the reference string. A heuristic model wraps this core with added functionality to identify reference strings from a plain text file, and to retrieve the citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We compare ParsCit on three distinct reference string datasets and show that it compares well with other previously published work.

2007

pdf bib
Timestamped Graphs: Evolutionary Models of Text for Multi-Document Summarization
Ziheng Lin | Min-Yen Kan
Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing

pdf bib
PSNUS: Web People Name Disambiguation by Simple Clustering with Rich Features
Ergin Elmacioglu | Yee Fan Tan | Su Yan | Min-Yen Kan | Dongwon Lee
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

pdf bib
Ordering Phrases with Function Words
Hendra Setiawan | Min-Yen Kan | Haizhou Li
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

pdf bib
Paraphrase Recognition via Dissimilarity Significance Classification
Long Qiu | Min-Yen Kan | Tat-Seng Chua
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

pdf bib
Extending corpus-based identification of light verb constructions using a supervised learning framework
Yee Fan Tan | Min-Yen Kan | Hang Cui
Proceedings of the Workshop on Multi-word-expressions in a multilingual context

2004

pdf bib
A Public Reference Implementation of the RAP Anaphora Resolution Algorithm
Long Qiu | Min-Yen Kan | Tat-Seng Chua
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2002

pdf bib
Using the Annotated Bibliography as a Resource for Indicative Summarization
Min-Yen Kan | Judith L. Klavans | Kathleen R. McKeown
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Corpus-trained Text Generation for Summarization
Min-Yen Kan | Kathleen R. McKeown
Proceedings of the International Natural Language Generation Conference

2001

pdf bib
Applying Natural Language Generation to Indicative Summarization
Min-Yen Kan | Kathleen R. McKeown | Judith L. Klavans
Proceedings of the ACL 2001 Eighth European Workshop on Natural Language Generation (EWNLG)

1998

pdf bib
Role of Verbs in Document Analysis
Judith Klavans | Min-Yen Kan
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

pdf bib
Role of Verbs in Document Analysis
Judith L. Klavans | Min-Yen Kan
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

pdf bib
Linear Segmentation and Segment Significance
Min-Yen Kan | Judith L. Klavans | Kathleen R. McKeown
Sixth Workshop on Very Large Corpora
