See Kiong Ng

Also published as: See-Kiong Ng


pdf bib
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification
Yang Xiao | Jinlan Fu | See-Kiong Ng | Pengfei Liu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we ask the research question of whether all the datasets in the benchmark are necessary. We approach this by first characterizing the distinguishability of datasets when comparing different systems. Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while those less used datasets exhibit impressive discriminative power. We further, taking the text classification task as a case study, investigate the possibility of predicting dataset discrimination based on its properties (e.g., average sentence length). Our preliminary experiments promisingly show that given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate dataset discrimination over unseen datasets. We released all datasets with features explored in this work on DataLab.


pdf bib
NUS-IDS at CASE 2021 Task 1: Improving Multilingual Event Sentence Coreference Identification With Linguistic Information
Fiona Anting Tan | Sujatha Das Gollapalli | See-Kiong Ng
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

Event Sentence Coreference Identification (ESCI) aims to cluster event sentences that refer to the same event together for information extraction. We describe our ESCI solution developed for the ACL-CASE 2021 shared tasks on the detection and classification of socio-political and crisis event information in a multilingual setting. For a given article, our proposed pipeline comprises of an accurate sentence pair classifier that identifies coreferent sentence pairs and subsequently uses these predicted probabilities to cluster sentences into groups. Sentence pair representations are constructed from fine-tuned BERT embeddings plus POS embeddings fed through a BiLSTM model, and combined with linguistic-based lexical and semantic similarities between sentences. Our best models ranked 2nd, 1st and 2nd and obtained CoNLL F1 scores of 81.20%, 93.03%, 83.15% for the English, Portuguese and Spanish test sets respectively in the ACL-CASE 2021 competition.

pdf bib
Causal Augmentation for Causal Sentence Classification
Fiona Anting Tan | Devamanyu Hazarika | See-Kiong Ng | Soujanya Poria | Roger Zimmermann
Proceedings of the First Workshop on Causal Inference and NLP

Scarcity of annotated causal texts leads to poor robustness when training state-of-the-art language models for causal sentence classification. In particular, we found that models misclassify on augmented sentences that have been negated or strengthened with respect to its causal meaning. This is worrying since minor linguistic differences in causal sentences can have disparate meanings. Therefore, we propose the generation of counterfactual causal sentences by creating contrast sets (Gardner et al., 2020) to be included during model training. We experimented on two model architectures and predicted on two out-of-domain corpora. While our strengthening schemes proved useful in improving model performance, for negation, regular edits were insufficient. Thus, we also introduce heuristics like shortening or multiplying root words of a sentence. By including a mixture of edits when training, we achieved performance improvements beyond the baseline across both models, and within and out of corpus’ domain, suggesting that our proposed augmentation can also help models generalize.

pdf bib
NUS-IDS at FinCausal 2021: Dependency Tree in Graph Neural Network for Better Cause-Effect Span Detection
Fiona Anting Tan | See-Kiong Ng
Proceedings of the 3rd Financial Narrative Processing Workshop

pdf bib
On Generating Fact-Infused Question Variations
Arthur Deschamps | Sujatha Das Gollapalli | See-Kiong Ng
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

To fully model human-like ability to ask questions, automatic question generation (QG) models must be able to produce multiple expressions of the same question with different levels of detail. Unfortunately, existing datasets available for learning QG do not include paraphrases or question variations affecting a model’s ability to learn this capability. We present FIRS, a dataset containing human-generated fact-infused rewrites of questions from the widely-used SQuAD dataset to address this limitation. Questions in FIRS were obtained by combining a given question with facts of entities referenced in the question. We study a double encoder-decoder model, Fact-Infused Question Generator (FIQG), for learning to generate fact-infused questions from a given question. Experimental results show that FIQG effectively incorporates information from facts to add more detail to a given question. To the best of our knowledge, ours is the first study to present fact-infusion as a novel form of question paraphrasing.

pdf bib
Suicide Risk Prediction by Tracking Self-Harm Aspects in Tweets: NUS-IDS at the CLPsych 2021 Shared Task
Sujatha Das Gollapalli | Guilherme Augusto Zagatti | See-Kiong Ng
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

We describe our system for identifying users at-risk for suicide based on their tweets developed for the CLPsych 2021 Shared Task. Based on research in mental health studies linking self-harm tendencies with suicide, in our system, we attempt to characterize self-harm aspects expressed in user tweets over a period of time. To this end, we design SHTM, a Self-Harm Topic Model that combines Latent Dirichlet Allocation with a self-harm dictionary for modeling daily tweets of users. Next, differences in moods and topics over time are captured as features to train a deep learning model for suicide prediction.


pdf bib
ESTeR: Combining Word Co-occurrences and Word Associations for Unsupervised Emotion Detection
Sujatha Das Gollapalli | Polina Rozenshtein | See-Kiong Ng
Findings of the Association for Computational Linguistics: EMNLP 2020

Accurate detection of emotions in user- generated text was shown to have several applications for e-commerce, public well-being, and disaster management. Currently, the state-of-the-art performance for emotion detection in text is obtained using complex, deep learning models trained on domain-specific, labeled data. In this paper, we propose ESTeR , an unsupervised model for identifying emotions using a novel similarity function based on random walks on graphs. Our model combines large-scale word co-occurrence information with word-associations from lexicons avoiding not only the dependence on labeled datasets, but also an explicit mapping of words to latent spaces used in emotion-enriched word embeddings. Our similarity function can also be computed efficiently. We study a range of datasets including recent tweets related to COVID-19 to illustrate the superior performance of our model and report insights on public emotions during the on-going pandemic.


pdf bib
Utilizing Temporal Information for Taxonomy Construction
Luu Anh Tuan | Siu Cheung Hui | See Kiong Ng
Transactions of the Association for Computational Linguistics, Volume 4

Taxonomies play an important role in many applications by organizing domain knowledge into a hierarchy of ‘is-a’ relations between terms. Previous work on automatic construction of taxonomies from text documents either ignored temporal information or used fixed time periods to discretize the time series of documents. In this paper, we propose a time-aware method to automatically construct and effectively maintain a taxonomy from a given series of documents preclustered for a domain of interest. The method extracts temporal information from the documents and uses a timestamp contribution function to score the temporal relevance of the evidence from source texts when identifying the taxonomic relations for constructing the taxonomy. Experimental results show that our proposed method outperforms the state-of-the-art methods by increasing F-measure up to 7%–20%. Furthermore, the proposed method can incrementally update the taxonomy by adding fresh relations from new data and removing outdated relations using an information decay function. It thus avoids rebuilding the whole taxonomy from scratch for every update and keeps the taxonomy effectively up-to-date in order to track the latest information trends in the rapidly evolving domain.

pdf bib
Learning Term Embeddings for Taxonomic Relation Identification Using Dynamic Weighting Neural Network
Anh Tuan Luu | Yi Tay | Siu Cheung Hui | See Kiong Ng
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing


pdf bib
Incorporating Trustiness and Collective Synonym/Contrastive Evidence into Taxonomy Construction
Anh Tuan Luu | Jung-jae Kim | See Kiong Ng
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing


pdf bib
Taxonomy Construction Using Syntactic Contextual Evidence
Anh Tuan Luu | Jung-jae Kim | See Kiong Ng
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)


pdf bib
Negative Training Data Can be Harmful to Text Classification
Xiao-Li Li | Bing Liu | See-Kiong Ng
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

pdf bib
Distributional Similarity vs. PU Learning for Entity Set Expansion
Xiao-Li Li | Lei Zhang | Bing Liu | See-Kiong Ng
Proceedings of the ACL 2010 Conference Short Papers


pdf bib
Probabilistic LR Parsing for General Context-Free Grammars
See-Kiong Ng | Masaru Tomita
Proceedings of the Second International Workshop on Parsing Technologies

To combine the advantages of probabilistic grammars and generalized LR parsing, an algorithm for constructing a probabilistic LR parser given a probabilistic context-free grammar is needed. In this paper, implementation issues in adapting Tomita’s generalized LR parser with graph-structured stack to perform probabilistic parsing are discussed. Wright and Wrigley (1989) has proposed a probabilistic LR-table construction algorithm for non-left-recursive context-free grammars. To account for left recursions, a method for computing item probabilities using the generation of systems of linear equations is presented. The notion of deferred probabilities is proposed as a means for dealing with similar item sets with differing probability assignments.