Chu-Ren Huang

Also published as: Chu-ren Huang


2026

Dense large language models (LLMs) face critical efficiency bottlenecks, as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) partially address this issue, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by the human brain’s dual-process mechanisms — predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex contexts — we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability.Our key insight is that LLM activations exhibit two complementary patterns: 1. Global Statistical Sparsity driven by sequence-level prefix information, and 2. Local Semantic Adaptability modulated by cognitive load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline derived from offline error-controlled optimization ensures over 40% sparsity, which is then dynamically adjusted using real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves 20% average speedup with less than 2% accuracy degradation, outperforming Griffin (over 5% degradation) and TT (negligible speedup).Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis (R2 = 0.17), revealing a sparsity–adaptation synergy. Requiring no retraining or architectural changes, CLADA provides a deployable solution for resource-aware LLM inference while advancing biologically inspired AI design.

2025

The research on classical Chinese event extraction trends to directly graft the complex modeling from English or modern Chinese works, neglecting the utilization of the unique characteristic of this language. We argue that, compared with grafting the sophisticated methods from other languages, focusing on classical Chinese’s inimitable source of __Ancient Literature__ could provide us with extra and comprehensive semantics in event extraction. Motivated by this, we propose a Literary Vision-Language Model (VLM) for classical Chinese event extraction, integrating with literature annotations, historical background and character glyph to capture the inner- and outer-context information from the sequence. Extensive experiments build a new state-of-the-art performance in the GuwenEE, CHED datasets, which underscores the effectiveness of our proposed VLM, and more importantly, these unique features can be obtained precisely at nearly zero cost.
Sequence labeling models often benefit from incorporating external knowledge. However, this practice introduces data heterogeneity and complicates the model with additional modules, leading to increased expenses for training a high-performing model. To address this challenge, we propose a dual-stage curriculum learning (DCL) framework specifically designed for sequence labeling tasks. The DCL framework enhances training by gradually introducing data instances from easy to hard. Additionally, we introduce a dynamic metric for evaluating the difficulty levels of sequence labeling tasks. Experiments on several sequence labeling datasets show that our model enhances performance and accelerates training, mitigating the slow training issue of complex models.
Large language models (LLMs) have demonstrated increasing proficiency in general-purpose translation, yet their effectiveness in creative domains such as game localization remains underexplored. This study focuses on the role of LLMs in game localization from both linguistic quality and sociocultural adequacy through a case study of the video game Black Myth: Wukong. Results indicate that LLMs demonstrate adequate competence in accuracy and fluency, achieving performance comparable to human translators. However, limitations remain in the literal translation of culture-specific terms and offensive language. Human oversight is required to ensure nuanced cultural authenticity and sensitivity. Insights from human evaluations also suggest that current automatic metrics and the Multidimensional Quality Metrics framework may be inadequate for evaluating creative translation. Finally, varying human preferences in localization pose a learning ambiguity for LLMs to perform optimal translation strategies. The findings highlight the potential and shortcomings of LLMs to serve as collaborative tools in game localization workflows. Data are available at https://github.com/zcocozz/wukong-localization.
As the training of large language models (LLMs) will encounter high computational costs, massive works are now focusing on inference. Their methods can be generally summarised as re-sampling the target multiple times and performing a vote upon the outputs. Despite bringing significant performance improvements, it is a high-cost method that requires multiple sampling with the preset size. In this paper, we propose a simple yet efficient inference strategies named __Hybrid Sampling__ that combining both multiple and single sampling to greatly reduce the cost of multiple sampling without sacrificing performance. __Hybrid Sampling__ could dynamically choose the essential part of generated sequence for multiple sampling and proceed the rest with single sampling, achieving a performance-cost balance. Extensive experiments in several benchmarks underscore the robustness and effectiveness of our proposed Hybrid Sampling and more importantly, it is much faster.
Recent research work on textual Aspect-Based Sentiment Analysis (ABSA) have achieved promising performance. However, a persistent challenge lies in the limited semantics derived from the raw data. To address this issue, researchers have explored enhancing textual ABSA with additional augmentations, they either craft audio, text and linguistic features based on the input, or rely on user-posted images. Yet these approaches have their limitations: the former three formations are heavily overlap with the original data, which undermines their ability to be supplementary while the user-posted images are extremely dependent on human annotation, which not only limits its application scope to just a handful of text-image datasets, but also propagates the errors derived from human mistakes to the entire downstream loop. In this study, we explore the way of generating the sentimental image that no one has ever ventured before. We propose a novel Sentimental Image Generation method that can precisely provide ancillary visual semantics to reinforce the textual extraction as shown in Figure 1. Extensive experiments build a new SOTA performance in ACOS, ASQP and en-Phone datasets, underscoring the effectiveness of our method and highlighting a promising direction for expanding our features.
"本文聚焦于“叙实性推理”任务,即判断语言中事件真实性的语义理解能力。该任务不依赖外部知识,而基于语言结构本身进行推理,对当前大语言模型(LLMs)提出挑战。为解决模型在叙实性漂移、多义词处理等方面的不足,作者提出一种结合RAG(检索增强生成)与谓词相似性的方法,构建了一个融合参数化与非参数化知识的叙实性检测智能体系统。该系统通过分步提示与知识库支持,实现了更高的一致性、准确性与可解释性,在评测任务中取得了0.9240的稳健表现。"
Word associations are commonly applied in psycholinguistics to investigate the nature and structure of the human mental lexicon, and at the same time an important data source for measuring the alignment of language models with human semantic representations.Taking this view, we compare the capacities of different language models to model collective human association norms via five word association tasks (WATs), with predictions about associations driven by either word vector similarities for traditional embedding models or prompting large language models (LLMs).Our results demonstrate that neither approach could produce human-like performances in all five WATs. Hence, none of them can successfully model the human mental lexicon yet. Our detailed analysis shows that static word-type embeddings and prompted LLMs have overall better alignment with human norms compared to word-token embeddings from pretrained models like BERT. Further analysis suggests that the performance discrepancies may be due to different model architectures, especially in terms of approximating human-like associative reasoning through either semantic similarity or relatedness evaluation. Our codes and data are publicly available at: https://github.com/florethsong/word_association.
With thousand years of history, calligraphy serve as one of the representative symbols of Chinese culture. Increasing works try to digitize calligraphy by recognizing the context of calligraphy for better preservation and propagation. However, previous works stick to isolated single character recognition, not only requires unpractical manual splitting into characters, but also abandon the enriched context information that could be supplementary. To this end, we construct the pioneering end-to-end calligraphy recognition benchmark dataset, this dataset is challenging due to both the visual variations such as different writing styles and the textual understanding such as the domain shift in semantics. We further propose CalligraphicOCR (COCR) equipped with calligraphic image augmentation and action-based corrector targeted at the challenging root of this setting. Experiments demonstrate the advantage of our proposed model over cutting-edge baselines, underscoring the necessity of introducing this new setting, thereby facilitating a solid precondition for protecting and propagating the already scarce resources.

2024

The identification of metaphor is a crucial prerequisite for many downstream language tasks, such as sentiment analysis, opinion mining, and textual entailment. State-of-the-art systems of metaphor detection implement heuristic principles such as Metaphor Identification Procedure (MIP) and Selection Preference Violation (SPV). We propose an innovative approach that leverages the cognitive information of embodiment that can be derived from word embeddings, and explicitly models the process of sensorimotor change that has been demonstrated as essential for human metaphor processing. We showed that this cognitively motivated module is effective and can improve metaphor detection, compared with the heuristic MIP that has been applied previously.
The prediction of lexical complexity in context is assuming an increasing relevance in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages.In our paper, we introduce CompLex-ZH, a dataset including words annotated with complexity scores in sentential contexts for Chinese. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language models-based features.
Traditional Ecological Knowledge (TEK) has been recognized as a shared cultural heritage and a crucial instrument to tackle today’s environmental challenges. In this paper, we deal with historical ecological knowledge, a special type of TEK that is based on ancient language texts. In particular, we aim to build a language resource based on Shanhai Jing (The Classic of Mountains and Seas). Written 2000 years ago, Shanhai Jing is a record of flora and fauna in ancient China, anchored by mountains (shan) and seas (hai). This study focuses on the entities in the Shan Jing part and builds a knowledge base for them. We adopt a pattern-driven and bottom-up strategy to accommodate two features of the source: highly stylized narrative and juxtaposition of knowledge from multiple domains. The PRF values of both entity and relationship extraction are above 96%. Quality assurance measures like entity disambiguation and resolution were done by domain experts. Neo4j graph database is used to visualize the result. We think the knowledge base, containing 1432 systematically classified entities and 3294 relationships, can provide the foundation for the construction of a historical ecological knowledge base of China. Additionally, the ruled-based text-matching method can be helpful in ancient language processing.
For a conversation to help and support, speakers should maintain an “effect-effort” tradeoff. As outlined in the gist of “Cognitive Relevance Principle”, helpful speakers should optimize the “cognitive relevance” through maximizing the “cognitive effects” and minimizing the “processing effort” imposed on listeners. Although preference learning methods have given rise a boon of studies in pursuit of“effect-optimization”, none have delved into the critical “effort-optimiazation” to fully cultivate the awareness of “optimal relevance” into thecognition of conversation agents. To address this gap, we integrate the “Cognitive Relevance Principle” into emotional support agents in the environment of multi-turn conversation. The results demonstrate a significant and robust improvement against the baseline systems with respect to response quality, human-likedness and supportivenss. This study offers compelling evidence for the effectiveness of the “Relevance Principle” in generating human-like, helpful, and harmless emotional support conversations. The source code will be available at https://github.com/CN-Eyetk/VLESA-ORL.git
This is the demonstration of systems and results of our team’s participation in the Social Medical Mining for Health (SMM4H) 2024 Shared Task. Our team participated in two tasks: Task 1 and Task 5. Task 5 requires the detection of tweet sentences that claim children’s medical disorders from certain users. Task 1 needs teams to extract and normalize Adverse Drug Event terms in the tweet sentence. The team selected several Pre-trained Language Models and generative Large Language Models to meet the requirements. Strategies to improve the performance include cloze test, prompt engineering, Low Rank Adaptation etc. The test result of our system has an F1 score of 0.935, Precision of 0.954 and Recall of 0.917 in Task 5 and an overall F1 score of 0.08 in Task 1.
As a complex task that requires rich information input, features from various aspects have been utilized in event extraction. However, most of the previous works ignored the value of glyph, which could contain enriched semantic information and can not be fully expressed by the pre-trained embedding in hieroglyphic languages like Chinese. We argue that, compared with combining the sophisticated textual features, glyphic information from visual modality could provide us with extra and straight semantic information in extracting events. Motivated by this, we propose a glyphic multi-modal Chinese event extraction model with hieroglyphic images to capture the intra- and inter-character morphological structure from the sequence. Extensive experiments build a new state-of-the-art performance in the ACE2005 Chinese and KBP Eval 2017 dataset, which underscores the effectiveness of our proposed glyphic event extraction model, and more importantly, the glyphic feature can be obtained at nearly zero cost.

2023

Recent studies suggested that language models are efficient tools for measuring lexical semantic change. In our paper, we present the compilation of the first graph-based evaluation dataset for lexical semantic change in the context of the Chinese language, specifically covering the periods of pre- and post- Reform and Opening Up. Exploiting the existing framework DURel, we collect over 61,000 human semantic relatedness judgments for 40 targets. The inferred word usage graphs and semantic change scores provide a basis for visualization and evaluation of semantic change.

2022

Inclusion, as one of the foundations in the diversity, equity, and inclusion initiative, concerns the degree of being treated as an ingroup member in a workplace. Despite of its importance in a corporate’s ecosystem, the inclusion strategies and its performance are not adequately addressed in corporate social responsibility (CSR) and CSR reporting. This study proposes a machine learning and big data-based model to examine inclusion through the use of stereotype content in actual language use. The distribution of the stereotype content in general corpora of a given society is utilized as a baseline, with which texts about corporate texts are compared. This study not only propose a model to identify and classify inclusion in language use, but also provides insights to measure and track progress by including inclusion in CSR reports as a strategy to build an inclusive corporate team.
We examine how Chinese and American oil companies use the gain- and loss-framed BUILDING source domain to legitimize their business in Corporate Social Responsibility (CSR) reports. Gain and loss frames can create legitimacy because they can ethically position an issue. We will focus on oil companies in China and the U.S. because different socio-cultural contexts in these two countries can potentially lead to different legitimation strategies in CSR reports, which can shed light on differences in Chinese and American CSR. All of the oil companies in our data are on the Fortune 500 list (2020). The results showed that Chinese oil companies used BUILDING metaphors more frequently than American oil companies. The most frequent keyword in Chinese CSRs “build” highlights environmental achievements in compliance with governments’ policies. American CSRs often used the metaphorical verb “support” to show their alignment with environmental policies and the interests of different stakeholders. The BUILDING source domain was used more often as gain frames in both Chinese and American CSR reports to show how oil companies create benefits for different stakeholders.
With the rising popularity of Transformer-based language models, several studies have tried to exploit their masked language modeling capabilities to automatically extract relational linguistic knowledge, although this kind of research has rarely investigated semantic relations in specialized domains. The present study aims at testing a general-domain and a domain-adapted Transformer models on two datasets of financial term-hypernym pairs using the prompt methodology. Our results show that the differences of prompts impact critically on models’ performance, and that domain adaptation on financial text generally improves the capacity of the models to associate the target terms with the right hypernyms, although the more successful models are those retaining a general-domain vocabulary.
Recent research has brought a wind of using computational approaches to the classic topic of semantic change, aiming to tackle one of the most challenging issues in the evolution of human language. While several methods for detecting semantic change have been proposed, such studies are limited to a few languages, where evaluation datasets are available. This paper presents the first dataset for evaluating Chinese semantic change in contexts preceding and following the Reform and Opening-up, covering a 50-year period in Modern Chinese. Following the DURel framework, we collected 6,000 human judgments for the dataset. We also reported the performance of alignment-based word embedding models on this evaluation dataset, achieving high and significant correlation scores.

2021

With the recent rise in popularity of Transformer models in Natural Language Processing, research efforts have been dedicated to the development of domain-adapted versions of BERT-like architectures. In this study, we focus on FinBERT, a Transformer model trained on text from the financial domain. By comparing its performances with the original BERT on a wide variety of financial text processing tasks, we found continual pretraining from the original model to be the more beneficial option. Domain-specific pretraining from scratch, conversely, seems to be less effective.
This paper presents the ROCLING 2021 shared task on dimensional sentiment analysis for educational texts which seeks to identify a real-value sentiment score of self-evaluation comments written by Chinese students in the both valence and arousal dimensions. Valence represents the degree of pleasant and unpleasant (or positive and negative) feelings, and arousal represents the degree of excitement and calm. Of the 7 teams registered for this shared task for two-dimensional sentiment analysis, 6 submitted results. We expected that this evaluation campaign could produce more advanced dimensional sentiment analysis techniques for the educational domain. All data sets with gold standards and scoring script are made publicly available to researchers.
We present Scikit-talk, an open-source toolkit for processing collections of real-world conversational speech in Python. First of its kind, the toolkit equips those interested in studying or modeling conversations with an easy-to-use interface to build and explore large collections of transcriptions and annotations of talk-in-interaction. Designed for applications in speech processing and Conversational AI, Scikit-talk provides tools to custom-build datasets for tasks such as intent prototyping, dialog flow testing, and conversation design. Its preprocessor module comes with several pre-built interfaces for common transcription formats, which aim to make working across multiple data sources more accessible. The explorer module provides a collection of tools to explore and analyse this data type via string matching and unsupervised machine learning techniques. Scikit-talk serves as a platform to collect and connect different transcription formats and representations of talk, enabling the user to quickly build multilingual datasets of varying detail and granularity. Thus, the toolkit aims to make working with authentic conversational speech data in Python more accessible and to provide the user with comprehensive options to work with representations of talk in appropriate detail for any downstream task. For the latest updates and information on currently supported languages and language resources, please refer to: https://pypi.org/project/scikit-talk/
Word embeddings are vectorial semantic representations built with either counting or predicting techniques aimed at capturing shades of meaning from word co-occurrences. Since their introduction, these representations have been criticized for lacking interpretable dimensions. This property of word embeddings limits our understanding of the semantic features they actually encode. Moreover, it contributes to the “black box” nature of the tasks in which they are used, since the reasons for word embedding performance often remain opaque to humans. In this contribution, we explore the semantic properties encoded in word embeddings by mapping them onto interpretable vectors, consisting of explicit and neurobiologically motivated semantic features (Binder et al. 2016). Our exploration takes into account different types of embeddings, including factorized count vectors and predict models (Skip-Gram, GloVe, etc.), as well as the most recent contextualized representations (i.e., ELMo and BERT). In our analysis, we first evaluate the quality of the mapping in a retrieval task, then we shed light on the semantic features that are better encoded in each embedding type. A large number of probing tasks is finally set to assess how the original and the mapped embeddings perform in discriminating semantic categories. For each probing task, we identify the most relevant semantic features and we show that there is a correlation between the embedding performance and how they encode those features. This study sets itself as a step forward in understanding which aspects of meaning are captured by vector spaces, by proposing a new and simple method to carve human-interpretable semantic representations from distributional vectors.
In this contribution, we describe the system presented by the PolyU CBS-Comp Team at the Task 1 of SemEval 2021, where the goal was the estimation of the complexity of words in a given sentence context. Our top system, based on a combination of lexical, syntactic, word embeddings and Transformers-derived features and on a Gradient Boosting Regressor, achieves a top correlation score of 0.754 on the subtask 1 for single words and 0.659 on the subtask 2 for multiword expressions.

2020

Collecting modality exclusivity norms for lexical items has recently become a common practice in psycholinguistics and cognitive research. However, these norms are available only for a relatively small number of languages and often involve a costly and time-consuming collection of ratings. In this work, we aim at learning a mapping between word embeddings and modality norms. Our experiments focused on crosslingual word embeddings, in order to predict modality association scores by training on a high-resource language and testing on a low-resource one. We ran two experiments, one in a monolingual and the other one in a crosslingual setting. Results show that modality prediction using off-the-shelf crosslingual embeddings indeed has moderate-to-high correlations with human ratings even when regression algorithms are trained on an English resource and tested on a completely unseen language.
While neural embeddings represent a popular choice for word representation in a wide variety of NLP tasks, their usage for thematic fit modeling has been limited, as they have been reported to lag behind syntax-based count models. In this paper, we propose a complete evaluation of count models and word embeddings on thematic fit estimation, by taking into account a larger number of parameters and verb roles and introducing also dependency-based embeddings in the comparison. Our results show a complex scenario, where a determinant factor for the performance seems to be the availability to the model of reliable syntactic information for building the distributional representations of the roles.
This paper reports a linguistically-enriched method of detecting token-level metaphors for the second shared task on Metaphor Detection. We participate in all four phases of competition with both datasets, i.e. Verbs and AllPOS on the VUA and the TOFEL datasets. We use the modality exclusivity and embodiment norms for constructing a conceptual representation of the nodes and the context. Our system obtains an F-score of 0.652 for the VUA Verbs track, which is 5% higher than the strong baselines. The experimental results across models and datasets indicate the salient contribution of using modality exclusivity and modality shift information for predicting metaphoricity.
Deep neural network models have played a critical role in sentiment analysis with promising results in the recent decade. One of the essential challenges, however, is how external sentiment knowledge can be effectively utilized. In this work, we propose a novel affection-driven approach to incorporating affective knowledge into neural network models. The affective knowledge is obtained in the form of a lexicon under the Affect Control Theory (ACT), which is represented by vectors of three-dimensional attributes in Evaluation, Potency, and Activity (EPA). The EPA vectors are mapped to an affective influence value and then integrated into Long Short-term Memory (LSTM) models to highlight affective terms. Experimental results show a consistent improvement of our approach over conventional LSTM models by 1.0% to 1.5% in accuracy on three large benchmark datasets. Evaluations across a variety of algorithms have also proven the effectiveness of leveraging affective terms for deep model enhancement.
Mandarin Alphabetical Word (MAW) is one indispensable component of Modern Chinese that demonstrates unique code-mixing idiosyncrasies influenced by language exchanges. Yet, this interesting phenomenon has not been properly addressed and is mostly excluded from the Chinese language system. This paper addresses the core problem of MAW identification and proposes to construct a large collection of MAWs from Sina Weibo (SMAW) using an automatic web-based technique which includes rule-based identification, informatics-based extraction, as well as Baidu search engine validation. A collection of 16,207 qualified SMAWs are obtained using this technique along with an annotated corpus of more than 200,000 sentences for linguistic research and applicable inquiries.
Automatic Chinese irony detection is a challenging task, and it has a strong impact on linguistic research. However, Chinese irony detection often lacks labeled benchmark datasets. In this paper, we introduce Ciron, the first Chinese benchmark dataset available for irony detection for machine learning models. Ciron includes more than 8.7K posts, collected from Weibo, a micro blogging platform. Most importantly, Ciron is collected with no pre-conditions to ensure a much wider coverage. Evaluation on seven different machine learning classifiers proves the usefulness of Ciron as an important resource for Chinese irony detection.
In linguistics and cognitive science, Logical metonymies are defined as type clashes between an event-selecting verb and an entity-denoting noun (e.g. The editor finished the article), which are typically interpreted by inferring a hidden event (e.g. reading) on the basis of contextual cues. This paper tackles the problem of logical metonymy interpretation, that is, the retrieval of the covert event via computational methods. We compare different types of models, including the probabilistic and the distributional ones previously introduced in the literature on the topic. For the first time, we also tested on this task some of the recent Transformer-based models, such as BERT, RoBERTa, XLNet, and GPT-2. Our results show a complex scenario, in which the best Transformer-based models and some traditional distributional models perform very similarly. However, the low performance on some of the testing datasets suggests that logical metonymy is still a challenging phenomenon for computational modeling.

2019

In this paper, we propose a new type of semantic representation of Construction Grammar that combines constructions with the vector representations used in Distributional Semantics. We introduce a new framework, Distributional Construction Grammar, where grammar and meaning are systematically modeled from language use, and finally, we discuss the kind of contributions that distributional models can provide to CxG representation from a linguistic and cognitive perspective.
In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

2018

In sentiment analysis (SA) of product reviews, both user and product information are proven to be useful. Current tasks handle user profile and product information in a unified model which may not be able to learn salient features of users and products effectively. In this work, we propose a dual user and product memory network (DUPMN) model to learn user profiles and product reviews using separate memory networks. Then, the two representations are used jointly for sentiment prediction. The use of separate models aims to capture user profiles and product information more effectively. Compared to state-of-the-art unified prediction models, the evaluations on three benchmark datasets, IMDB, Yelp13, and Yelp14, show that our dual learning model gives performance gain of 0.6%, 1.2%, and 0.9%, respectively. The improvements are also deemed very significant measured by p-values.

2017

Automatic fake news detection is an important, yet very challenging topic. Traditional methods using lexical features have only very limited success. This paper proposes a novel method to incorporate speaker profiles into an attention based LSTM model for fake news detection. Speaker profiles contribute to the model in two ways. One is to include them in the attention model. The other includes them as additional input data. By adding speaker profiles such as party affiliation, speaker title, location and credit history, our model outperforms the state-of-the-art method by 14.5% in accuracy using a benchmark fake news detection dataset. This proves that speaker profiles provide valuable information to validate the credibility of news articles.
Metaphor detection has been both challenging and rewarding in natural language processing applications. This study offers a new approach based on eventive information in detecting metaphors by leveraging the Chinese writing system, which is a culturally bound ontological system organized according to the basic concepts represented by radicals. As such, the information represented is available in all Chinese text without pre-processing. Since metaphor detection is another culturally based conceptual representation, we hypothesize that sub-textual information can facilitate the identification and classification of the types of metaphoric events denoted in Chinese text. We propose a set of syntactic conditions crucial to event structures to improve the model based on the classification of radical groups. With the proposed syntactic conditions, the model achieves a performance of 0.8859 in terms of F-scores, making 1.7% of improvement than the same classifier with only Bag-of-word features. Results show that eventive information can improve the effectiveness of metaphor detection. Event information is rooted in every language, and thus this approach has a high potential to be applied to metaphor detection in other languages.
Attention models are proposed in sentiment analysis because some words are more important than others. However,most existing methods either use local context based text information or user preference information. In this work, we propose a novel attention model trained by cognition grounded eye-tracking data. A reading prediction model is first built using eye-tracking data as dependent data and other features in the context as independent data. The predicted reading time is then used to build a cognition based attention (CBA) layer for neural sentiment analysis. As a comprehensive model, We can capture attentions of words in sentences as well as sentences in documents. Different attention mechanisms can also be incorporated to capture other aspects of attentions. Evaluations show the CBA based method outperforms the state-of-the-art local context based attention methods significantly. This brings insight to how cognition grounded data can be brought into NLP tasks.

2016

In this paper, we claim that Vector Cosine ― which is generally considered one of the most efficient unsupervised measures for identifying word similarity in Vector Space Models ― can be outperformed by a completely unsupervised measure that evaluates the extent of the intersection among the most associated contexts of two target words, weighting such intersection according to the rank of the shared contexts in the dependency ranked lists. This claim comes from the hypothesis that similar words do not simply occur in similar contexts, but they share a larger portion of their most relevant contexts compared to other related words. To prove it, we describe and evaluate APSyn, a variant of Average Precision that ― independently of the adopted parameters ― outperforms the Vector Cosine and the co-occurrence on the ESL and TOEFL test sets. In the best setting, APSyn reaches 0.73 accuracy on the ESL dataset and 0.70 accuracy in the TOEFL dataset, beating therefore the non-English US college applicants (whose average, as reported in the literature, is 64.50%) and several state-of-the-art approaches.
Many NLP tasks involve sentence-level annotation yet the relevant information is not encoded at sentence level but at some relevant parts of the sentence. Such tasks include but are not limited to: sentiment expression annotation, product feature annotation, and template annotation for Q&A systems. However, annotation of the full corpus sentence by sentence is resource intensive. In this paper, we propose an approach that iteratively extracts frequent parts of sentences for annotating, and compresses the set of sentences after each round of annotation. Our approach can also be used in preparing training sentences for binary classification (domain-related vs. noise, subjectivity vs. objectivity, etc.), assuming that sentence-type annotation can be predicted by annotation of the most relevant sub-sentences. Two experiments are performed to test our proposal and evaluated in terms of time saved and agreement of annotation.
Synaesthesia is a type of metaphor associating linguistic expressions that refer to two different sensory modalities. Previous studies, based on the analysis of poetic texts, have shown that synaesthetic transfers tend to go from the lower toward the higher senses (e.g., sweet music vs. musical sweetness). In non-literary language synaesthesia is rare, and finding a sufficient number of examples manually would be too time-consuming. In order to verify whether the directionality also holds for conventional synaesthesia found in non-literary texts, an automatic procedure for the identification of instances of synaesthesia is therefore highly desirable. In this paper, we first focus on the preliminary step of this procedure, that is, the creation of a controlled lexicon of perception. Next, we present the results of a small pilot study that applies the extraction procedure to English and Italian corpus data.
ROOT9 is a supervised system for the classification of hypernyms, co-hyponyms and random words that is derived from the already introduced ROOT13 (Santus et al., 2016). It relies on a Random Forest algorithm and nine unsupervised corpus-based features. We evaluate it with a 10-fold cross validation on 9,600 pairs, equally distributed among the three classes and involving several Parts-Of-Speech (i.e. adjectives, nouns and verbs). When all the classes are present, ROOT9 achieves an F1 score of 90.7%, against a baseline of 57.2% (vector cosine). When the classification is binary, ROOT9 achieves the following results against the baseline. hypernyms-co-hyponyms 95.7% vs. 69.8%, hypernyms-random 91.8% vs. 64.1% and co-hyponyms-random 97.8% vs. 79.4%. In order to compare the performance with the state-of-the-art, we have also evaluated ROOT9 in subsets of the Weeds et al. (2014) datasets, proving that it is in fact competitive. Finally, we investigated whether the system learns the semantic relation or it simply learns the prototypical hypernyms, as claimed by Levy et al. (2015). The second possibility seems to be the most likely, even though ROOT9 can be trained on negative examples (i.e., switched hypernyms) to drastically reduce this bias.
In the design of controlled experiments with language stimuli, researchers from psycholinguistic, neurolinguistic, and related fields, require language resources that isolate variables known to affect language processing. This article describes a freely available database that provides word level statistics for words and nonwords of Mandarin, Chinese. The featured lexical statistics include subtitle corpus frequency, phonological neighborhood density, neighborhood frequency, and homophone density. The accompanying word descriptors include pinyin, ascii phonetic transcription (sampa), lexical tone, syllable structure, dominant PoS, and syllable, segment and pinyin lengths for each phonological word. It is designed for researchers particularly concerned with language processing of isolated words and made to accommodate multiple existing hypotheses concerning the structure of the Mandarin syllable. The database is divided into multiple files according to the desired search criteria: 1) the syllable segmentation schema used to calculate density measures, and 2) whether the search is for words or nonwords. The database is open to the research community at https://github.com/karlneergaard/Mandarin-Neighborhood-Statistics.
Distributional semantic models (DSMs) are currently being used in the measurement of word relatedness and word similarity. One shortcoming of DSMs is that they do not provide a principled way to discriminate different semantic relations. Several approaches have been adopted that rely on annotated data either in the training of the model or later in its evaluation. In this paper, we introduce a dataset for training and evaluating DSMs on semantic relations discrimination between words, in Mandarin, Chinese. The construction of the dataset followed EVALution 1.0, which is an English dataset for the training and evaluating of DSMs. The dataset contains 360 relation pairs, distributed in five different semantic relations, including antonymy, synonymy, hypernymy, meronymy and nearsynonymy. All relation pairs were checked manually to estimate their quality. In the 360 word relation pairs, there are 373 relata. They were all extracted and subsequently manually tagged according to their semantic type. The relatas’ frequency was calculated in a combined corpus of Sinica and Chinese Gigaword. To the best of our knowledge, EVALution-MAN is the first of its kind for Mandarin, Chinese.

2015

2014

This paper presents the development of a Chinese event-based emotion corpus. It specifically describes the corpus design, collection and annotation. The proposed annotation scheme provides a consistent way of identifying some emotion-associated events (namely pre-events and post-events). Corpus data show that there are significant interactions between emotions and pre-events as well as that of between emotion and post-events. We believe that emotion as a pivot event underlies an innovative approach towards a linguistic model of emotion as well as automatic emotion detection and classification.

2013

2012

We adopt the corpus-informed approach to example sentence selections for the construction of a reference grammar. In the process, a database containing sentences that are carefully selected by linguistic experts including the full range of linguistic facts covered in an authoritative Chinese Reference Grammar is constructed and structured according to the reference grammar. A search engine system is developed to facilitate the process of finding the most typical examples the users need to study a linguistic problem or prove their hypotheses. The database can also be used as a training corpus by computational linguists to train models for Chinese word segmentation, POS tagging and sentence parsing.

2011

2010

Emotion processing has always been a great challenge. Given the fact that an emotion is triggered by cause events and that cause events are an integral part of emotion, this paper constructs a Chinese emotion cause corpus as a first step towards automatic inference of cause-emotion correlation. The corpus focuses on five primary emotions, namely happiness, sadness, fear, anger, and surprise. It is annotated with emotion cause events based on our proposed annotation scheme. Corpus data shows that most emotions are expressed with causes, and that causes mostly occur before the corresponding emotion verbs. We also examine the correlations between emotions and cause events in terms of linguistic cues: causative verbs, perception verbs, epistemic markers, conjunctions, prepositions, and others. Results show that each group of linguistic cues serves as an indicator marking the cause events in different structures of emotional constructions. We believe that the emotion cause corpus will be the useful resource for automatic emotion cause detection as well as emotion detection and classification.
Automatic acquisition of novel compounds is notoriously difficult because most novel compounds have relatively low frequency in a corpus. The current study proposes a new method to deal with the novel compound acquisition challenge. We model this task as a two-class classification problem in which a candidate compound is either classified as a compound or a non-compound. A machine learning method using SVM, incorporating two types of linguistically motivated features: semantic features and character features, is applied to identify rare but valid noun compounds. We explore two kinds of training data: one is virtual training data which is obtained by three statistical scores, i.e. co-occurrence frequency, mutual information and dependent ratio, from the frequent compounds; the other is real training data which is randomly selected from the infrequent compounds. We conduct comparative experiments, and the experimental results show that even with limited direct evidence in the corpus for the novel compounds, we can make full use of the typical frequent compounds to help in the discovery of the novel compounds.

2009

2008

The measurement of conceptual similarity in a hierarchical structure has been proposed by studies such as Wu and Palmer (1994) which have been summarized and evaluated in Budanisky and Hirst (2006). The present study applies the measurement of conceptual similarity to conceptual metaphor research by comparing concreteness of ontological resource nodes to several prototypical concrete nodes selected by human subjects. Here, the purpose of comparing conceptual similarity between nodes is to select a concrete sense for a word which is used metaphorically. Through using WordNet-SUMO interface such as SinicaBow (Huang, Chang and Lee, 2004), concrete senses of a lexicon will be selected once its SUMO nodes have been compared in terms of conceptual similarity with the prototypical concrete nodes. This study has strong implications for the interaction of psycholinguistic and computational linguistic fields in conceptual metaphor research.
We propose a set of heuristics for improving annotation quality of very large corpora efficiently. The Xinhua News portion of the Chinese Gigaword Corpus was tagged independently with both the Peking University ICL tagset and the Academia Sinica CKIP tagset. The corpus-based POS tags mapping will serve as the basis of the possible contrast in grammatical systems between PRC and Taiwan. And it can serve as the basic model for mapping between the CKIP and ICL tagging systems for any data.
We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (“Kybots”). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.
Chinese writing system is not only used by Chinese but also used by Japanese. The motivation of this paper is to extend the architecture of Hantology which describes the features of Chinese writing system to integrate Japan Kanji and Chinese characters into the same ontology. The problem is Chinese characters adopted by Japan have been changed, thus, the modification of the original architecture of Hantology is needed. A extended architecture consists orthographic, pronunciation, sense and derived lexicon dimensions. is proposed in this paper. The contribution of this study is that the extension architecture of Hantology provides a platform to analyze the variation of Chinese characters used in Japan. The analytic results of variation for a specific Kanji can be integrated into Hantology, so it is easier to study the variation of Chinese characters systematically
Corpus-based approaches and statistical approaches have been the main stream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but there is an insufficient amount of language resources in many Asian languages. In this situation, standardisation of language resources would be of great help in developing resources in new languages. This paper presents the latest development efforts of our project which aims at creating a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed for ensuring multilingual interoperability and iii) the evaluation platform used to test the entire architectural framework.

2007

2006

Tagging as the most crucial annotation of language resources can still be challenging when the corpus size is big and when the corpus data is not homogeneous. The Chinese Gigaword Corpus is confounded by both challenges. The corpus containsroughly 1.12 billion Chinese characters from two heterogeneous sources: respective news in Taiwan and in Mainland China. In other words, in addition to its size, the data also contains two variants of Chinese that are known to exhibit substantial linguistic differences. We utilize Chinese Sketch Engine as the corpus query tool, by which grammar behaviours of the two heterogeneous resources could be captured and displayed in a unified web interface. In this paper, we report our answer to the two challenges to effectively tag this large-scale corpus. The evaluation result shows our mechanism of tagging maintains high annotation quality.
Hantology, a character-based Chinese language resource is created to provide an infrastructure for language processing and research on the writing system. Unlike alphabetic or syllabic writing systems, the ideographic writing system of Chinese poses both a challenge and an opportunity. The challenge is that a totally different resources structure must be created to represent and process speaker’s conventionalization of the language. The rare opportunity is that the structure itself is enriched with conceptual classification and can be utilized for ontology building. We describe the contents and possible applications of Hantology in this paper. The applications of Hantology include: (1) an account for the diachronic development of Chinese lexica (2) character-based language processing, (3) a study of conceptual structure differences in Chinese and English, and (4) comparisons of different ideographic writing systems.

2005

2004

2003

2002

2001

2000

1999

1998

1997

1996

1995

1994

1993

1992

1991

1990

1989

1988

Search
Co-authors
Fix author