Emmanuele Chersoni - ACL Anthology

Emmanuele Chersoni

2026

Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs
Yiheng Yang | Yujie Wang | Chi Ma | Lei Yu | Emmanuele Chersoni | Chu-Ren Huang
Findings of the Association for Computational Linguistics: EACL 2026

Dense large language models (LLMs) face critical efficiency bottlenecks, as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) partially address this issue, they either lack adaptivity to contextual or model structural demands or incur prohibitive computational overhead. Inspired by the human brain’s dual-process mechanisms — predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex contexts — we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability.Our key insight is that LLM activations exhibit two complementary patterns: 1. Global Statistical Sparsity driven by sequence-level prefix information, and 2. Local Semantic Adaptability modulated by cognitive load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline derived from offline error-controlled optimization ensures over 40% sparsity, which is then dynamically adjusted using real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves 20% average speedup with less than 2% accuracy degradation, outperforming Griffin (over 5% degradation) and TT (negligible speedup).Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis (R² = 0.17), revealing a sparsity–adaptation synergy. Requiring no retraining or architectural changes, CLADA provides a deployable solution for resource-aware LLM inference while advancing biologically inspired AI design.

Large Language Models Put to the Test on Chinese Noun Compounds: Experiments on Natural Language Inference and Compound Semantics
Le Qiu | Emmanuele Chersoni | He Zhou | Yu-Yin Hsu
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)

Noun compounds are generally considered an open challenge for NLP systems, given to the difficulty of interpreting the implicit semantic relation between modifier and head, although the advent of Large Language Models (LLMs) recently led to remarkable performance leaps. However, most evaluations have been carried out on English benchmarks.In our work, we test LLMs on compound semantics understanding in Chinese, adopting two different evaluation scenarios: an extrinsic evaluation in a Natural Language Inference task, and an intrinsic evaluation in which models are directly asked to predict the semantic relation linking the two constituents.Our results show that the bigger and more recent LLMs are able to surpass supervised baselines in the inference task, especially when tested under the few-shot setting. In the more challenging task of selecting the correct interpretation of the compounds out of a fine-grained typology of semantic relations between head and modifier, the best Chinese LLM (Qwen-plus) manages to select the correct option in about one third of the cases.

LST at MWE-2026 AdMIRe 2: Advancing Multimodal Idiomaticity Representation
Le Qiu | Yu-Yin Hsu | Emmanuele Chersoni
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)

This paper presents our methods for the AdMIRe 2.0 shared task, which addresses multilingual and multimodal idiom understanding. Our submission focuses on the text-only track. Specifically, we employ an ensemble of three large language models (LLMs) to directly perform the presented image ranking task. Each model independently produces a ranking of the candidate images, and we aggregate their outputs using a hard voting strategy to determine the final prediction. This ensemble learning framework leverages the complementary strengths of different LLMs, improving robustness and reducing the variance of individual model predictions.

2025

Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation
Chu-Ren Huang | Yasunari Harada | Jong-Bok Kim | Nguyen T.M. Huyen | Le Thanh Huong | Pham Hien | Emmanuele Chersoni | Le Minh Nguyen | Rachel Edita Oñate Roxas | Sherly Dita
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata | Frederikus Hudi | Patrick Amadeus Irawan | David Anugraha | Rifki Afina Putri | Wang Yutong | Adam Nohejl | Ubaidillah Ariq Prathama | Nedjma Ousidhoum | Afifa Amriani | Anar Rzayev | Anirban Das | Ashmari Pramodya | Aulia Adila | Bryan Wilie | Candy Olivia Mawalim | Cheng Ching Lam | Daud Abolade | Emmanuele Chersoni | Enrico Santus | Fariz Ikhwantri | Garry Kuwanto | Hanyang Zhao | Haryo Akbarianto Wibowo | Holy Lovenia | Jan Christian Blaise Cruz | Jan Wira Gotama Putra | Junho Myung | Lucky Susanto | Maria Angelica Riera Machin | Marina Zhukova | Michael Anugraha | Muhammad Farid Adilazuarda | Natasha Christabelle Santosa | Peerat Limkonchotiwat | Raj Dabre | Rio Alexander Audino | Samuel Cahyawijaya | Shi-Xiong Zhang | Stephanie Yulia Salim | Yi Zhou | Yinxuan Gui | David Ifeoluwa Adelani | En-Shiun Annie Lee | Shogo Okada | Ayu Purwarianti | Alham Fikri Aji | Taro Watanabe | Derry Tanti Wijaya | Alice Oh | Chong-Wah Ngo
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.

Foodies also Need Their Own GPTs: Evaluating Language Models in Visual Question Answering on the WorldCuisines Dataset
Genta Indra Winata | Emmanuele Chersoni
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation

Reasoning or Memorization? Investigating LLMs’ Capability in Restoring Chinese Internet Homophones
Jianfei Ma | Zhaoxin Feng | Huacheng Song | Emmanuele Chersoni | Zheng Chen
Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)

Chinese homophones, prevalent in Internet culture, bring rich linguistic twists that are challenging for language models. While native speakers disambiguate them through phonological reasoning and contextual understanding, it remains untested how well LLMs perform on this task and whether LLMs also achieve this via similar reasoning processes or merely through memorization of homophone-original word pairs during training.In this paper, we present HomoP-CN, the first Chinese Internet homophones dataset with systematic perturbations for evaluating LLMs’ homophone restoration capabilities. Using this benchmark, we investigated the influence of semantic, phonological, and graphemic features on LLMs’ restoration accuracy, measured the reliance levels of each model on memorization during restoration through consistency ratios under controlled perturbations, and assessed the effectiveness of various prompting strategies, including contextual cues, pinyin augmentation, few-shot learning, and thought-chain approaches.

StockGenChaR: A Study on the Evaluation of Large Vision-Language Models on Stock Chart Captioning
Le Qiu | Emmanuele Chersoni
Proceedings of The 10th Workshop on Financial Technology and Natural Language Processing

Can LLMs Help Sun Wukong in his Journey to the West? A Case Study of Language Models in Video Game Localization
Xiaojing Zhao | Han Xu | Huacheng Song | Emmanuele Chersoni | Chu-Ren Huang
Proceedings of the First Workshop on Natural Language Processing and Language Models for Digital Humanities

Large language models (LLMs) have demonstrated increasing proficiency in general-purpose translation, yet their effectiveness in creative domains such as game localization remains underexplored. This study focuses on the role of LLMs in game localization from both linguistic quality and sociocultural adequacy through a case study of the video game Black Myth: Wukong. Results indicate that LLMs demonstrate adequate competence in accuracy and fluency, achieving performance comparable to human translators. However, limitations remain in the literal translation of culture-specific terms and offensive language. Human oversight is required to ensure nuanced cultural authenticity and sensitivity. Insights from human evaluations also suggest that current automatic metrics and the Multidimensional Quality Metrics framework may be inadequate for evaluating creative translation. Finally, varying human preferences in localization pose a learning ambiguity for LLMs to perform optimal translation strategies. The findings highlight the potential and shortcomings of LLMs to serve as collaborative tools in game localization workflows. Data are available at https://github.com/zcocozz/wukong-localization.

Which Model Mimics Human Mental Lexicon Better? A Comparative Study of Word Embedding and Generative Models
Huacheng Song | Zhaoxin Feng | Emmanuele Chersoni | Chu-Ren Huang
Proceedings of the 16th International Conference on Computational Semantics

Word associations are commonly applied in psycholinguistics to investigate the nature and structure of the human mental lexicon, and at the same time an important data source for measuring the alignment of language models with human semantic representations.Taking this view, we compare the capacities of different language models to model collective human association norms via five word association tasks (WATs), with predictions about associations driven by either word vector similarities for traditional embedding models or prompting large language models (LLMs).Our results demonstrate that neither approach could produce human-like performances in all five WATs. Hence, none of them can successfully model the human mental lexicon yet. Our detailed analysis shows that static word-type embeddings and prompted LLMs have overall better alignment with human norms compared to word-token embeddings from pretrained models like BERT. Further analysis suggests that the performance discrepancies may be due to different model architectures, especially in terms of approximating human-like associative reasoning through either semantic similarity or relatedness evaluation. Our codes and data are publicly available at: https://github.com/florethsong/word_association.

PhonoThink: Improving Large Language Models’ Reasoning on Chinese Phonological Ambiguities
Jianfei Ma | Zhaoxin Feng | Emmanuele Chersoni | Huacheng Song | Ziqi Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Effectively resolving phonological ambiguities is crucial for robust natural language processing, as these ambiguities are pervasive in tasks ranging from speech-to-text, spelling correction, to offensive language detection. However, current Large Language Models (LLMs) frequently struggle to resolve such ambiguities.To address this challenge, we present a framework to enhances LLMs’ phonological capability through a multiple-stage training approach. Our method begins with supervised fine-tuning on well-constructed datasets, including three subtask datasets designed to enhance the model’s foundational phonological knowledge, along with a synthetic dataset of step-by-step reasoning chains. Following this, we apply reinforcement learning to incentivize and stabilize its reasoning.Results show that our framework enables the base model to achieve relatively comparable performance to a much larger model. Our ablation studies reveal that subtask datasets and the synthetic dataset can simultaneously impact as complementary modular enhancers to strengthen LLMs’ integrated application.

ChengyuSTS: An Intrinsic Perspective on Mandarin Idiom Representation
Le Qiu | Emmanuele Chersoni | Aline Villavicencio
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)

Chengyu, or four-character idioms, are ubiquitous in both spoken and written Chinese. Despite their importance, chengyu are often underexplored in NLP tasks, and existing evaluation frameworks are primarily based on extrinsic measures. In this paper, we introduce an intrinsic evaluation task for Chinese idiomatic understanding: idiomatic semantic textual similarity (iSTS), which evaluates how well models can capture the semantic similarity of sentences containing idioms. To this purpose, we present a curated dataset: ChengyuSTS. Our experiments show that current pre-trained sentence Transformer models generally fail to capture the idiomaticity of chengyu in a zero-shot setting. We then show results of fine-tuned models using the SimCSE contrastive learning framework, which demonstrate promising results for handling idiomatic expressions. We also presented the results of DeepSeek for reference

ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models
Martina Miliani | Serena Auriemma | Alessandro Bondielli | Emmanuele Chersoni | Lucia Passaro | Irene Sucameli | Alessandro Lenci
Findings of the Association for Computational Linguistics: ACL 2025

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.

Branching Out: Exploration of Chinese Dependency Parsing with Fine-tuned Large Language Models
He Zhou | Emmanuele Chersoni | Yu-Yin Hsu
Proceedings of the 15th International Conference on Recent Advances in Natural Language Processing - Natural Language Processing in the Generative AI Era

In this paper, we investigate the effectiveness of large language models (LLMs) for Chinese dependency parsing through fine-tuning. We explore how different dependency representations impact parsing performance when fine-tuning the Chinese Llama-3 model. Our results demonstrate that while the Stanford typed dependency tuple representation yields the highest number of valid dependency trees, converting dependency structure into a lexical centered tree produces parses of significantly higher quality despite generating fewer valid structures. The results further show that fine-tuning enhances LLMs’ capability to handle longer dependencies to some extent, though challenges remain. Additionally, we evaluate the effectiveness of DeepSeek in correcting LLM-generated dependency structures, finding that it is effective for fixing index errors and cyclicity issues but still suffers from tokenization mismatches. Our analysis across dependency distances and relations reveals that fine-tuned LLMs outperform traditional parsers in specific syntactic structures while struggling with others. These findings contribute to the research on leveraging LLMs for syntactic analysis tasks.

From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models
Ziqi Zhang | Jianfei Ma | Emmanuele Chersoni | You Jieshun | Zhaoxin Feng
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet, whether the most popular Large Language Models (LLMs) possess proper knowledge the Chinese classifiers is an issue that has largely remain unexplored in the Natural Language Processing (NLP) literature.To address such a question, we employ various masking strategies to evaluate the LLMs’ intrinsic ability, the contribution of different sentence elements, and the working of the attention mechanisms during prediction. Besides, we explore fine-tuning for LLMs to enhance the classifier performance.Our findings reveal that LLMs perform worse than BERT, even with fine-tuning. The prediction, as expected, greatly benefits from the information about the following noun, which also explains the advantage of models with a bidirectional attention mechanism such as BERT.

Compositionality and Event Retrieval in Complement Coercion: A Study of Language Models in a Low-resource Setting
Matteo Radaelli | Emmanuele Chersoni | Alessandro Lenci | Giosuè Baggio
Proceedings of the 29th Conference on Computational Natural Language Learning

In sentences such as John began the book, the complement noun, lexically denoting an entity, is interpreted as an event. This phenomenon is known in linguistics as complement coercion: the event associated with the verb is not overtly expressed but can be recovered from the meanings of other constituents, context and world knowledge. We investigate whether language models (LMs) can exploit sentence structure and compositional meaning to recover plausible events in complement coercion. For the first time, we tested different LMs in Norwegian, a low-resource language with high syntactic variation in coercion constructions across aspectual verbs. Results reveal that LMs struggle with retrieving plausible events and with ranking them above less plausible ones. Moreover, we found that LMs do not exploit the compositional properties of coercion sentences in their predictions.

Learning to Look at the Other Side: A Semantic Probing Study of Word Embeddings in LLMs with Enabled Bidirectional Attention
Zhaoxin Feng | Jianfei Ma | Emmanuele Chersoni | Xiaojing Zhao | Xiaoyi Bao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Autoregressive Large Language Models (LLMs) demonstrate exceptional performance in language understanding and generation. However, their application in text embedding tasks has been relatively slow, along with the analysis of their semantic representation in probing tasks, due to the constraints of the unidirectional attention mechanism. This paper aims to explore whether such constraints can be overcome by enabling bidirectional attention in LLMs. We tested different variants of the Llama architecture through additional training steps, progressively enabling bidirectional attention and unsupervised/supervised contrastive learning. Our results show that bidirectional attention improves the LLMs’ ability to represent subsequent context but weakens their utilization of preceding context, while contrastive learning training can help to maintain both abilities.

Context Effects on the Interpretation of Complement Coercion: A Comparative Study with Language Models in Norwegian
Matteo Radaelli | Emmanuele Chersoni | Alessandro Lenci | Giosuè Baggio
Proceedings of the 16th International Conference on Computational Semantics

In complement coercion sentences, like *John began the book*, a covert event (e.g., reading) may be recovered based on lexical meanings, world knowledge, and context. We investigate how context influences coercion interpretation performance for 17 language models (LMs) in Norwegian, a low-resource language. Our new dataset contained isolated coercion sentences (context-neutral), plus the same sentences with a subject NP that suggests a particular covert event and sentences that have a similar effect but that precede or follow the coercion sentence. LMs generally benefit from contextual enrichment, but performance varies depending on the model. Models that struggled in context-neutral sentences showed greater improvements from contextual enrichment. Subject NPs and pre-coercion sentences had the largest effect in facilitating coercion interpretation.

2024

Comparing Static and Contextual Distributional Semantic Models on Intrinsic Tasks: An Evaluation on Mandarin Chinese Datasets
Pranav A | Yan Cong | Emmanuele Chersoni | Yu-Yin Hsu | Alessandro Lenci
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The field of Distributional Semantics has recently undergone important changes, with the contextual representations produced by Transformers taking the place of static word embeddings models. Noticeably, previous studies comparing the two types of vectors have only focused on the English language and a limited number of models. In our study, we present a comparative evaluation of static and contextualized distributional models for Mandarin Chinese, focusing on a range of intrinsic tasks. Our results reveal that static models remain stronger for some of the classical tasks that consider word meaning independent of context, while contextualized models excel in identifying semantic relations between word pairs and in the categorization of words into abstract semantic classes.

Evaluating Multilingual Language Models for Cross-Lingual ESG Issue Identification
Wing Yan Li | Emmanuele Chersoni | Cindy Sing Bik Ngai
Proceedings of the Joint Workshop of the 7th Financial Technology and Natural Language Processing, the 5th Knowledge Discovery from Unstructured Data in Financial Services, and the 4th Workshop on Economics and Natural Language Processing

The automation of information extraction from ESG reports has recently become a topic of increasing interest in the Natural Language Processing community. While such information is highly relevant for socially responsible investments, identifying the specific issues discussed in a corporate social responsibility report is one of the first steps in an information extraction pipeline. In this paper, we evaluate methods for tackling the Multilingual Environmental, Social and Governance (ESG) Issue Identification Task. Our experiments use existing datasets in English, French and Chinese with a unified label set. Leveraging multilingual language models, we compare two approaches that are commonly adopted for the given task: off-the-shelf and fine-tuning. We show that fine-tuning models end-to-end is more robust than off-the-shelf methods. Additionally, translating text into the same language has negligible performance benefits.

Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base and Instruction-Tuned Language Models
Carina Kauf | Emmanuele Chersoni | Alessandro Lenci | Evelina Fedorenko | Anna A Ivanova
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Semantic plausibility (e.g. knowing that “the actor won the award” is more likely than “the actor won the battle”) serves as an effective proxy for general world knowledge. Language models (LMs) capture vast amounts of world knowledge by learning distributional patterns in text, accessible via log probabilities (LogProbs) they assign to plausible vs. implausible outputs. The new generation of instruction-tuned LMs can now also provide explicit estimates of plausibility via prompting. Here, we evaluate the effectiveness of LogProbs and basic prompting to measure semantic plausibility, both in single-sentence minimal pairs (Experiment 1) and short context-dependent scenarios (Experiment 2). We find that (i) in both base and instruction-tuned LMs, LogProbs offers a more reliable measure of semantic plausibility than direct zero-shot prompting, which yields inconsistent and often poor results; (ii) instruction-tuning generally does not alter the sensitivity of LogProbs to semantic plausibility (although sometimes decreases it); (iii) across models, context mostly modulates LogProbs in expected ways, as measured by three novel metrics of context-sensitive plausibility and their match to explicit human plausibility judgments. We conclude that, even in the era of prompt-based evaluations, LogProbs constitute a useful metric of semantic plausibility, both in base and instruction-tuned LMs.

Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024
Michael Zock | Emmanuele Chersoni | Yu-Yin Hsu | Simon de Deyne
Proceedings of the Workshop on Cognitive Aspects of the Lexicon @ LREC-COLING 2024

Investigating Aspect Features in Contextualized Embeddings with Semantic Scales and Distributional Similarity
Yuxi Li | Emmanuele Chersoni | Yu-Yin Hsu
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

Aspect, a linguistic category describing how actions and events unfold over time, is traditionally characterized by three semantic properties: stativity, durativity and telicity. In this study, we investigate whether and to what extent these properties are encoded in the verb token embeddings of the contextualized spaces of two English language models – BERT and GPT-2. First, we propose an experiment using semantic projections to examine whether the values of the vector dimensions of annotated verbs for stativity, durativity and telicity reflect human linguistic distinctions. Second, we use distributional similarity to replicate the notorious Imperfective Paradox described by Dowty (1977), and assess whether the embedding models are sensitive to capture contextual nuances of the verb telicity. Our results show that both models encode the semantic distinctions for the aspect properties of stativity and telicity in most of their layers, while durativity is the most challenging feature. As for the Imperfective Paradox, only the embedding similarities computed with the vectors from the early layers of the BERT model align with the expected pattern.

CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese
Le Qiu | Shanyue Guo | Tak-Sum Wong | Emmanuele Chersoni | John Lee | Chu-Ren Huang
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)

The prediction of lexical complexity in context is assuming an increasing relevance in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages.In our paper, we introduce CompLex-ZH, a dataset including words annotated with complexity scores in sentential contexts for Chinese. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language models-based features.

Can Large Language Models Interpret Noun-Noun Compounds? A Linguistically-Motivated Study on Lexicalized and Novel Compounds
Giulia Rambelli | Emmanuele Chersoni | Claudia Collacciani | Marianna Bolognesi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Noun-noun compounds interpretation is the task where a model is given one of such constructions, and it is asked to provide a paraphrase, making the semantic relation between the nouns explicit, as in carrot cake is “a cake made of carrots.” Such a task requires the ability to understand the implicit structured representation of the compound meaning. In this paper, we test to what extent the recent Large Language Models can interpret the semantic relation between the constituents of lexicalized English compounds and whether they can abstract from such semantic knowledge to predict the semantic relation between the constituents of similar but novel compounds by relying on analogical comparisons (e.g., carrot dessert). We test both Surprisal metrics and prompt-based methods to see whether i.) they can correctly predict the relation between constituents, and ii.) the semantic representation of the relation is robust to paraphrasing. Using a dataset of lexicalized and annotated noun-noun compounds, we find that LLMs can infer some semantic relations better than others (with a preference for compounds involving concrete concepts). When challenged to perform abstractions and transfer their interpretations to semantically similar but novel compounds, LLMs show serious limitations.

PolyuCBS at SMM4H 2024: LLM-based Medical Disorder and Adverse Drug Event Detection with Low-rank Adaptation
Zhai Yu | Xiaoyi Bao | Emmanuele Chersoni | Beatrice Portelli | Sophia Lee | Jinghang Gu | Chu-Ren Huang
Proceedings of the 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks

This is the demonstration of systems and results of our team’s participation in the Social Medical Mining for Health (SMM4H) 2024 Shared Task. Our team participated in two tasks: Task 1 and Task 5. Task 5 requires the detection of tweet sentences that claim children’s medical disorders from certain users. Task 1 needs teams to extract and normalize Adverse Drug Event terms in the tweet sentence. The team selected several Pre-trained Language Models and generative Large Language Models to meet the requirements. Strategies to improve the performance include cloze test, prompt engineering, Low Rank Adaptation etc. The test result of our system has an F1 score of 0.935, Precision of 0.954 and Recall of 0.917 in Task 5 and an overall F1 score of 0.08 in Task 1.

Probing Numerical Concepts in Financial Text with BERT Models
Shanyue Guo | Le Qiu | Emmanuele Chersoni
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning

Predicting Mandarin and Cantonese Adult Speakers’ Eye-Movement Patterns in Natural Reading
Li Junlin | Yu-Yin Hsu | Emmanuele Chersoni | Bo Peng
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Please find the attached PDF file for the extended abstract of our study.

2023

On Quick Kisses and How to Make Them Count: A Study on Event Construal in Light Verb Constructions with BERT
Chenxin Liu | Emmanuele Chersoni
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

Psycholinguistic studies suggested that our mental perception of events depends not only on the lexical items used to describe them, but also on the syntactic structure of the event description. More specifically, it has been argued that light verb constructions affect the perception of duration in event construal, such that the same event in this type of constructions is perceived by humans as taking less time (to give a kiss takes a shorter time than to kiss). In our paper, we present two experiments with BERT using English stimuli from psycholinguistic studies to investigate the effects of the syntactic construction on event duration and event similarity. We show that i) the dimensions of BERT vectors encode a smaller value for duration for both punctive and durative events in count syntax, in line with human results; on the other hand, we also found that ii) BERT semantic similarity fails to capture the conceptual shift that durative events should undergo in count syntax.

HKESG at the ML-ESG Task: Exploring Transformer Representations for Multilingual ESG Issue Identification
Ivan Mashkin | Emmanuele Chersoni
Proceedings of the Fifth Workshop on Financial Technology and Natural Language Processing and the Second Multimodal AI For Financial Forecasting

Identifying ESG Impact with Key Information
Le Qiu | Bo Peng | Jinghang Gu | Yu-Yin Hsu | Emmanuele Chersoni
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing

The paper presents a concise summary of our work for the ML-ESG-2 shared task, exclusively on the Chinese and English datasets. ML-ESG-2 aims to ascertain the influence of news articles on corporations, specifically from an ESG perspective. To this end, we generally explored the capability of key information for impact identification and experimented with various techniques at different levels. For instance, we attempted to incorporate important information at the word level with TF-IDF, at the sentence level with TextRank, and at the document level with summarization. The final results reveal that the one with GPT-4 for summarisation yields the best predictions.

Investigating the Effect of Discourse Connectives on Transformer Surprisal: Language Models Understand Connectives, Even So They Are Surprised
Yan Cong | Emmanuele Chersoni | Yu-Yin Hsu | Philippe Blache
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

As neural language models (NLMs) based on Transformers are becoming increasingly dominant in natural language processing, several studies have proposed analyzing the semantic and pragmatic abilities of such models. In our study, we aimed at investigating the effect of discourse connectives on NLMs with regard to Transformer Surprisal scores by focusing on the English stimuli of an experimental dataset, in which the expectations about an event in a discourse fragment could be reversed by a concessive or a contrastive connective. By comparing the Surprisal scores of several NLMs, we found that bigger NLMs show patterns similar to humans’ behavioral data when a concessive connective is used, while connective-related effects tend to disappear with a contrastive one. We have additionally validated our findings with GPT-Neo using an extended dataset, and results mostly show a consistent pattern.

We Understand Elliptical Sentences, and Language Models should Too: A New Dataset for Studying Ellipsis and its Interaction with Thematic Fit
Davide Testa | Emmanuele Chersoni | Alessandro Lenci
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ellipsis is a linguistic phenomenon characterized by the omission of one or more sentence elements. Solving such a linguistic construction is not a trivial issue in natural language processing since it involves the retrieval of non-overtly expressed verbal material, which might in turn require the model to integrate human-like syntactic and semantic knowledge. In this paper, we explored the issue of how the prototypicality of event participants affects the ability of Language Models (LMs) to handle elliptical sentences and to identify the omitted arguments at different degrees of thematic fit, ranging from highly typical participants to semantically anomalous ones. With this purpose in mind, we built ELLie, the first dataset composed entirely of utterances containing different types of elliptical constructions, and structurally suited for evaluating the effect of argument thematic fit in solving ellipsis and reconstructing the missing element. Our tests demonstrated that the probability scores assigned by the models are higher for typical events than for atypical and impossible ones in different elliptical contexts, confirming the influence of prototypicality of the event participants in interpreting such linguistic structures. Finally, we conducted a retrieval task of the elided verb in the sentence in which the low performance of LMs highlighted a considerable difficulty in reconstructing the correct event.

Comparing and Predicting Eye-tracking Data of Mandarin and Cantonese
Junlin Li | Bo Peng | Yu-yin Hsu | Emmanuele Chersoni
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Eye-tracking data in Chinese languages present unique challenges due to the non-alphabetic and unspaced nature of the Chinese writing systems. This paper introduces the first deeply-annotated joint Mandarin-Cantonese eye-tracking dataset, from which we achieve a unified eye-tracking prediction system for both language varieties. In addition to the commonly studied first fixation duration and the total fixation duration, this dataset also includes the second fixation duration, expressing fixation patterns that are more relevant to higher-level, structural processing. A basic comparison of the features and measurements in our dataset revealed variation between Mandarin and Cantonese on fixation patterns related to word class and word position. The test of feature usefulness suggested that traditional features are less powerful in predicting the second-pass fixation, to which the linear distance to root makes a leading contribution in Mandarin. In contrast, Cantonese eye-movement behavior relies more on word position and part of speech.

Collecting and Predicting Neurocognitive Norms for Mandarin Chinese
Le Qiu | Yu-Yin Hsu | Emmanuele Chersoni
Proceedings of the 15th International Conference on Computational Semantics

Language researchers have long assumed that concepts can be represented by sets of semantic features, and have traditionally encountered challenges in identifying a feature set that could be sufficiently general to describe the human conceptual experience in its entirety. In the dataset of English norms presented by Binder et al. (2016), also known as Binder norms, the authors introduced a new set of neurobiologically motivated semantic features in which conceptual primitives were defined in terms of modalities of neural information processing. However, no comparable norms are currently available for other languages. In our work, we built the Mandarin Chinese norm by translating the stimuli used in the original study and developed a comparable collection of human ratings for Mandarin Chinese. We also conducted some experiments on the automatic prediction of the Chinese Binder Norms based on the word embeddings of the corresponding words to assess the feasibility of modeling experiential semantic features via corpus-based representations.

Existence Justifies Reason: A Data Analysis on Chinese Classifiers Based on Eye Tracking and Transformers
Yu Wang | Emmanuele Chersoni | Chu-Ren Huang
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Are Language Models Sensitive to Semantic Attraction? A Study on Surprisal
Yan Cong | Emmanuele Chersoni | Yu-yin Hsu | Alessandro Lenci
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

In psycholinguistics, semantic attraction is a sentence processing phenomenon in which a given argument violates the selectional requirements of a verb, but this violation is not perceived by comprehenders due to its attraction to another noun in the same sentence, which is syntactically unrelated but semantically sound. In our study, we use autoregressive language models to compute the sentence-level and the target phrase-level Surprisal scores of a psycholinguistic dataset on semantic attraction. Our results show that the models are sensitive to semantic attraction, leading to reduced Surprisal scores, although none of them perfectly matches the human behavioral pattern.

Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
Chu-Ren Huang | Yasunari Harada | Jong-Bok Kim | Si Chen | Yu-Yin Hsu | Emmanuele Chersoni | Pranav A | Winnie Huiheng Zeng | Bo Peng | Yuxi Li | Junlin Li
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Are Frequent Phrases Directly Retrieved like Idioms? An Investigation with Self-Paced Reading and Language Models
Giulia Rambelli | Emmanuele Chersoni | Marco S. G. Senaldi | Philippe Blache | Alessandro Lenci
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)

An open question in language comprehension studies is whether non-compositional multiword expressions like idioms and compositional-but-frequent word sequences are processed differently. Are the latter constructed online, or are instead directly retrieved from the lexicon, with a degree of entrenchment depending on their frequency? In this paper, we address this question with two different methodologies. First, we set up a self-paced reading experiment comparing human reading times for idioms and both highfrequency and low-frequency compositional word sequences. Then, we ran the same experiment using the Surprisal metrics computed with Neural Language Models (NLMs). Our results provide evidence that idiomatic and high-frequency compositional expressions are processed similarly by both humans and NLMs. Additional experiments were run to test the possible factors that could affect the NLMs’ performance.

Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures
Jakob Prange | Emmanuele Chersoni
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

In this work we build upon negative results from an attempt at language modeling with predicted semantic structure, in order to establish empirical lower bounds on what could have made the attempt successful. More specifically, we design a concise binary vector representation of semantic structure at the lexical level and evaluate in-depth how good an incremental tagger needs to be in order to achieve better-than-baseline performance with an end-to-end semantic-bootstrapping language model. We envision such a system as consisting of a (pretrained) sequential-neural component and a hierarchical-symbolic component working together to generate text with low surprisal and high linguistic interpretability. We find that (a) dimensionality of the semantic vector representation can be dramatically reduced without losing its main advantages and (b) lower bounds on prediction quality cannot be established via a single score alone, but need to take the distributions of signal and noise into account.

ChiWUG: A Graph-based Evaluation Dataset for Chinese Lexical Semantic Change Detection
Jing Chen | Emmanuele Chersoni | Dominik Schlechtweg | Jelena Prokic | Chu-Ren Huang
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

Recent studies suggested that language models are efficient tools for measuring lexical semantic change. In our paper, we present the compilation of the first graph-based evaluation dataset for lexical semantic change in the context of the Chinese language, specifically covering the periods of pre- and post- Reform and Opening Up. Exploiting the existing framework DURel, we collect over 61,000 human semantic relatedness judgments for 40 targets. The inferred word usage graphs and semantic change scores provide a basis for visualization and evaluation of semantic change.

2022

Discovering Financial Hypernyms by Prompting Masked Language Models
Bo Peng | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

With the rising popularity of Transformer-based language models, several studies have tried to exploit their masked language modeling capabilities to automatically extract relational linguistic knowledge, although this kind of research has rarely investigated semantic relations in specialized domains. The present study aims at testing a general-domain and a domain-adapted Transformer models on two datasets of financial term-hypernym pairs using the prompt methodology. Our results show that the differences of prompts impact critically on models’ performance, and that domain adaptation on financial text generally improves the capacity of the models to associate the target terms with the right hypernyms, although the more successful models are those retaining a general-domain vocabulary.

Proceedings of the Workshop on Cognitive Aspects of the Lexicon
Michael Zock | Emmanuele Chersoni | Yu-Yin Hsu | Enrico Santus
Proceedings of the Workshop on Cognitive Aspects of the Lexicon

PolyU-CBS at TSAR-2022 Shared Task: A Simple, Rank-Based Method for Complex Word Substitution in Two Steps
Emmanuele Chersoni | Yu-Yin Hsu
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

In this paper, we describe the system we presented at the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022) regarding the shared task on Lexical Simplification for English, Portuguese, and Spanish. We proposed an unsupervised approach in two steps: First, we used a masked language model with word masking for each language to extract possible candidates for the replacement of a difficult word; second, we ranked the candidates according to three different Transformer-based metrics. Finally, we determined our list of candidates based on the lowest average rank across different metrics.

Generalizing over Long Tail Concepts for Medical Term Normalization
Beatrice Portelli | Simone Scaboro | Enrico Santus | Hooman Sedghamiz | Emmanuele Chersoni | Giuseppe Serra
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Medical term normalization consists in mapping a piece of text to a large number of output classes.Given the small size of the annotated datasets and the extremely long tail distribution of the concepts, it is of utmost importance to develop models that are capable to generalize to scarce or unseen concepts.An important attribute of most target ontologies is their hierarchical structure. In this paper we introduce a simple and effective learning strategy that leverages such information to enhance the generalizability of both discriminative and generative models.The evaluation shows that the proposed strategy produces state-of-the-art performance on seen concepts and consistent improvements on unseen ones, allowing also for efficient zero-shot knowledge transfer across text typologies and datasets.

Pragmatic and Logical Inferences in NLI Systems: The Case of Conjunction Buttressing
Paolo Pedinotti | Emmanuele Chersoni | Enrico Santus | Alessandro Lenci
Proceedings of the Second Workshop on Understanding Implicit and Underspecified Language

An intelligent system is expected to perform reasonable inferences, accounting for both the literal meaning of a word and the meanings a word can acquire in different contexts. A specific kind of inference concerns the connective and, which in some cases gives rise to a temporal succession or causal interpretation in contrast with the logic, commutative one (Levinson, 2000). In this work, we investigate the phenomenon by creating a new dataset for evaluating the interpretation of and by NLI systems, which we use to test three Transformer-based models. Our results show that all systems generalize patterns that are consistent with both the logical and the pragmatic interpretation, perform inferences that are inconsistent with each other, and show clear divergences with both theoretical accounts and humans’ behavior.

Evaluating Monolingual and Crosslingual Embeddings on Datasets of Word Association Norms
Trina Kwong | Emmanuele Chersoni | Rong Xiang
Proceedings of the BUCC Workshop within LREC 2022

In free word association tasks, human subjects are presented with a stimulus word and are then asked to name the first word (the response word) that comes up to their mind. Those associations, presumably learned on the basis of conceptual contiguity or similarity, have attracted for a long time the attention of researchers in linguistics and cognitive psychology, since they are considered as clues about the internal organization of the lexical knowledge in the semantic memory. Word associations data have also been used to assess the performance of Vector Space Models for English, but evaluations for other languages have been relatively rare so far. In this paper, we introduce word associations datasets for Italian, Spanish and Mandarin Chinese by extracting data from the Small World of Words project, and we propose two different tasks inspired by the previous literature. We tested both monolingual and crosslingual word embeddings on the new datasets, showing that they perform similarly in the evaluation tasks.

Exploring Nominal Coercion in Semantic Spaces with Static and Contextualized Word Embeddings
Chenxin Liu | Emmanuele Chersoni
Proceedings of the Workshop on Cognitive Aspects of the Lexicon

The distinction between mass nouns and count nouns has a long history in formal semantics, and linguists have been trying to identify the semantic properties defining the two classes. However, they also recognized that both can undergo meaning shifts and be used in contexts of a different type, via nominal coercion. In this paper, we present an approach to measure the meaning shift in count-mass coercion in English that makes use of static and contextualized word embedding distance. Our results show that the coercion shifts are detected only by a small subset of the traditional word embedding models, and that the shifts detected by the contextualized embedding of BERT are more pronounced for mass nouns.

AILAB-Udine@SMM4H’22: Limits of Transformers and BERT Ensembles
Beatrice Portelli | Simone Scaboro | Emmanuele Chersoni | Enrico Santus | Giuseppe Serra
Proceedings of the Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper describes the models developed by the AILAB-Udine team for the SMM4H’22 Shared Task. We explored the limits of Transformer based models on text classification, entity extraction and entity normalization, tackling Tasks 1, 2, 5, 6 and 10. The main takeaways we got from participating in different tasks are: the overwhelming positive effects of combining different architectures when using ensemble learning, and the great potential of generative models for term normalization.

CMCL 2022 Shared Task on Multilingual and Crosslingual Prediction of Human Reading Behavior
Nora Hollenstein | Emmanuele Chersoni | Cassandra Jacobs | Yohei Oseki | Laurent Prévot | Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

We present the second shared task on eye-tracking data prediction of the Cognitive Modeling and Computational Linguistics Workshop (CMCL). Differently from the previous edition, participating teams are asked to predict eye-tracking features from multiple languages, including a surprise language for which there were no available training data. Moreover, the task also included the prediction of standard deviations of feature values in order to account for individual differences between readers.A total of six teams registered to the task. For the first subtask on multilingual prediction, the winning team proposed a regression model based on lexical features, while for the second subtask on cross-lingual prediction, the winning team used a hybrid model based on a multilingual transformer embeddings as well as statistical features.

Compositionality as an Analogical Process: Introducing ANNE
Giulia Rambelli | Emmanuele Chersoni | Philippe Blache | Alessandro Lenci
Proceedings of the Workshop on Cognitive Aspects of the Lexicon

Usage-based constructionist approaches consider language a structured inventory of constructions, form-meaning pairings of different schematicity and complexity, and claim that the more a linguistic pattern is encountered, the more it becomes accessible to speakers. However, when an expression is unavailable, what processes underlie the interpretation? While traditional answers rely on the principle of compositionality, for which the meaning is built word-by-word and incrementally, usage-based theories argue that novel utterances are created based on previously experienced ones through analogy, mapping an existing structural pattern onto a novel instance. Starting from this theoretical perspective, we propose here a computational implementation of these assumptions. As the principle of compositionality has been used to generate distributional representations of phrases, we propose a neural network simulating the construction of phrasal embedding as an analogical process. Our framework, inspired by word2vec and computer vision techniques, was evaluated on tasks of generalization from existing vectors.

Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni | Nora Hollenstein | Cassandra Jacobs | Yohei Oseki | Laurent Prévot | Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Lexicon of Changes: Towards the Evaluation of Diachronic Semantic Shift in Chinese
Jing Chen | Emmanuele Chersoni | Chu-ren Huang
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

Recent research has brought a wind of using computational approaches to the classic topic of semantic change, aiming to tackle one of the most challenging issues in the evolution of human language. While several methods for detecting semantic change have been proposed, such studies are limited to a few languages, where evaluation datasets are available. This paper presents the first dataset for evaluating Chinese semantic change in contexts preceding and following the Reform and Opening-up, covering a 50-year period in Modern Chinese. Following the DURel framework, we collected 6,000 human judgments for the dataset. We also reported the performance of alignment-based word embedding models on this evaluation dataset, achieving high and significant correlation scores.

2021

Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation
Kaibao Hu | Jong-Bok Kim | Chengqing Zong | Emmanuele Chersoni
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Modeling the Influence of Verb Aspect on the Activation of Typical Event Locations with BERT
Won Ik Cho | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

PolyU CBS-Comp at SemEval-2021 Task 1: Lexical Complexity Prediction (LCP)
Rong Xiang | Jinghang Gu | Emmanuele Chersoni | Wenjie Li | Qin Lu | Chu-Ren Huang
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

In this contribution, we describe the system presented by the PolyU CBS-Comp Team at the Task 1 of SemEval 2021, where the goal was the estimation of the complexity of words in a given sentence context. Our top system, based on a combination of lexical, syntactic, word embeddings and Transformers-derived features and on a Gradient Boosting Regressor, achieves a top correlation score of 0.754 on the subtask 1 for single words and 0.659 on the subtask 2 for multiword expressions.

Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni | Nora Hollenstein | Cassandra Jacobs | Yohei Oseki | Laurent Prévot | Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

NADE: A Benchmark for Robust Adverse Drug Events Extraction in Face of Negations
Simone Scaboro | Beatrice Portelli | Emmanuele Chersoni | Enrico Santus | Giuseppe Serra
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Adverse Drug Event (ADE) extraction models can rapidly examine large collections of social media texts, detecting mentions of drug-related adverse reactions and trigger medical investigations. However, despite the recent advances in NLP, it is currently unknown if such models are robust in face of negation, which is pervasive across language varieties. In this paper we evaluate three state-of-the-art systems, showing their fragility against negation, and then we introduce two possible strategies to increase the robustness of these models: a pipeline approach, relying on a specific component for negation detection; an augmentation of an ADE extraction dataset to artificially create negated samples and further train the models. We show that both strategies bring significant increases in performance, lowering the number of spurious entities predicted by the models. Our dataset and code will be publicly released to encourage research on the topic.

CMCL 2021 Shared Task on Eye-Tracking Prediction
Nora Hollenstein | Emmanuele Chersoni | Cassandra L. Jacobs | Yohei Oseki | Laurent Prévot | Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Eye-tracking data from reading represent an important resource for both linguistics and natural language processing. The ability to accurately model gaze features is crucial to advance our understanding of language processing. This paper describes the Shared Task on Eye-Tracking Data Prediction, jointly organized with the eleventh edition of the Work- shop on Cognitive Modeling and Computational Linguistics (CMCL 2021). The goal of the task is to predict 5 different token- level eye-tracking metrics of the Zurich Cognitive Language Processing Corpus (ZuCo). Eye-tracking data were recorded during natural reading of English sentences. In total, we received submissions from 13 registered teams, whose systems include boosting algorithms with handcrafted features, neural models leveraging transformer language models, or hybrid approaches. The winning system used a range of linguistic and psychometric features in a gradient boosting framework.

BERT Prescriptions to Avoid Unwanted Headaches: A Comparison of Transformer Architectures for Adverse Drug Event Detection
Beatrice Portelli | Edoardo Lenzi | Emmanuele Chersoni | Giuseppe Serra | Enrico Santus
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Pretrained transformer-based models, such as BERT and its variants, have become a common choice to obtain state-of-the-art performances in NLP tasks. In the identification of Adverse Drug Events (ADE) from social media texts, for example, BERT architectures rank first in the leaderboard. However, a systematic comparison between these models has not yet been done. In this paper, we aim at shedding light on the differences between their performance analyzing the results of 12 models, tested on two standard benchmarks. SpanBERT and PubMedBERT emerged as the best models in our evaluation: this result clearly shows that span-based pretraining gives a decisive advantage in the precise recognition of ADEs, and that in-domain language pretraining is particularly useful when the transformer model is trained just on biomedical text from scratch.

Decoding Word Embeddings with Brain-Based Semantic Features
Emmanuele Chersoni | Enrico Santus | Chu-Ren Huang | Alessandro Lenci
Computational Linguistics, Volume 47, Issue 3 - November 2021

Word embeddings are vectorial semantic representations built with either counting or predicting techniques aimed at capturing shades of meaning from word co-occurrences. Since their introduction, these representations have been criticized for lacking interpretable dimensions. This property of word embeddings limits our understanding of the semantic features they actually encode. Moreover, it contributes to the “black box” nature of the tasks in which they are used, since the reasons for word embedding performance often remain opaque to humans. In this contribution, we explore the semantic properties encoded in word embeddings by mapping them onto interpretable vectors, consisting of explicit and neurobiologically motivated semantic features (Binder et al. 2016). Our exploration takes into account different types of embeddings, including factorized count vectors and predict models (Skip-Gram, GloVe, etc.), as well as the most recent contextualized representations (i.e., ELMo and BERT). In our analysis, we first evaluate the quality of the mapping in a retrieval task, then we shed light on the semantic features that are better encoded in each embedding type. A large number of probing tasks is finally set to assess how the original and the mapped embeddings perform in discriminating semantic categories. For each probing task, we identify the most relevant semantic features and we show that there is a correlation between the embedding performance and how they encode those features. This study sets itself as a step forward in understanding which aspects of meaning are captured by vector spaces, by proposing a new and simple method to carve human-interpretable semantic representations from distributional vectors.

Is Domain Adaptation Worth Your Investment? Comparing BERT and FinBERT on Financial Tasks
Bo Peng | Emmanuele Chersoni | Yu-Yin Hsu | Chu-Ren Huang
Proceedings of the Third Workshop on Economics and Natural Language Processing

With the recent rise in popularity of Transformer models in Natural Language Processing, research efforts have been dedicated to the development of domain-adapted versions of BERT-like architectures. In this study, we focus on FinBERT, a Transformer model trained on text from the financial domain. By comparing its performances with the original BERT on a wide variety of financial text processing tasks, we found continual pretraining from the original model to be the more beneficial option. Domain-specific pretraining from scratch, conversely, seems to be less effective.

Exploring a Unified Sequence-To-Sequence Transformer for Medical Product Safety Monitoring in Social Media
Shivam Raval | Hooman Sedghamiz | Enrico Santus | Tuka Alhanai | Mohammad Ghassemi | Emmanuele Chersoni
Findings of the Association for Computational Linguistics: EMNLP 2021

Adverse Events (AE) are harmful events resulting from the use of medical products. Although social media may be crucial for early AE detection, the sheer scale of this data makes it logistically intractable to analyze using human agents, with NLP representing the only low-cost and scalable alternative. In this paper, we frame AE Detection and Extraction as a sequence-to-sequence problem using the T5 model architecture and achieve strong performance improvements over the baselines on several English benchmarks (F1 = 0.71, 12.7% relative improvement for AE Detection; Strict F1 = 0.713, 12.4% relative improvement for AE Extraction). Motivated by the strong commonalities between AE tasks, the class imbalance in AE benchmarks, and the linguistic and structural variety typical of social media texts, we propose a new strategy for multi-task training that accounts, at the same time, for task and dataset characteristics. Our approach increases model robustness, leading to further performance gains. Finally, our framework shows some language transfer capabilities, obtaining higher performance than Multilingual BERT in zero-shot learning on French data.

Looking for a Role for Word Embeddings in Eye-Tracking Features Prediction: Does Semantic Similarity Help?
Lavinia Salicchi | Alessandro Lenci | Emmanuele Chersoni
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

Eye-tracking psycholinguistic studies have suggested that context-word semantic coherence and predictability influence language processing during the reading activity. In this study, we investigate the correlation between the cosine similarities computed with word embedding models (both static and contextualized) and eye-tracking data from two naturalistic reading corpora. We also studied the correlations of surprisal scores computed with three state-of-the-art language models. Our results show strong correlation for the scores computed with BERT and GloVe, suggesting that similarity can play an important role in modeling reading times.

Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge
Paolo Pedinotti | Giulia Rambelli | Emmanuele Chersoni | Enrico Santus | Alessandro Lenci | Philippe Blache
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Prior research has explored the ability of computational models to predict a word semantic fit with a given predicate. While much work has been devoted to modeling the typicality relation between verbs and arguments in isolation, in this paper we take a broader perspective by assessing whether and to what extent computational approaches have access to the information about the typicality of entire events and situations described in language (Generalized Event Knowledge). Given the recent success of Transformers Language Models (TLMs), we decided to test them on a benchmark for the dynamic estimation of thematic fit. The evaluation of these models was performed in comparison with SDM, a framework specifically designed to integrate events in sentence meaning representations, and we conducted a detailed error analysis to investigate which factors affect their behavior. Our results show that TLMs can reach performances that are comparable to those achieved by SDM. However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge, and their predictions often depend on surface linguistic features, such as frequent words, collocations and syntactic patterns, thereby showing sub-optimal generalization abilities.

2020

Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources
Emmanuele Chersoni | Barry Devereux | Chu-Ren Huang
Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources

Proceedings of the Workshop on the Cognitive Aspects of the Lexicon
Michael Zock | Emmanuele Chersoni | Alessandro Lenci | Enrico Santus
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

Automatic Learning of Modality Exclusivity Norms with Crosslingual Word Embeddings
Emmanuele Chersoni | Rong Xiang | Qin Lu | Chu-Ren Huang
Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics

Collecting modality exclusivity norms for lexical items has recently become a common practice in psycholinguistics and cognitive research. However, these norms are available only for a relatively small number of languages and often involve a costly and time-consuming collection of ratings. In this work, we aim at learning a mapping between word embeddings and modality norms. Our experiments focused on crosslingual word embeddings, in order to predict modality association scores by training on a high-resource language and testing on a low-resource one. We ran two experiments, one in a monolingual and the other one in a crosslingual setting. Results show that modality prediction using off-the-shelf crosslingual embeddings indeed has moderate-to-high correlations with human ratings even when regression algorithms are trained on an English resource and tested on a completely unseen language.

Using Conceptual Norms for Metaphor Detection
Mingyu Wan | Kathleen Ahrens | Emmanuele Chersoni | Menghan Jiang | Qi Su | Rong Xiang | Chu-Ren Huang
Proceedings of the Second Workshop on Figurative Language Processing

This paper reports a linguistically-enriched method of detecting token-level metaphors for the second shared task on Metaphor Detection. We participate in all four phases of competition with both datasets, i.e. Verbs and AllPOS on the VUA and the TOFEL datasets. We use the modality exclusivity and embodiment norms for constructing a conceptual representation of the nodes and the context. Our system obtains an F-score of 0.652 for the VUA Verbs track, which is 5% higher than the strong baselines. The experimental results across models and datasets indicate the salient contribution of using modality exclusivity and modality shift information for predicting metaphoricity.

Comparing Probabilistic, Distributional and Transformer-Based Models on Logical Metonymy Interpretation
Giulia Rambelli | Emmanuele Chersoni | Alessandro Lenci | Philippe Blache | Chu-Ren Huang
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

In linguistics and cognitive science, Logical metonymies are defined as type clashes between an event-selecting verb and an entity-denoting noun (e.g. The editor finished the article), which are typically interpreted by inferring a hidden event (e.g. reading) on the basis of contextual cues. This paper tackles the problem of logical metonymy interpretation, that is, the retrieval of the covert event via computational methods. We compare different types of models, including the probabilistic and the distributional ones previously introduced in the literature on the topic. For the first time, we also tested on this task some of the recent Transformer-based models, such as BERT, RoBERTa, XLNet, and GPT-2. Our results show a complex scenario, in which the best Transformer-based models and some traditional distributional models perform very similarly. However, the low performance on some of the testing datasets suggests that logical metonymy is still a challenging phenomenon for computational modeling.

The CogALex Shared Task on Monolingual and Multilingual Identification of Semantic Relations
Rong Xiang | Emmanuele Chersoni | Luca Iacoponi | Enrico Santus
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

The shared task of the CogALex-VI workshop focuses on the monolingual and multilingual identification of semantic relations. We provided training and validation data for the following languages: English, German and Chinese. Given a word pair, systems had to be trained to identify which relation holds between them, with possible choices being synonymy, antonymy, hypernymy and no relation at all. Two test sets were released for evaluating the participating systems. One containing pairs for each of the training languages (systems were evaluated in a monolingual fashion) and the other proposing a surprise language to test the crosslingual transfer capabilities of the systems. Among the submitted systems, top performance was achieved by a transformer-based model in both the monolingual and in the multilingual setting, for all the tested languages, proving the potentials of this recently-introduced neural architecture. The shared task description and the results are available at https://sites.google.com/site/cogalexvisharedtask/.

Are Word Embeddings Really a Bad Fit for the Estimation of Thematic Fit?
Emmanuele Chersoni | Ludovica Pannitto | Enrico Santus | Alessandro Lenci | Chu-Ren Huang
Proceedings of the Twelfth Language Resources and Evaluation Conference

While neural embeddings represent a popular choice for word representation in a wide variety of NLP tasks, their usage for thematic fit modeling has been limited, as they have been reported to lag behind syntax-based count models. In this paper, we propose a complete evaluation of count models and word embeddings on thematic fit estimation, by taking into account a larger number of parameters and verb roles and introducing also dependency-based embeddings in the comparison. Our results show a complex scenario, where a determinant factor for the performance seems to be the availability to the model of reliable syntactic information for building the distributional representations of the roles.

Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni | Cassandra Jacobs | Yohei Oseki | Laurent Prévot | Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Ciron: a New Benchmark Dataset for Chinese Irony Detection
Rong Xiang | Xuefeng Gao | Yunfei Long | Anran Li | Emmanuele Chersoni | Qin Lu | Chu-Ren Huang
Proceedings of the Twelfth Language Resources and Evaluation Conference

Automatic Chinese irony detection is a challenging task, and it has a strong impact on linguistic research. However, Chinese irony detection often lacks labeled benchmark datasets. In this paper, we introduce Ciron, the first Chinese benchmark dataset available for irony detection for machine learning models. Ciron includes more than 8.7K posts, collected from Weibo, a micro blogging platform. Most importantly, Ciron is collected with no pre-conditions to ensure a much wider coverage. Evaluation on seven different machine learning classifiers proves the usefulness of Ciron as an important resource for Chinese irony detection.

2019

Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni | Cassandra Jacobs | Alessandro Lenci | Tal Linzen | Laurent Prévot | Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

PolyU_CBS-CFA at the FinSBD Task: Sentence Boundary Detection of Financial Data with Domain Knowledge Enhancement and Bilingual Training
Mingyu Wan | Rong Xiang | Emmanuele Chersoni | Natalia Klyueva | Kathleen Ahrens | Bin Miao | David Broadstock | Jian Kang | Amos Yung | Chu-Ren Huang
Proceedings of the First Workshop on Financial Technology and Natural Language Processing

Distributional Semantics Meets Construction Grammar. towards a Unified Usage-Based Model of Grammar and Meaning
Giulia Rambelli | Emmanuele Chersoni | Philippe Blache | Chu-Ren Huang | Alessandro Lenci
Proceedings of the First International Workshop on Designing Meaning Representations

In this paper, we propose a new type of semantic representation of Construction Grammar that combines constructions with the vector representations used in Distributional Semantics. We introduce a new framework, Distributional Construction Grammar, where grammar and meaning are systematically modeled from language use, and finally, we discuss the kind of contributions that distributional models can provide to CxG representation from a linguistic and cognitive perspective.

2018

A Rank-Based Similarity Metric for Word Embeddings
Enrico Santus | Hongmin Wang | Emmanuele Chersoni | Yue Zhang
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Word Embeddings have recently imposed themselves as a standard for representing word meaning in NLP. Semantic similarity between word pairs has become the most common evaluation benchmark for these representations, with vector cosine being typically used as the only similarity metric. In this paper, we report experiments with a rank-based metric for WE, which performs comparably to vector cosine in similarity estimation and outperforms it in the recently-introduced and challenging task of outlier detection, thus suggesting that rank-based measures can improve clustering quality.

BomJi at SemEval-2018 Task 10: Combining Vector-, Pattern- and Graph-based Information to Identify Discriminative Attributes
Enrico Santus | Chris Biemann | Emmanuele Chersoni
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes BomJi, a supervised system for capturing discriminative attributes in word pairs (e.g. yellow as discriminative for banana over watermelon). The system relies on an XGB classifier trained on carefully engineered graph-, pattern- and word embedding-based features. It participated in the SemEval-2018 Task 10 on Capturing Discriminative Attributes, achieving an F1 score of 0.73 and ranking 2nd out of 26 participant systems.

Modeling Violations of Selectional Restrictions with Distributional Semantics
Emmanuele Chersoni | Adrià Torrens Urrutia | Philippe Blache | Alessandro Lenci
Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing

Distributional Semantic Models have been successfully used for modeling selectional preferences in a variety of scenarios, since distributional similarity naturally provides an estimate of the degree to which an argument satisfies the requirement of a given predicate. However, we argue that the performance of such models on rare verb-argument combinations has received relatively little attention: it is not clear whether they are able to distinguish the combinations that are simply atypical, or implausible, from the semantically anomalous ones, and in particular, they have never been tested on the task of modeling their differences in processing complexity. In this paper, we compare two different models of thematic fit by testing their ability of identifying violations of selectional restrictions in two datasets from the experimental studies.

2017

Measuring Thematic Fit with Distributional Feature Overlap
Enrico Santus | Emmanuele Chersoni | Alessandro Lenci | Philippe Blache
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we introduce a new distributional method for modeling predicate-argument thematic fit judgments. We use a syntax-based DSM to build a prototypical representation of verb-specific roles: for every verb, we extract the most salient second order contexts for each of its roles (i.e. the most salient dimensions of typical role fillers), and then we compute thematic fit as a weighted overlap between the top features of candidate fillers and role prototypes. Our experiments show that our method consistently outperforms a baseline re-implementing a state-of-the-art system, and achieves better or comparable results to those reported in the literature for the other unsupervised systems. Moreover, it provides an explicit representation of the features characterizing verb-specific semantic roles.

Is Structure Necessary for Modeling Argument Expectations in Distributional Semantics?
Emmanuele Chersoni | Enrico Santus | Philippe Blache | Alessandro Lenci
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Long papers

Logical Metonymy in a Distributional Model of Sentence Comprehension
Emmanuele Chersoni | Alessandro Lenci | Philippe Blache
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

In theoretical linguistics, logical metonymy is defined as the combination of an event-subcategorizing verb with an entity-denoting direct object (e.g., The author began the book), so that the interpretation of the VP requires the retrieval of a covert event (e.g., writing). Psycholinguistic studies have revealed extra processing costs for logical metonymy, a phenomenon generally explained with the introduction of new semantic structure. In this paper, we present a general distributional model for sentence comprehension inspired by the Memory, Unification and Control model by Hagoort (2013,2016). We show that our distributional framework can account for the extra processing costs of logical metonymy and can identify the covert event in a classification task.

2016

Towards a Distributional Model of Semantic Complexity
Emmanuele Chersoni | Philippe Blache | Alessandro Lenci
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

In this paper, we introduce for the first time a Distributional Model for computing semantic complexity, inspired by the general principles of the Memory, Unification and Control framework(Hagoort, 2013; Hagoort, 2016). We argue that sentence comprehension is an incremental process driven by the goal of constructing a coherent representation of the event represented by the sentence. The composition cost of a sentence depends on the semantic coherence of the event being constructed and on the activation degree of the linguistic constructions. We also report the results of a first evaluation of the model on the Bicknell dataset (Bicknell et al., 2010).

Testing APSyn against Vector Cosine on Similarity Estimation
Enrico Santus | Emmanuele Chersoni | Alessandro Lenci | Chu-Ren Huang | Philippe Blache
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

Representing Verbs with Rich Contexts: an Evaluation on Verb Similarity
Emmanuele Chersoni | Enrico Santus | Alessandro Lenci | Philippe Blache | Chu-Ren Huang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

CogALex-V Shared Task: ROOT18
Emmanuele Chersoni | Giulia Rambelli | Enrico Santus
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

In this paper, we describe ROOT 18, a classifier using the scores of several unsupervised distributional measures as features to discriminate between semantically related and unrelated words, and then to classify the related pairs according to their semantic relation (i.e. synonymy, antonymy, hypernymy, part-whole meronymy). Our classifier participated in the CogALex-V Shared Task, showing a solid performance on the first subtask, but a poor performance on the second subtask. The low scores reported on the second subtask suggest that distributional measures are not sufficient to discriminate between multiple semantic relations at once.

Co-authors

Giulia Rambelli 7

Cassandra L. Jacobs 6

Laurent Prévot 6

Beatrice Portelli 5

Nora Hollenstein 4

Giuseppe Serra 4

Huacheng Song 4

Simone Scaboro 3

Kathleen Ahrens 2

Giosuè Baggio 2

Yasunari Harada 2

Paolo Pedinotti 2

Matteo Radaelli 2

Hooman Sedghamiz 2

Genta Indra Winata 2

Xiaojing Zhao 2

David Ifeoluwa Adelani 1

Muhammad Farid Adilazuarda 1

Alham Fikri Aji 1

Afifa Amriani 1

David Anugraha 1

Michael Anugraha 1

Rio Alexander Audino 1

Serena Auriemma 1

Chris Biemann 1

Marianna Bolognesi 1

Alessandro Bondielli 1

David Broadstock 1

Samuel Cahyawijaya 1

Claudia Collacciani 1

Jan Christian Blaise Cruz 1

Barry Devereux 1

Simon De Deyne 1

Evelina Fedorenko 1

Mohammad Ghassemi 1

Frederikus Hudi 1

Le Thanh Huong 1

Nguyen T.M. Huyen 1

Luca Iacoponi 1

Fariz Ikhwantri 1

Patrick Amadeus Irawan 1

Anna A. Ivanova 1

Menghan Jiang 1

Natalia Klyueva 1

Garry Kuwanto 1

Cheng Ching Lam 1

En-Shiun Annie Lee 1

John S. Y. Lee 1

Edoardo Lenzi 1

Peerat Limkonchotiwat 1

Candy Olivia Mawalim 1

Martina Miliani 1

Cindy Sing Bik Ngai 1

Chong-Wah Ngo 1

Minh Le Nguyen 1

Nedjma Ousidhoum 1

Ludovica Pannitto 1

Lucia Passaro 1

Ashmari Pramodya 1

Ubaidillah Ariq Prathama 1

Jelena Prokić 1

Ayu Purwarianti 1

Jan Wira Gotama Putra 1

Rifki Afina Putri 1

Maria Angelica Riera Machin 1

Rachel Edita Oñate Roxas 1

Lavinia Salicchi 1

Stephanie Yulia Salim 1

Natasha Christabelle Santosa 1

Dominik Schlechtweg 1

Marco S. G. Senaldi 1

Qi Su (苏琪, 苏祺, 祺苏,) 1

Irene Sucameli 1

Lucky Susanto 1

Adrià Torrens Urrutia 1

Aline Villavicencio 1

Yu Wang (王昱) 1

Taro Watanabe 1

Haryo Akbarianto Wibowo 1

Derry Tanti Wijaya 1

Winnie Huiheng Zeng 1

Shi-Xiong Zhang 1

Marina Zhukova 1

Chengqing Zong 1

Venues