2024
CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese
Le Qiu | Shanyue Guo | Tak-Sum Wong | Emmanuele Chersoni | John Lee | Chu-Ren Huang
Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
The prediction of lexical complexity in context is becoming increasingly relevant in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages. In our paper, we introduce CompLex-ZH, a dataset of Chinese words annotated with complexity scores in sentential contexts. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted features and features derived from language models.
Probing Numerical Concepts in Financial Text with BERT Models
Shanyue Guo | Le Qiu | Emmanuele Chersoni
Proceedings of the Eighth Financial Technology and Natural Language Processing and the 1st Agent AI for Scenario Planning
2023
Collecting and Predicting Neurocognitive Norms for Mandarin Chinese
Le Qiu | Yu-Yin Hsu | Emmanuele Chersoni
Proceedings of the 15th International Conference on Computational Semantics
Language researchers have long assumed that concepts can be represented by sets of semantic features, and have traditionally encountered challenges in identifying a feature set general enough to describe the human conceptual experience in its entirety. In the dataset of English norms presented by Binder et al. (2016), also known as the Binder norms, the authors introduced a new set of neurobiologically motivated semantic features in which conceptual primitives were defined in terms of modalities of neural information processing. However, no comparable norms are currently available for other languages. In our work, we built Mandarin Chinese norms by translating the stimuli used in the original study and collecting a comparable set of human ratings. We also conducted experiments on automatically predicting the Chinese Binder norms from the embeddings of the corresponding words, to assess the feasibility of modeling experiential semantic features via corpus-based representations.
Identifying ESG Impact with Key Information
Le Qiu | Bo Peng | Jinghang Gu | Yu-Yin Hsu | Emmanuele Chersoni
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing
The paper presents a concise summary of our work for the ML-ESG-2 shared task, focusing exclusively on the Chinese and English datasets. ML-ESG-2 aims to ascertain the influence of news articles on corporations, specifically from an ESG perspective. To this end, we explored how well key information supports impact identification and experimented with techniques at different levels of granularity. For instance, we incorporated important information at the word level with TF-IDF, at the sentence level with TextRank, and at the document level with summarization. The final results show that the system using GPT-4 for summarization yields the best predictions.