Jean-Pierre Colson

2020

HMSid and HMSid2 at PARSEME Shared Task 2020: Computational Corpus Linguistics and unseen-in-training MWEs
Jean-Pierre Colson
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

This paper is a system description of HMSid, officially submitted to the PARSEME Shared Task 2020 for one language (French), in the open track. It also describes HMSid2, sent to the organizers of the workshop after the deadline and using the same methodology but in the closed track. Neither system relies on machine learning; both rely on computational corpus linguistics. Their scores for unseen MWEs are very promising, especially in the case of HMSid2, which would have received the best score for unseen MWEs in the French closed track.

Extracting meaning by idiomaticity: Description of the HSemID system at CogALex VI (2020)
Jean-Pierre Colson
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

The HSemID system, submitted to the CogALex VI Shared Task, is a hybrid system relying mainly on metric clusters measured in large web corpora, complemented by a vector space model using cosine similarity to detect semantic associations. Although the system reached rather weak results for the subcategories of synonyms, antonyms and hypernyms, with some differences from one language to another, it is able to measure general semantic associations (as being random or not-random) with an F1 score close to 0.80. The results strongly suggest that idiomatic constructions play a fundamental role in semantic associations. Further experiments are necessary in order to fine-tune the model to the subcategories of synonyms, antonyms and hypernyms, and to explain surprising differences across languages.

2018

From Chinese Word Segmentation to Extraction of Constructions: Two Sides of the Same Algorithmic Coin
Jean-Pierre Colson
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper presents the results of two experiments carried out within the framework of computational construction grammar. Starting from the constructionist point of view that there are just constructions in language, including lexical ones, we tested the validity of a clustering algorithm that was primarily designed for MWE extraction, the cpr-score (Colson, 2017), on Chinese word segmentation (CWS). Our results indicate a striking recall rate of 75 percent without any special adaptation to Chinese or to the lexicon, which confirms that there is some similarity between extracting MWEs and CWS. Our second experiment also suggests that the same methodology might be used for extracting more schematic or abstract constructions, thereby providing evidence for the statistical foundation of construction grammar.
