mwetoolkit-lib: Adaptation of the mwetoolkit as a Python Library and an Application to MWE-based Document Clustering

Fernando Zagatti, Paulo Augusto de Lima Medeiros, Esther da Cunha Soares, Lucas Nildaimon dos Santos Silva, Carlos Ramisch, Livy Real


Abstract
This paper introduces the mwetoolkit-lib, an adaptation of the mwetoolkit as a python library. The original toolkit performs the extraction and identification of multiword expressions (MWEs) in large text bases through the command line. One of the contributions of our work is the adaptation of the MWE extraction pipeline from the mwetoolkit, allowing its usage in python development environments and integration in larger pipelines. The other contribution is the execution of a pilot experiment aiming to show the impact of MWE discovery in data professionals’ work. This experiment found that the addition of MWE knowledge to the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization altered the word relevance order, improving the linguistic quality of the clusters returned by k-means method.
Anthology ID:
2022.mwe-1.16
Volume:
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
MWE
SIG:
SIGLEX
Publisher:
European Language Resources Association
Note:
Pages:
112–117
Language:
URL:
https://aclanthology.org/2022.mwe-1.16
DOI:
Bibkey:
Cite (ACL):
Fernando Zagatti, Paulo Augusto de Lima Medeiros, Esther da Cunha Soares, Lucas Nildaimon dos Santos Silva, Carlos Ramisch, and Livy Real. 2022. mwetoolkit-lib: Adaptation of the mwetoolkit as a Python Library and an Application to MWE-based Document Clustering. In Proceedings of the 18th Workshop on Multiword Expressions @LREC2022, pages 112–117, Marseille, France. European Language Resources Association.
Cite (Informal):
mwetoolkit-lib: Adaptation of the mwetoolkit as a Python Library and an Application to MWE-based Document Clustering (Zagatti et al., MWE 2022)
Copy Citation:
PDF:
https://aclanthology.org/2022.mwe-1.16.pdf