A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus

Orhan Bilgin


Abstract
This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e., it extracts MWEs that contain the item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an ngram that constitutes an MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies, and determining, for each individual input, one or more sub-sequences that have the highest probability of being an MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, thus producing a single list of score-ranked MWE candidates, without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested ngrams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data, and performs better than the baseline method even at n=1000.
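The abstract's core idea, comparing a matrix of observed frequencies with a matrix of expected frequencies inside a window around the node, can be illustrated with a minimal sketch. This is not the paper's implementation: the independence-based expected frequency, the log-ratio score, the `min_count` threshold, the toy corpus, and all function names (`ngram_counts`, `frequency_matrices`, `best_span`) are assumptions made purely for illustration.

```python
from collections import Counter
from math import log


def ngram_counts(sentences, max_n):
    """Count every ngram of length 1..max_n in a tokenized corpus."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequency_matrices(window, node_idx, counts, total_tokens):
    """Build observed/expected matrices indexed by [start][end] (inclusive).

    Only sub-sequences that contain the node are filled in. The expected
    frequency assumes token independence (an illustrative assumption).
    """
    size = len(window)
    observed = [[0.0] * size for _ in range(size)]
    expected = [[0.0] * size for _ in range(size)]
    for start in range(node_idx + 1):
        for end in range(node_idx, size):
            if start == end:
                continue  # skip the bare node; it is not a multiword span
            span = tuple(window[start:end + 1])
            observed[start][end] = float(counts[span])
            prob = 1.0
            for tok in span:
                prob *= counts[(tok,)] / total_tokens
            expected[start][end] = prob * total_tokens
    return observed, expected


def best_span(window, node_idx, counts, total_tokens, min_count=2):
    """Return the node-containing span with the highest log(observed/expected)."""
    observed, expected = frequency_matrices(window, node_idx, counts, total_tokens)
    best, best_score = None, float("-inf")
    for start in range(node_idx + 1):
        for end in range(node_idx, len(window)):
            o, e = observed[start][end], expected[start][end]
            if o < min_count or e <= 0.0:
                continue  # hypothetical frequency floor to damp one-off ngrams
            score = log(o / e)
            if score > best_score:
                best, best_score = tuple(window[start:end + 1]), score
    return best, best_score


# Toy corpus (invented for this example only).
corpus = [s.split() for s in [
    "he will kick the bucket soon",
    "they kick the bucket often",
    "we kick the bucket now",
    "he saw the dog soon",
    "they saw the cat often",
]]
counts = ngram_counts(corpus, max_n=6)
total = sum(len(s) for s in corpus)
window = corpus[0]
span, score = best_span(window, window.index("bucket"), counts, total)
print(span)  # ('kick', 'the', 'bucket')
```

In this toy setting the winning span starts and ends where the observed frequency departs most sharply from the independence estimate. The score-aggregation and ranking stage described in the abstract would then combine such per-input winners across many windows, a step this sketch does not attempt.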
Anthology ID:
2022.mwe-1.7
Volume:
Proceedings of the 18th Workshop on Multiword Expressions @LREC2022
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Archna Bhatia, Paul Cook, Shiva Taslimipoor, Marcos Garcia, Carlos Ramisch
Venue:
MWE
SIG:
SIGLEX
Publisher:
European Language Resources Association
Pages:
37–48
URL:
https://aclanthology.org/2022.mwe-1.7
Cite (ACL):
Orhan Bilgin. 2022. A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus. In Proceedings of the 18th Workshop on Multiword Expressions @LREC2022, pages 37–48, Marseille, France. European Language Resources Association.
Cite (Informal):
A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus (Bilgin, MWE 2022)
PDF:
https://aclanthology.org/2022.mwe-1.7.pdf
Optional supplementary material:
2022.mwe-1.7.OptionalSupplementaryMaterial.pdf
Code:
melanuria/mwe_extractor