Comparing Recurring Lexico-Syntactic Trees (RLTs) and Ngram Techniques for Extended Phraseology Extraction

Agnès Tutin, Olivier Kraif


Abstract
This paper assesses to what extent a syntax-based method, Recurring Lexico-syntactic Tree (RLT) extraction, allows us to extract large phraseological units such as prefabricated routines, e.g. “as previously said” or “as far as we/I know” in scientific writing. To evaluate this method, we compare it to the classical ngram extraction technique on a subset of recurring segments including speech verbs in a French corpus of scientific writing. Results show that the RLT extraction technique is far more efficient for extended MWEs such as routines or collocations, but performs more poorly for surface phenomena such as syntactic constructions or fully frozen expressions.
Anthology ID:
W17-1724
Volume:
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Month:
April
Year:
2017
Address:
Valencia, Spain
Venue:
MWE
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
176–180
URL:
https://aclanthology.org/W17-1724
DOI:
10.18653/v1/W17-1724
Cite (ACL):
Agnès Tutin and Olivier Kraif. 2017. Comparing Recurring Lexico-Syntactic Trees (RLTs) and Ngram Techniques for Extended Phraseology Extraction. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 176–180, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Comparing Recurring Lexico-Syntactic Trees (RLTs) and Ngram Techniques for Extended Phraseology Extraction (Tutin & Kraif, MWE 2017)
PDF:
https://aclanthology.org/W17-1724.pdf