An Empirical Study of Arabic Formulaic Sequence Extraction Methods

Ayman Alghamdi, Eric Atwell, Claire Brierley


Abstract
This paper implements what is referred to as the collocation of the Arabic keywords approach for extracting formulaic sequences (FSs) in the form of high-frequency but semantically regular formulas that are not restricted to any syntactic construction or semantic domain. The study applies several distributional semantic models in order to automatically extract relevant FSs related to Arabic keywords. The data sets used in this experiment are drawn from a newly developed corpus-based Arabic wordlist consisting of 5,189 lexical items, which represent a variety of Modern Standard Arabic (MSA) genres and regions. The wordlist is based on overlapping frequency, established through a comprehensive comparison of four large Arabic corpora with a total size of over 8 billion running words. Empirical n-best precision evaluation methods are used to determine the best association measures (AMs) for extracting high-frequency and meaningful FSs. The gold standard reference list of FSs was developed in previous studies and manually evaluated against well-established quantitative and qualitative criteria. The results demonstrate that the MI.log_f AM performed best in extracting significant FSs from the large MSA corpus, while the T-score AM performed worst.
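The evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names the two association measures but gives no formulas, so the definitions below are assumptions that follow the commonly documented forms (T-score as observed minus expected co-occurrence over the square root of the observed count; MI.log_f as the MI score multiplied by the natural log of the co-occurrence count plus one). The function names, argument names, and the `rank_candidates` helper are hypothetical.

```python
import math


def t_score(f_xy, f_x, f_y, n_tokens):
    """T-score (assumed form): (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n_tokens
    return (f_xy - expected) / math.sqrt(f_xy)


def mi_log_f(f_xy, f_x, f_y, n_tokens):
    """MI.log_f (assumed form): MI score scaled by ln(f_xy + 1)."""
    mi = math.log2(f_xy * n_tokens / (f_x * f_y))
    return mi * math.log(f_xy + 1)


def rank_candidates(pair_counts, word_counts, n_tokens, measure):
    """Sort candidate pairs from highest to lowest association score.

    pair_counts maps (keyword, collocate) to a co-occurrence count (> 0);
    word_counts maps each word to its corpus frequency.
    """
    scores = {
        pair: measure(f_xy, word_counts[pair[0]], word_counts[pair[1]], n_tokens)
        for pair, f_xy in pair_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def n_best_precision(ranked, gold, n):
    """Share of the top-n ranked candidates that appear in the gold list."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for candidate in top if candidate in gold) / len(top)
```

Under these assumptions, each measure would be used to rank the candidates for a keyword, and `n_best_precision` would be computed at several cutoffs `n` against the gold standard list so that the measures can be compared on the same candidates.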
Anthology ID:
L16-1080
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
502–506
URL:
https://aclanthology.org/L16-1080
Cite (ACL):
Ayman Alghamdi, Eric Atwell, and Claire Brierley. 2016. An Empirical Study of Arabic Formulaic Sequence Extraction Methods. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 502–506, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
An Empirical Study of Arabic Formulaic Sequence Extraction Methods (Alghamdi et al., LREC 2016)
PDF:
https://aclanthology.org/L16-1080.pdf