2023
Annotation of lexical bundles with discourse functions in a Spanish academic corpus
Eleonora Guzzi | Margarita Alonso-Ramos | Marcos Garcia | Marcos García Salido
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
This paper describes the annotation of 996 lexical bundles (LBs) assigned to 39 different discourse functions in a Spanish academic corpus. The purpose of the annotation is to obtain a new Spanish gold-standard corpus of 1,800,000 words that is useful for training and evaluating computational models capable of automatically identifying LBs for each context in new corpora, as well as for linguistic analysis of the role of LBs in academic discourse. The annotation process revealed that the correspondence between LBs and discourse functions is not one-to-one and that the degree of ambiguity is high, so the linguists’ contribution has been essential for improving the automatic assignment of tags.
2019
Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.
Marcos Garcia | Marcos García Salido | Susana Sotelo | Estela Mosqueira | Margarita Alonso-Ramos
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
This paper presents a new multilingual corpus with semantic annotation of collocations in English, Portuguese, and Spanish. The whole resource contains 155k tokens and 1,526 collocations labeled in context. The annotated examples belong to three syntactic relations (adjective-noun, verb-object, and nominal compounds), and represent 58 lexical functions in the Meaning-Text Theory (e.g., Oper, Magn, Bon). Each collocation was annotated by three linguists and the final resource was revised by a team of experts. The resulting corpus can serve as a basis to evaluate different approaches for collocation identification, which in turn can be useful for different NLP tasks such as natural language understanding or natural language generation.
A comparison of statistical association measures for identifying dependency-based collocations in various languages.
Marcos Garcia | Marcos García Salido | Margarita Alonso-Ramos
Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019)
This paper presents an exploration of different statistical association measures to automatically identify collocations from corpora in English, Portuguese, and Spanish. To evaluate the impact of the association metrics, we manually annotated corpora with three different syntactic patterns of collocations (adjective-noun, verb-object, and nominal compounds). We took advantage of the PARSEME 1.1 Shared Task corpora by selecting a subset of 155k tokens in the three languages, in which we annotated 1,526 collocations with the corresponding Lexical Functions according to the Meaning-Text Theory. Using the resulting gold standard, we have carried out a comparison between frequency data and several well-known association measures, both symmetric and asymmetric. The results show that the combination of dependency triples with raw frequency information is as powerful as the best association measures in most syntactic patterns and languages. Furthermore, despite the asymmetric behaviour of collocations, directional approaches perform worse than symmetric ones in the extraction of these phraseological combinations.
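As a rough illustration of the kind of comparison this abstract describes, the sketch below ranks dependency triples by raw frequency and by pointwise mutual information (PMI), one well-known symmetric association measure. The toy triples, the function name `rank_candidates`, and the choice of PMI are assumptions made for illustration; the paper's own measures and data are not reproduced here.

```python
import math
from collections import Counter


def rank_candidates(triples):
    """Rank (head, relation, dependent) triples by raw frequency and by PMI.

    PMI is computed within each relation:
    log2( p(head, dep) / (p(head) * p(dep)) ).
    """
    pair_counts = Counter(triples)
    head_counts = Counter((h, r) for h, r, _ in triples)
    dep_counts = Counter((r, d) for _, r, d in triples)
    rel_counts = Counter(r for _, r, _ in triples)

    pmi = {}
    for (h, r, d), n in pair_counts.items():
        total = rel_counts[r]
        p_pair = n / total
        p_head = head_counts[(h, r)] / total
        p_dep = dep_counts[(r, d)] / total
        pmi[(h, r, d)] = math.log2(p_pair / (p_head * p_dep))

    by_frequency = [t for t, _ in pair_counts.most_common()]
    by_pmi = sorted(pmi, key=pmi.get, reverse=True)
    return by_frequency, by_pmi


# Hypothetical toy data: verb-object triples, not taken from the corpus.
toy_triples = (
    [("pay", "obj", "attention")] * 3
    + [("take", "obj", "walk")]
    + [("pay", "obj", "walk")]
)
by_frequency, by_pmi = rank_candidates(toy_triples)
print("Top by frequency:", by_frequency[0])
print("Top by PMI:", by_pmi[0])
```

On this toy input the two rankings disagree about the top candidate, because PMI tends to reward pairs whose members rarely occur with anything else. Comparing such rankings against a gold standard is the kind of evaluation the abstract refers to.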
2018
A Lexical Tool for Academic Writing in Spanish based on Expert and Novice Corpora
Marcos García Salido | Marcos García | Milka Villayandre-Llamazares | Margarita Alonso-Ramos
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2017
Using bilingual word-embeddings for multilingual collocation extraction
Marcos Garcia | Marcos García-Salido | Margarita Alonso-Ramos
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
This paper presents a new strategy for multilingual collocation extraction which takes advantage of parallel corpora to learn bilingual word-embeddings. Monolingual collocation candidates are retrieved using Universal Dependencies, and the distributional models are then applied to search for equivalents of the elements of each collocation in the target languages. The proposed method extracts not only collocation equivalents with direct translation between languages, but also cases where the collocations in the two languages are not literal translations of each other. Several experiments (evaluating collocations with three syntactic patterns) in English, Spanish, and Portuguese show that our approach can effectively extract a large number of pairs of bilingual equivalents with an average precision of about 90%. Moreover, preliminary results on comparable corpora suggest that the distributional models can be applied to identify new bilingual collocations in different domains.
2010
Towards a Motivated Annotation Schema of Collocation Errors in Learner Corpora
Margarita Alonso Ramos | Leo Wanner | Orsolya Vincze | Gerard Casamayor del Bosque | Nancy Vázquez Veiga | Estela Mosqueira Suárez | Sabela Prieto González
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Collocations play a significant role in second language acquisition. In order to offer efficient support to learners, an NLP-based CALL environment for learning collocations should be based on a representative learner corpus annotated with collocation errors. However, so far, no theoretically motivated collocation error tag set is available. Existing learner corpora tag collocation errors simply as lexical errors, which is clearly insufficient given the wide range of different collocation errors that learners make. In this paper, we present a fine-grained three-dimensional typology of collocation errors that has been derived in an empirical study from the learner corpus CEDEL2, compiled by a team at the Autonomous University of Madrid. The first dimension captures whether the error concerns the collocation as a whole or one of its elements; the second dimension captures the language-oriented error analysis, while the third exemplifies the interpretative error analysis. To facilitate a smooth annotation along this typology, we adapted Knowtator, a flexible off-the-shelf annotation tool implemented as a Protégé plugin.
2008
Using Semantically Annotated Corpora to Build Collocation Resources
Margarita Alonso Ramos | Owen Rambow | Leo Wanner
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We present an experiment in extracting collocations from the FrameNet corpus, specifically, support verbs such as "direct" in "Environmentalists directed strong criticism at world leaders". Support verbs do not contribute meaning of their own, and the meaning of the construction is provided by the noun; the recognition of support verbs is thus useful in text understanding. Having access to a list of support verbs is also useful in applications that can benefit from paraphrasing, such as generation (where paraphrasing can provide variety). This paper starts with a brief presentation of support verbs in Meaning-Text Theory, where they fall under the notion of lexical function, and then discusses how relevant information is encoded in the FrameNet corpus. We describe the resource extracted from the FrameNet corpus.
2006
Local Document Relevance Clustering in IR Using Collocation Information
Leo Wanner | Margarita Alonso Ramos
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
A series of different automatic query expansion techniques has been suggested in Information Retrieval. To estimate how suitable a document term is as an expansion term, the most popular of them use a measure of the frequency of the co-occurrence of this term with one or several query terms. The benefit of using the linguistic relations that hold between query terms is often questioned. If a linguistic phenomenon is taken into account, it is the phrase structure or lexical compound. We propose a technique that is based on the restricted lexical co-occurrence (collocation) of query terms. We use the knowledge of collocations formed by query terms for two tasks: (i) document relevance clustering done in the first stage of local query expansion and (ii) choice of suitable expansion terms from the relevant document cluster. In this paper, we describe the first task, providing evidence from preliminary experiments on Spanish material that local relevance clustering benefits largely from knowledge of collocations.
2004
Enriching the Spanish EuroWordNet by Collocations
Leo Wanner | Margarita Alonso Ramos | Antonia Martí
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)