Joan Codina-Filba

Also published as: Joan Codina, Joan Codina-Filbà, Joan Codina-Filbá

2021

Evaluating language models for the retrieval and categorization of lexical collocations
Luis Espinosa Anke | Joan Codina-Filba | Leo Wanner
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Lexical collocations are idiosyncratic combinations of two syntactically bound lexical items (e.g., “heavy rain” or “take a step”). Understanding their degree of compositionality and idiosyncrasy, as well their underlying semantics, is crucial for language learners, lexicographers and downstream NLP applications. In this paper, we perform an exhaustive analysis of current language models for collocation understanding. We first construct a dataset of apparitions of lexical collocations in context, categorized into 17 representative semantic categories. Then, we perform two experiments: (1) unsupervised collocate retrieval using BERT, and (2) supervised collocation classification in context. We find that most models perform well in distinguishing light verb constructions, especially if the collocation’s first argument acts as subject, but often fail to distinguish, first, different syntactic structures within the same semantic category, and second, fine-grained semantic categories which restrict the use of small sets of valid collocates for a given base.

2020

pdf bib abs

CollFrEn: Rich Bilingual English–French Collocation Resource
Beatriz Fisas | Luis Espinosa Anke | Joan Codina-Filbá | Leo Wanner
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

Collocations in the sense of idiosyncratic lexical co-occurrences of two syntactically bound words traditionally pose a challenge to language learners and many Natural Language Processing (NLP) applications alike. Reliable ground truth (i.e., ideally manually compiled) resources are thus of high value. We present a manually compiled bilingual English–French collocation resource with 7,480 collocations in English and 6,733 in French. Each collocation is enriched with information that facilitates its downstream exploitation in NLP tasks such as machine translation, word sense disambiguation, natural language generation, relation classification, and so forth. Our proposed enrichment covers: the semantic category of the collocation (its lexical function), its vector space representation (for each individual word as well as their joint collocation embedding), a subcategorization pattern of both its elements, as well as their corresponding BabelNet id, and finally, indices of their occurrences in large scale reference corpora.

2016

pdf bib abs

Praat on the Web: An Upgrade of Praat for Semi-Automatic Speech Annotation
Mónica Domínguez | Iván Latorre | Mireia Farrús | Joan Codina-Filbà | Leo Wanner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

This paper presents an implementation of the widely used speech analysis tool Praat as a web application with an extended functionality for feature annotation. In particular, Praat on the Web addresses some of the central limitations of the original Praat tool and provides (i) enhanced visualization of annotations in a dedicated window for feature annotation at interval and point segments, (ii) a dynamic scripting composition exemplified with a modular prosody tagger, and (iii) portability and an operational web interface. Speech annotation tools with such a functionality are key for exploring large corpora and designing modular pipelines.

pdf bib abs

Towards Multiple Antecedent Coreference Resolution in Specialized Discourse
Alicia Burga | Sergio Cajal | Joan Codina-Filbà | Leo Wanner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Despite the popularity of coreference resolution as a research topic, the overwhelming majority of the work in this area focused so far on single antecedence coreference only. Multiple antecedent coreference (MAC) has been largely neglected. This can be explained by the scarcity of the phenomenon of MAC in generic discourse. However, in specialized discourse such as patents, MAC is very dominant. It seems thus unavoidable to address the problem of MAC resolution in the context of tasks related to automatic patent material processing, among them abstractive summarization, deep parsing of patents, construction of concept maps of the inventions, etc. We present the first version of an operational rule-based MAC resolution strategy for patent material that covers the three major types of MAC: (i) nominal MAC, (ii) MAC with personal / relative pronouns, and MAC with reflexive / reciprocal pronouns. The evaluation shows that our strategy performs well in terms of precision and recall.

2014

pdf bib abs

An Exercise in Reuse of Resources: Adapting General Discourse Coreference Resolution for Detecting Lexical Chains in Patent Documentation
Nadjet Bouayad-Agha | Alicia Burga | Gerard Casamayor | Joan Codina | Rogelio Nazar | Leo Wanner
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Stanford Coreference Resolution System (StCR) is a multi-pass, rule-based system that scored best in the CoNLL 2011 shared task on general discourse coreference resolution. We describe how the StCR has been adapted to the specific domain of patents and give some cues on how it can be adapted to other domains. We present a linguistic analysis of the patent domain and how we were able to adapt the rules to the domain and to expand coreferences with some lexical chains. A comparative evaluation shows an improvement of the coreference resolution system, denoting that (i) StCR is a valuable tool across different text genres; (ii) specialized discourse NLP may significantly benefit from general discourse NLP research.

pdf bib

Improving Collocation Correction by Ranking Suggestions Using Linguistic Knowledge
Roberto Carlini | Joan Codina-Filba | Leo Wanner
Proceedings of the third workshop on NLP for computer-assisted language learning