Carol Luca Gasan


2022

Romanian micro-blogging named entity recognition including health-related entities
Vasile Pais | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Carol Luca Gasan | Roxana Micu
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper introduces a manually annotated dataset for named entity recognition (NER) in micro-blogging text for the Romanian language. It contains gold annotations for 9 entity classes and expressions: persons, locations, organizations, time expressions, legal references, disorders, chemicals, medical devices and anatomical parts. Furthermore, word embedding models computed on a larger micro-blogging corpus are made available. Finally, several NER models are trained and their performance is evaluated against the newly introduced corpus.

Challenges in Creating a Representative Corpus of Romanian Micro-Blogging Text
Vasile Pais | Maria Mitrofan | Verginica Barbu Mititelu | Elena Irimia | Roxana Micu | Carol Luca Gasan
Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-10)

Following the successful creation of a national representative corpus of the contemporary Romanian language, we turned our attention to social media text, as found on micro-blogging platforms. In this paper, we present the current activities as well as the challenges faced when trying to apply existing tools (for both annotation and indexing) to a Romanian-language micro-blogging corpus. These challenges are encountered at all annotation levels, including tokenization, and at the indexing stage. We consider that existing tools for Romanian language processing must be adapted to recognize features such as emoticons, emojis, hashtags, unusual abbreviations, elongated words (commonly used for emphasis in micro-blogging), multiple words joined together (within or outside hashtags), and code-mixed text.

2021

Named Entity Recognition in the Romanian Legal Domain
Vasile Pais | Maria Mitrofan | Carol Luca Gasan | Vlad Coneschi | Alexandru Ianov
Proceedings of the Natural Legal Language Processing Workshop 2021

Recognition of named entities present in text is an important step towards information extraction and natural language understanding. This work presents a named entity recognition system for the Romanian legal domain. The system makes use of the gold annotated LegalNERo corpus. Furthermore, the system combines multiple distributional representations of words, including word embeddings trained on a large legal domain corpus. All the resources, including the corpus, model, and word embeddings, are open-sourced. Finally, the best system is available for direct usage in the RELATE platform.