Ana-Maria Barbu

Also published as: Ana Maria Barbu


2023

Designing the LECOR Learner Corpus for Romanian
Ana Maria Barbu | Elena Irimia | Carmen Mîrzea Vasile | Vasile Păiș
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

This article presents a work-in-progress project that aims to build and use a corpus of Romanian texts written or spoken by non-native students of different nationalities who learn Romanian as a foreign language in the one-year intensive academic program organized by the University of Bucharest. The corpus, called LECOR (Learner Corpus for Romanian), is made up of pairs of texts: the student's version and the teacher's corrected version. Each version is automatically annotated with lemma and POS tag; the two versions are then compared, and at this stage the differences are marked as errors. The corpus also contains sets of metadata files about the students and their samples. The article presents the conceptual framework for building and using the corpus, including the acquisition and organization of the primary material, the annotation process, and the first attempts to adapt the NoSketch Engine query interface to the project's objectives. It concludes by outlining the next steps in the development of the corpus: enlarging it and further developing the error correction process and the complex error annotation.
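The comparison step described in the abstract (student version against teacher version, with differences marked as candidate errors) can be illustrated with a minimal sketch. It assumes whitespace tokenization and uses Python's standard `difflib`. The function name `mark_differences` and the sentence pair are hypothetical and invented for illustration; this is not the LECOR project's actual pipeline.

```python
from difflib import SequenceMatcher


def mark_differences(student_tokens, corrected_tokens):
    """Return the spans where the student and corrected token lists differ."""
    # autojunk=False keeps frequent tokens from being ignored on long inputs.
    matcher = SequenceMatcher(a=student_tokens, b=corrected_tokens, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical spans are not candidate errors
        # tag is one of "replace", "delete", "insert"
        differences.append((tag, student_tokens[i1:i2], corrected_tokens[j1:j2]))
    return differences


# Hypothetical sentence pair, invented for illustration only.
student = "Eu merge la școală ieri".split()
corrected = "Eu am mers la școală ieri".split()

for tag, student_span, corrected_span in mark_differences(student, corrected):
    print(tag, student_span, corrected_span)
```

A diff of this kind only locates where the two versions diverge; classifying the type of error would need the lemma and POS annotation mentioned in the abstract.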

2022

Aligning the Romanian Reference Treebank and the Valence Lexicon of Romanian Verbs
Ana-Maria Barbu | Verginica Barbu Mititelu | Cătălin Mititelu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present the effort to align two language resources for Romanian: the Romanian Reference Treebank and the Valence Lexicon of Romanian Verbs. For each treebank occurrence of a verb that has an entry in the lexicon, a set of valence frames is automatically assigned, then manually validated and, when necessary, corrected. Validating a valence frame also means semantically disambiguating the verb in the respective context. The validation is done by two linguists on complementary datasets; a subset of verbs was validated by both annotators, and Cohen's κ for this subset is 0.87. The alignment also serves as a method of enhancing the quality of the two resources, since in the process we identify morpho-syntactic annotation mistakes and incomplete or missing valence frames. Information from each resource complements the information from the other, which increases the value of both. The treebank and the lexicon are freely available, and the links discovered between them are also made available on GitHub.
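For readers unfamiliar with the agreement measure reported above, Cohen's κ is defined as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two annotators and p_e is the agreement expected by chance from each annotator's label distribution. The sketch below computes it from two parallel label lists. The function name `cohen_kappa` and the toy labels are assumptions made for illustration; they are not the paper's data and do not reproduce the reported 0.87.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical valence-frame identifiers), for illustration only.
annotator_1 = ["frame1", "frame1", "frame2", "frame2"]
annotator_2 = ["frame1", "frame1", "frame2", "frame1"]
print(cohen_kappa(annotator_1, annotator_2))
```

In the toy example the annotators agree on 3 of 4 items (p_o = 0.75) while chance agreement is p_e = 0.5, so κ = 0.5.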

2008

Romanian Lexical Data Bases: Inflected and Syllabic Forms Dictionaries
Ana-Maria Barbu
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents two lexical databases for Romanian: RoMorphoDict, a dictionary of inflected forms, and RoSyllabiDict, a dictionary of syllabified inflected forms. Each database is available in two Unicode formats: text and XML. In text format, a RoMorphoDict entry contains the inflected form, its lemma, its morpho-syntactic description and the marking of the stressed vowel in pronunciation; in XML format, an entry represents the whole paradigm of a word and contains further information about roots and paradigm class. In both formats, a RoSyllabiDict entry contains the unsyllabified word, its syllabified counterpart, and, where applicable, grammatical information and/or the type of syllabification. The stressed vowel is also marked on the syllabified form. Each lexical database includes the inflected forms of about 65,000 lemmas, that is, over 700,000 entries in RoMorphoDict and over 500,000 entries in RoSyllabiDict. Both resources are available for free. The paper describes in detail the content of these databases and the procedure for building them.

2006

Romanian Valence Dictionary in XML Format
Ana-Maria Barbu | Emil Ionescu | Verginica Barbu Mititelu
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Valence dictionaries are dictionaries in which logical predicates (most often verbs) are inventoried together with the semantic and syntactic information regarding the role of the arguments with which they combine, as well as the syntactic restrictions these arguments have to obey. In this article we present the initial stage of the project “Syntactic and semantic database in XML format: an HPSG representation of verb valences in Romanian”. Its aim is the development of a valence dictionary in XML format for a set of 3000 Romanian verbs. Valences are specified for each sense of each verb, together with an illustrative example, possible argument alternations and a set of multiword expressions in which the verb occurs with that sense. The grammatical formalism we use is Head-driven Phrase Structure Grammar (HPSG), which offers one of the most comprehensive frameworks for encoding various types of linguistic information for lexical items. XML is the most appropriate mark-up language for describing information structured in the HPSG framework. The project can later be extended to cover all Romanian verbs (around 7000) and also other predicates (nouns, adjectives, prepositions).

2004

A Word Alignment System Based on a Translation Equivalence Extractor
Ana-Maria Barbu
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

2003

TREQ-AL: A word alignment system with limited language resources
Dan Tufiş | Ana-Maria Barbu | Radu Ion
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

2002

Lexical token alignment: experiments, results and applications
Dan Tufiş | Ana-Maria Barbu
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)