2020
Adding a Syntactic Annotation Level to the Corpus of Contemporary Romanian Language
Andrei Scutelnicu | Catalina Maranduc | Dan Cristea
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
In this paper we present an experiment in augmenting the Corpus of Contemporary Romanian Language (CoRoLa) with a syntactic level of annotation, which would allow users to query the syntax of Romanian sentences in the Universal Dependencies model. After a short introduction to CoRoLa, we describe the treebanks used to train the dependency parser, then present the evaluation results and the process of upgrading CoRoLa with the new level of annotation. Of three variants trained on manually built treebanks, we chose the parser with the best accuracy in recognizing heads and relations.
Keywords: syntactic annotation, treebank, corpus, MaltParser
2017
Syntactic Semantic Correspondence in Dependency Grammar
Cătălina Mărănduc | Cătălin Mititelu | Victoria Bobicev
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories
A Multiform Balanced Dependency Treebank for Romanian
Mihaela Colhon | Cătălina Mărănduc | Cătălin Mititelu
Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017
The UAIC-RoDia-DepTb is a balanced treebank containing texts in non-standard language: 2,575 chat sentences, old Romanian texts (a Gospel printed in 1648, a codex of laws printed in 1818, a novel written in 1910), regional popular poetry, legal texts, Romanian and foreign fiction, and quotations. The proportions are comparable; each of these text types is represented by subsets of at least 1,000 phrases, so that the parser can be trained on their peculiarities. Annotation of the treebank started in 2007, and it uses classical tags, such as those of school grammar, with the intention of using the resource for didactic purposes. The classification of circumstantial modifiers is rich in semantic information. This paper presents the ongoing development of the resource, which has been automatically annotated and then entirely manually corrected. We are adding new texts and making the treebank available in more formats, keeping all the annotated morphological and syntactic information and adding logical-semantic information. We describe two conversions: from the classic syntactic format into the Universal Dependencies format, and into a logical-semantic layer, which is presented briefly.
A Diachronic Corpus for Romanian (RoDia)
Ludmila Malahov | Cătălina Mărănduc | Alexandru Colesnicov
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
This paper describes a Romanian Dependency Treebank built at the Al. I. Cuza University (UAIC) and the special OCR techniques used to build it. The corpus has rich morphological and syntactic annotation. There are few representative annotated corpora for Romanian, and the existing ones focus mainly on the contemporary Romanian standard. The corpus described here focuses on non-standard aspects of the language: Regional and Old Romanian. Intending to participate in the PROIEL project, which aligns the oldest New Testaments, we annotate the first printed Romanian New Testament (Alba Iulia, 1648). We began by applying the UAIC tools for the morphological and syntactic processing of Contemporary Romanian to the first quarter of the book (second edition). By carefully correcting by hand the output of the automated annotation, which had modest accuracy, we obtained a sub-corpus for training tools for processing Old Romanian. The first edition of the New Testament, however, is written in Cyrillic letters. Books printed in the Old Cyrillic alphabet are a common problem for Romania and the Republic of Moldova, the countries where Romanian is spoken, and one to be solved by the joint efforts of NLP researchers in the two countries.
Tools for Building a Corpus to Study the Historical and Geographical Variation of the Romanian Language
Victoria Bobicev | Cătălina Mărănduc | Cenel Augusto Perez
Proceedings of the First Workshop on Language technology for Digital Humanities in Central and (South-)Eastern Europe
Contemporary standard language corpora are ideal for NLP. There are few morphologically and syntactically annotated corpora for Romanian, and those that exist or are in progress deal only with the contemporary Romanian standard. However, the need to study the dynamics of natural languages has given rise to balanced corpora that contain non-standard texts. In this paper, we describe the creation of tools for processing non-standard Romanian in order to build a large balanced corpus. We want to preserve in annotated form as many early stages of the language as possible. We have already built a corpus of Old Romanian. We also intend to include the South-Danube dialects, which are remote from the standard language, along with regional forms closer to the standard. We try to preserve data about endangered idioms such as the Aromanian, Meglenoromanian and Istroromanian dialects, and to calculate the distance between different regional variants, including the language spoken in the Republic of Moldova. This distance, together with mutual intelligibility between speakers, is the correct criterion for classifying idioms as different languages, as dialects, or as regional variants close to the standard.
2015
Universal and Language-specific Dependency Relations for Analysing Romanian
Verginica Barbu Mititelu | Cătălina Mărănduc | Elena Irimia
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)