Dan Cristea


2020

A dual-encoding system for dialect classification
Petru Rebeja | Dan Cristea
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

In this paper we present the architecture, processing pipeline and results of the ensemble model developed for the Romanian Dialect Identification task. The ensemble model consists of two TF-IDF encoders and a deep learning model that together classify input samples based on the writing patterns specific to each of the two dialects. Although the model performs well on the training set, its performance degrades heavily on the evaluation set. The drop in performance is due to a design decision that makes the model put too much weight on the presence or absence of textual marks when determining the sample label.

Adding a Syntactic Annotation Level to the Corpus of Contemporary Romanian Language
Andrei Scutelnicu | Catalina Maranduc | Dan Cristea
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

In this paper we present an experiment in augmenting the Corpus of Contemporary Romanian Language (CoRoLa) with a syntactic level of annotation, which would allow users to query the syntax of Romanian sentences in the Universal Dependency model. After a short introduction to CoRoLa, we describe the treebanks used to train the dependency parser, and we show the evaluation results and the process of upgrading CoRoLa with the new level of annotation. Out of three variants trained on manually built treebanks, the parser with the best accuracy in recognizing heads and relations was chosen.
Keywords: syntactic annotation, treebank, corpus, MaltParser

CoBiLiRo: A Research Platform for Bimodal Corpora
Dan Cristea | Ionuț Pistol | Șerban Boghiu | Anca-Diana Bibiri | Daniela Gîfu | Andrei Scutelnicu | Mihaela Onofrei | Diana Trandabăț | George Bugeag
Proceedings of the 1st International Workshop on Language Technology Platforms

This paper describes the ongoing work carried out within the CoBiLiRo (Bimodal Corpus for Romanian Language) research project, part of ReTeRom (Resources and Technologies for Developing Human-Machine Interfaces in Romanian). Data annotation finds increasing use in speech recognition and synthesis, with the goal of supporting learning processes. In this context, a variety of annotation systems for Speech and Text Processing environments have been presented. Although many designs for the data annotation workflow have emerged, the handling of metadata to manage complex user-defined annotations is not sufficiently covered. We propose a format design intended to serve as an annotation standard for bimodal resources, which facilitates searching, editing and statistical analysis operations over them. The design and implementation of an infrastructure that houses the resources are also presented. The goal is to widen the dissemination of bimodal corpora for research valorisation and use in applications. This study also reports on the main operations of the web Platform that hosts the corpus and on the automatic conversion flows that bring the submitted files to the format accepted by the Platform.

2018

A Bird’s-eye View of Language Processing Projects at the Romanian Academy
Dan Tufiș | Dan Cristea
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017
Kalliopi Zervanou | Petya Osenova | Eveline Wandl-Vogt | Dan Cristea
Proceedings of the Workshop Knowledge Resources for the Socio-Economic Sciences and Humanities associated with RANLP 2017

2014

How Could Veins Speed Up The Process Of Discourse Parsing
Elena Mitocariu | Daniel Anechitei | Dan Cristea
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we propose a method of reducing the search space of a discourse parsing process without affecting its capacity to generate cohesive and coherent tree structures. The parsing method uses Veins Theory (VT): it incrementally develops a forest of parallel discourse trees, evaluates them on cohesion and coherence criteria, and keeps only the most promising structures at each step. The incremental development is constrained by two general principles, well known in discourse parsing: sequentiality of the terminal nodes and attachment restricted to the right frontier. A set of formulas rooted in VT helps to guess the most promising nodes of the right frontier where an attachment can be made, thus avoiding an exhaustive generation of the whole search space and at the same time maximizing the coherence of the discourse structures. We report good results from applying this approach, which represent a significant improvement in the discourse parsing process.

2012

Reconstructing the Diachronic Morphology of Romanian from Dictionary Citations
Dan Cristea | Radu Simionescu | Gabriela Haja
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This work represents a first step towards reconstructing a diachronic morphology for Romanian. The main resource used in this task is the digital version of the Romanian Language Dictionary (eDTLR). This resource offers various usage examples for its entries, citations extracted from popular Romanian texts, which often present diachronic and inflected forms of the word they are provided for. The concept of “word deformation” is introduced and classified into several categories. The research aims at detecting one type of such deformation occurring in the citations: changes only in the stem of the current word, without migration to another paradigm. An algorithm is presented which automatically infers old stem forms, using a paradigmatic data model of current Romanian morphology. Given the inferred roots and the paradigms that they belong to, old inflected forms of the words can be deduced. Moreover, by considering the years in which the citations were published, the inferred old word forms can be placed in specific periods of time, creating a valuable resource for research into the evolution of the Romanian language.

Harnessing NLP Techniques in the Processes of Multilingual Content Management
Anelia Belogay | Diman Karagyozov | Svetla Koeva | Cristina Vertan | Adam Przepiórkowski | Dan Cristea | Plovios Raxis
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

Sentimatrix – Multilingual Sentiment Analysis Service
Alexandru-Lucian Gînscă | Emanuela Boroș | Adrian Iftene | Diana Trandabăț | Mihai Toader | Marius Corîci | Cenel-Augusto Perez | Dan Cristea
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

2008

How to Evaluate and Raise the Quality in a Collaborative Lexicographic Approach
Dan Cristea | Corina Forăscu | Marius Răschip | Michael Zock
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper focuses on different aspects of the collaborative work used to create the electronic version of a paper dictionary, edited and printed by the Romanian Academy during the last century. To ensure accuracy in a reasonable amount of time, collaborative proofreading of the scanned material through an on-line interface has been initiated. The paper details the activities and the heuristics used to maximize accuracy and to evaluate the work of anonymous contributors with diverse backgrounds. Observing the behaviour of the enterprise over a period of 6 months allows us to estimate the feasibility of the approach until the end of the project.

Anaphora Resolution Exercise: an Overview
Constantin Orăsan | Dan Cristea | Ruslan Mitkov | António Branco
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Evaluation campaigns have become an established way to evaluate automatic systems which tackle the same task. This paper presents the first edition of the Anaphora Resolution Exercise (ARE) and the lessons learnt from it. This first edition focused only on English pronominal anaphora and NP coreference, and was organised as an exploratory exercise where various issues were investigated. ARE proposed four different tasks: pronominal anaphora resolution and NP coreference resolution on a predefined set of entities, and pronominal anaphora resolution and NP coreference resolution on raw texts. For each of these tasks, different inputs and evaluation metrics were prepared. This paper presents the four tasks, their input data and the evaluation metrics used. Even though a large number of researchers in the field expressed interest in participating, only three institutions took part in the formal evaluation. The paper briefly presents their results but does not try to interpret them, because in this edition of ARE our aim was not to find out why certain methods are better, but to prepare the ground for a fully-fledged edition.

2006

Transferring Coreference Chains through Word Alignment
Oana Postolache | Dan Cristea | Constantin Orasan
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper investigates the problem of automatically annotating resources with NP coreference information. It uses an English-Romanian parallel corpus to transfer coreference chains, through word alignment, from the English part to the Romanian part of the corpus. The results show that we can detect Romanian referential expressions and coreference chains with over 80% F-measure. Using our method as a preprocessing step, followed by manual correction, is therefore worthwhile as part of an annotation effort to create a large Romanian corpus with coreference information.

Temporality in relation with discourse structure
Corina Forăscu | Ionuț Cristian Pistol | Dan Cristea
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Temporal relations between events and times are often difficult, time-consuming and expensive to discover. In this paper a corpus study is performed to derive a strong relation between discourse structure, as revealed by Veins theory, and the temporal links between entities, as addressed in the TimeML annotation standard. The data interpretation helps us gain insight into how Veins theory can improve the manual and even (semi-)automatic detection of temporal relations.

2002

The Use of Referential Constraints in Structuring Discourse
Violeta Seretan | Dan Cristea
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

AR-Engine - a framework for unrestricted co-reference resolution
Dan Cristea | Oana-Diana Postolache | Gabriela-Eugenia Dima | Cătălina Barbu
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

2000

An Empirical Investigation of the Relation Between Discourse Structure and Co-Reference
Dan Cristea | Nancy Ide | Daniel Marcu | Valentin Tablan
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

A Hierarchical Account of Referential Accessibility
Nancy Ide | Dan Cristea
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

1999

Discourse Structure and Co-Reference: An Empirical Study
Dan Cristea | Nancy Ide | Daniel Marcu | Valentin Tablan
The Relation of Discourse/Dialogue Structure and Reference

1998

Veins Theory: A Model of Global Discourse Cohesion and Coherence
Dan Cristea | Nancy Ide | Laurent Romary
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Veins Theory: A Model of Global Discourse Cohesion and Coherence
Dan Cristea | Nancy Ide | Laurent Romary
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

1997

Expectations in Incremental Discourse Processing
Dan Cristea | Bonnie Webber
35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics