Spoken Russian in the Russian National Corpus (RNC)

Elena Grishina


Abstract
The RNC is now a 120-million-word collection of Russian texts, which makes it the most representative and authoritative corpus of the Russian language. It is available on the Internet at www.ruscorpora.ru. The RNC contains texts of all genres and types, covering Russian from the 19th to the 21st century. The practice of constructing national corpora has shown that it is indispensable to include sub-corpora of spoken language in the RNC. The constructors of the RNC therefore intend to include about 10 million words of Spoken Russian. Oral speech in the Corpus is represented in standard Russian orthography. Although this decision makes phonetic exploration of the Spoken Russian Corpus impossible, Spoken Russian can still be studied from any other linguistic point of view. In addition to the traditional annotations (metatextual and morphological), the Spoken Sub-corpus carries sociological annotation. Unlike standard oral speech, which is spontaneous and is not intended to be reproduced, Multimedia Spoken Russian (MSR) is to a great extent premeditated and evidently meant to be reproduced. MSR is also to be included in the RNC: first of all, we plan to build a very interesting and provocative part of the RNC from the textual component of about 300 Russian films.
Anthology ID:
L06-1045
Volume:
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month:
May
Year:
2006
Address:
Genoa, Italy
Editors:
Nicoletta Calzolari, Khalid Choukri, Aldo Gangemi, Bente Maegaard, Joseph Mariani, Jan Odijk, Daniel Tapias
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2006/pdf/92_pdf.pdf
Cite (ACL):
Elena Grishina. 2006. Spoken Russian in the Russian National Corpus (RNC). In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal):
Spoken Russian in the Russian National Corpus (RNC) (Grishina, LREC 2006)