Karel Kučera
2014
Corpus of 19th-century Czech Texts: Problems and Solutions
Karel Kučera
|
Martin Stluka
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Although the Czech language of the 19th century represents the roots of modern Czech and many features of the 20th- and 21st-century language cannot be properly understood without this historical background, the 19th-century Czech has not been thoroughly and consistently researched so far. The long-term project of a corpus of 19th-century Czech printed texts, currently in its third year, is intended to stimulate the research as well as to provide a firm material basis for it. The reason why, in our opinion, the project is worth mentioning is that it is faced with an unusual concentration of problems following mostly from the fact that the 19th century was arguably the most tumultuous period in the history of Czech, as well as from the fact that Czech is a highly inflectional language with a long history of sound changes, orthography reforms and rather discontinuous development of its vocabulary. The paper will briefly characterize the general background of the problems and present the reasoning behind the solutions that have been implemented in the ongoing project.