Stylometry in a Bilingual Setup

Silvie Cinková; Jan Rybicki

Abstract

Stylometry based on the most frequent words does not allow direct comparison of original texts and their translations, i.e. comparison across languages. For instance, in a bilingual Czech-German text collection containing parallel texts (originals and translations in both directions, along with Czech and German translations from other languages), authors would not cluster across languages, since the word frequency lists of any two Czech texts are bound to be more similar to each other than to that of a German text, and vice versa. We have tried to devise an interlingua that removes the language-specific features and, where possible, keeps the language-independent features of the individual author signal, if they exist. We tagged, lemmatized, and parsed each language counterpart with the corresponding language model in UDPipe, which provides linguistic markup that is cross-lingual to a significant extent. We stripped the output of language-dependent items, but that alone did not help much. As a next step, we transformed the lemmas of both language counterparts into shared pseudolemmas based on a very crude Czech-German glossary, with 95.6% success. We show that, for stylometric methods based on the most frequent words, we can do without translations.
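The pseudolemma step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy glossary, the pseudolemma labels (`P_AND`, `P_BE`, `P_DOG`), the `UNK` placeholder for unmapped lemmas, and the helper names `to_pseudolemmas` and `relative_frequencies` are hypothetical and are not taken from the paper, whose actual glossary and pipeline are not specified here. The input lemma lists stand in for the lemma column that a lemmatizer such as UDPipe would produce.

```python
from collections import Counter

# Hypothetical toy glossary: (language, lemma) -> shared pseudolemma label.
# The paper's real Czech-German glossary is not reproduced here.
TOY_GLOSSARY = {
    ("cs", "a"): "P_AND",
    ("de", "und"): "P_AND",
    ("cs", "být"): "P_BE",
    ("de", "sein"): "P_BE",
    ("cs", "pes"): "P_DOG",
    ("de", "Hund"): "P_DOG",
}


def to_pseudolemmas(lemmas, lang, glossary):
    """Map each lemma to its shared pseudolemma; unmapped lemmas become 'UNK'."""
    return [glossary.get((lang, lemma), "UNK") for lemma in lemmas]


def relative_frequencies(tokens, vocabulary):
    """Relative frequency of each vocabulary item in the token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[item] / total if total else 0.0 for item in vocabulary]


# Assumed lemmatizer output for a Czech text and its German counterpart.
cs_lemmas = ["pes", "být", "a", "pes", "kočka"]
de_lemmas = ["Hund", "sein", "und", "Hund", "Katze"]

cs_pseudo = to_pseudolemmas(cs_lemmas, "cs", TOY_GLOSSARY)
de_pseudo = to_pseudolemmas(de_lemmas, "de", TOY_GLOSSARY)

# Frequency profiles over a shared vocabulary of pseudolemmas.
vocabulary = ["P_DOG", "P_BE", "P_AND"]
cs_profile = relative_frequencies(cs_pseudo, vocabulary)
de_profile = relative_frequencies(de_pseudo, vocabulary)
```

Once both languages are expressed in the same pseudolemma vocabulary, their frequency profiles live in one feature space, so a most-frequent-words distance measure can compare a Czech text with a German one directly.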

Anthology ID: 2020.lrec-1.123
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 977–984
Language: English
URL: https://aclanthology.org/2020.lrec-1.123/
Cite (ACL): Silvie Cinková and Jan Rybicki. 2020. Stylometry in a Bilingual Setup. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 977–984, Marseille, France. European Language Resources Association.
Cite (Informal): Stylometry in a Bilingual Setup (Cinková & Rybicki, LREC 2020)
PDF: https://aclanthology.org/2020.lrec-1.123.pdf
