Toward a Comparable Corpus of Latvian, Russian and English Tweets

Dmitrijs Milajevs


Abstract
Twitter has become a rich source of linguistic data. This paper investigates the possibility of building a trilingual Latvian-Russian-English corpus of tweets from Riga, Latvia. Such a corpus, once constructed, could serve multiple purposes, including training machine translation models, examining cross-lingual phenomena and studying the population of Riga. By collecting and analysing a pilot corpus, this study shows that building such a resource is feasible. The pilot corpus is made publicly available and can be used to construct a large comparable corpus.
Anthology ID:
W17-2505
Volume:
Proceedings of the 10th Workshop on Building and Using Comparable Corpora
Month:
August
Year:
2017
Address:
Vancouver, Canada
Editors:
Serge Sharoff, Pierre Zweigenbaum, Reinhard Rapp
Venue:
BUCC
Publisher:
Association for Computational Linguistics
Pages:
26–30
URL:
https://aclanthology.org/W17-2505
DOI:
10.18653/v1/W17-2505
Cite (ACL):
Dmitrijs Milajevs. 2017. Toward a Comparable Corpus of Latvian, Russian and English Tweets. In Proceedings of the 10th Workshop on Building and Using Comparable Corpora, pages 26–30, Vancouver, Canada. Association for Computational Linguistics.
Cite (Informal):
Toward a Comparable Corpus of Latvian, Russian and English Tweets (Milajevs, BUCC 2017)
PDF:
https://aclanthology.org/W17-2505.pdf