Mike Izbicki


pdf bib
Aligning Word Vectors on Low-Resource Languages with Wiktionary
Mike Izbicki
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

Aligned word embeddings have become a popular technique for low-resource natural language processing. Most existing evaluation datasets are generated automatically from machine translations systems, so they have many errors and exist only for high-resource languages. We introduce the Wiktionary bilingual lexicon collection, which provides high-quality human annotated translations for words in 298 languages to English. We use these lexicons to train and evaluate the largest published collection of aligned word embeddings on 157 different languages. All of our code and data is publicly available at https://github.com/mikeizbicki/wiktionary_bli.


pdf bib
Multilingual Emoticon Prediction of Tweets about COVID-19
Stefanos Stoikos | Mike Izbicki
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

Emojis are a widely used tool for encoding emotional content in informal messages such as tweets,and predicting which emoji corresponds to a piece of text can be used as a proxy for measuring the emotional content in the text. This paper presents the first model for predicting emojis in highly multilingual text.Our BERTmoticon model is a fine-tuned version of the BERT model,and it can predict emojis for text written in 102 different languages.We trained our BERTmoticon model on 54.2 million geolocated tweets sent in the first 6 months of 2020,and we apply the model to a case study analyzing the emotional reaction of Twitter users to news about the coronavirus. Example findings include a spike in sadness when the World Health Organization (WHO) declared that coronavirus was a global pandemic, and a spike in anger and disgust when the number of COVID-19 related deaths in the United States surpassed one hundred thousand. We provide an easy-to-use and open source python library for predicting emojis with BERTmoticon so that the model can easily be applied to other data mining tasks.

pdf bib
Evaluating Word Embeddings on Low-Resource Languages
Nathan Stringham | Mike Izbicki
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a 41 million token dataset, and the smallest (Old Gujarati) has only a 1813 token dataset.