Mike Izbicki
2023
DocSplit: Simple Contrastive Pretraining for Large Document Embeddings
Yujie Wang | Mike Izbicki
Findings of the Association for Computational Linguistics: EMNLP 2023
Existing model pretraining methods only consider local information. For example, in the popular token masking strategy, the words closer to the masked token are more important for prediction than words far away. This results in pretrained models that generate high-quality sentence embeddings, but low-quality embeddings for large documents. We propose a new pretraining method called DocSplit which forces models to consider the entire global context of a large document. Our method uses a contrastive loss where the positive examples are randomly sampled sections of the input document, and negative examples are randomly sampled sections of unrelated documents. Like previous pretraining methods, DocSplit is fully unsupervised, easy to implement, and can be used to pretrain any model architecture. Our experiments show that DocSplit outperforms other pretraining methods for document classification, few-shot learning, and information retrieval tasks.
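The sampling scheme described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only says "a contrastive loss", so the InfoNCE form, the cosine similarity, the temperature, the use of a sampled section as the anchor, and all function names (`sample_section`, `make_example`, `info_nce`) are assumptions chosen for illustration.

```python
import math
import random


def sample_section(tokens, length, rng):
    """Return a random contiguous span of `length` tokens (the whole document if it is shorter)."""
    if len(tokens) <= length:
        return list(tokens)
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]


def make_example(docs, anchor_idx, length, num_negatives, rng):
    """Build one training example.

    The anchor and the positive are sections of the same document; the
    negatives are sections of other documents. Using a sampled section as
    the anchor is an assumption. Requires num_negatives <= len(docs) - 1.
    """
    anchor = sample_section(docs[anchor_idx], length, rng)
    positive = sample_section(docs[anchor_idx], length, rng)
    others = [i for i in range(len(docs)) if i != anchor_idx]
    negatives = [sample_section(docs[i], length, rng)
                 for i in rng.sample(others, num_negatives)]
    return anchor, positive, negatives


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor_vec, positive_vec, negative_vecs, temperature=0.1):
    """InfoNCE-style loss (assumed form): small when the anchor is closer
    to the positive than to every negative."""
    logits = [cosine(anchor_vec, positive_vec) / temperature]
    logits += [cosine(anchor_vec, n) / temperature for n in negative_vecs]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

In practice the section embeddings passed to `info_nce` would come from the model being pretrained; here they are plain lists of floats so that the sketch stays self-contained.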
2022
Aligning Word Vectors on Low-Resource Languages with Wiktionary
Mike Izbicki
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)
Aligned word embeddings have become a popular technique for low-resource natural language processing. Most existing evaluation datasets are generated automatically from machine translation systems, so they have many errors and exist only for high-resource languages. We introduce the Wiktionary bilingual lexicon collection, which provides high-quality human-annotated translations for words in 298 languages to English. We use these lexicons to train and evaluate the largest published collection of aligned word embeddings on 157 different languages. All of our code and data are publicly available at https://github.com/mikeizbicki/wiktionary_bli.
2020
Multilingual Emoticon Prediction of Tweets about COVID-19
Stefanos Stoikos | Mike Izbicki
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media
Emojis are a widely used tool for encoding emotional content in informal messages such as tweets, and predicting which emoji corresponds to a piece of text can be used as a proxy for measuring the emotional content in the text. This paper presents the first model for predicting emojis in highly multilingual text. Our BERTmoticon model is a fine-tuned version of the BERT model, and it can predict emojis for text written in 102 different languages. We trained our BERTmoticon model on 54.2 million geolocated tweets sent in the first 6 months of 2020, and we apply the model to a case study analyzing the emotional reaction of Twitter users to news about the coronavirus. Example findings include a spike in sadness when the World Health Organization (WHO) declared that coronavirus was a global pandemic, and a spike in anger and disgust when the number of COVID-19 related deaths in the United States surpassed one hundred thousand. We provide an easy-to-use and open source Python library for predicting emojis with BERTmoticon so that the model can easily be applied to other data mining tasks.
Evaluating Word Embeddings on Low-Resource Languages
Nathan Stringham | Mike Izbicki
Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and for word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a dataset of 41 million tokens, and the smallest (Old Gujarati) has a dataset of only 1813 tokens.