Guy Kushilevitz


2024

Class Balancing for Efficient Active Learning in Imbalanced Datasets
Yaron Fairstein | Oren Kalinsky | Zohar Karnin | Guy Kushilevitz | Alexander Libov | Sofia Tolmach
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)

Recent developments in active learning algorithms for NLP tasks show promising results in terms of reducing labelling complexity. In this paper we extend this effort to imbalanced datasets; we bridge between the active learning approach of obtaining diverse and informative examples, and the heuristic of class balancing used in imbalanced datasets. We develop a novel tune-free weighting technique that can be applied to various existing active learning algorithms, adding a component of class balancing. We compare several active learning algorithms to their modified versions on multiple public datasets and show that when the classes are imbalanced, with manual annotation effort remaining equal, the modified version significantly outperforms the original both in terms of the test metric and the number of obtained minority examples. Moreover, when the imbalance is mild or non-existent (classes are completely balanced), our technique does not harm the base algorithms.
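
As a rough illustration of the class-balancing idea, the sketch below reweights the acquisition scores of any base active learning strategy by the model's expected class rarity. The inverse-frequency weighting and the helper name balanced_scores are illustrative assumptions, not the paper's exact tune-free formula.

import numpy as np

def balanced_scores(acq_scores, class_probs, labeled_counts):
    """Reweight per-example acquisition scores to favor minority classes.

    acq_scores:     (n,) base scores from any AL strategy (e.g. uncertainty).
    class_probs:    (n, k) model-predicted class probabilities per example.
    labeled_counts: (k,) how many labeled examples each class already has.
    """
    # Inverse-frequency weight per class: rarer labeled classes get more weight.
    # (Assumed weighting; the paper's technique is tune-free but not spelled
    # out here.)
    class_weights = 1.0 / (labeled_counts + 1.0)          # +1 avoids div-by-zero
    class_weights = class_weights / class_weights.sum()   # normalize to sum to 1
    # Expected weight per example under the model's class distribution.
    example_weights = class_probs @ class_weights          # (n,)
    return acq_scores * example_weights

# Toy usage: 5 unlabeled examples, 2 classes, class 1 is the minority.
rng = np.random.default_rng(0)
scores = rng.random(5)                       # base acquisition scores
probs = rng.dirichlet([1.0, 1.0], size=5)    # predicted class probabilities
counts = np.array([90.0, 10.0])              # labeled pool is imbalanced
picked = np.argmax(balanced_scores(scores, probs, counts))
print(f"next example to label: {picked}")

Examples the model places in the under-labeled class get their scores boosted, so the strategy keeps its original ranking within a class while shifting the labeling budget toward minority examples.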

2023

Simple and Effective Multi-Token Completion from Masked Language Models
Oren Kalinsky | Guy Kushilevitz | Alexander Libov | Yoav Goldberg
Findings of the Association for Computational Linguistics: EACL 2023

Pre-trained neural masked language models are often used for predicting a replacement token for a given sequence position, in a cloze-like task. However, this usage is restricted to predicting a single token, from a relatively small pre-trained vocabulary. Recent Sequence2Sequence pre-trained LMs like T5 do allow predicting multi-token completions, but are more expensive to train and run. We show that pre-trained masked language models can be adapted to produce multi-token completions, with only a modest addition to their parameter count. We propose two simple adaptation approaches, trading parameter counts for accuracy. The first method generates multi-token completions from a conditioned RNN. It has a very low parameter count and achieves competitive results. The second method is even simpler: it adds items corresponding to multi-token units to the output prediction matrix. While being higher in parameter count than the RNN method, it also surpasses current state-of-the-art multi-token completion models, including T5-3B, while being significantly more parameter efficient. We demonstrate that our approach is flexible across different vocabularies and domains and can effectively leverage existing pre-trained models available in different domains. Finally, a human evaluation further validates our results and shows that our solution regularly provides valid completions, as well as reasonable correctness for factual-sentence completions.
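
A hedged sketch of the second, "extended output matrix" idea: give each multi-token phrase its own row in the output prediction matrix and score it at the [MASK] position. Building phrase rows by mean-pooling subword input embeddings, and scoring the raw hidden state directly, are simplifying assumptions made for brevity; the paper learns these representations rather than deriving them this way.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

phrases = ["new york city", "machine learning", "natural language processing"]
emb = mlm.get_input_embeddings().weight          # (vocab_size, hidden)

# One output row per multi-token phrase (assumption: mean-pooled subword
# embeddings stand in for the learned phrase representations).
phrase_rows = torch.stack([
    emb[tok(p, add_special_tokens=False)["input_ids"]].mean(dim=0)
    for p in phrases
])                                               # (n_phrases, hidden)

text = f"I love living in {tok.mask_token}."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = mlm(**inputs, output_hidden_states=True)
hidden = out.hidden_states[-1]                   # (1, seq_len, hidden)

mask_pos = (inputs["input_ids"] == tok.mask_token_id).nonzero()[0, 1]
h = hidden[0, mask_pos]                          # hidden state at [MASK]

# Score each phrase as a dot product against its extended output row,
# exactly as single-token logits are computed against vocabulary rows.
scores = phrase_rows @ h
print(phrases[scores.argmax().item()])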

2020

A Two-Stage Masked LM Method for Term Set Expansion
Guy Kushilevitz | Shaul Markovitch | Yoav Goldberg
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We tackle the task of Term Set Expansion (TSE): given a small seed set of example terms from a semantic class, find more members of that class. The task is of great practical utility, and also of theoretical utility as it requires generalization from few examples. Previous approaches to the TSE task can be characterized as either distributional or pattern-based. We harness the power of neural masked language models (MLM) and propose a novel TSE algorithm, which combines the pattern-based and distributional approaches. Due to the small size of the seed set, fine-tuning methods are not effective, calling for more creative use of the MLM. The gist of the idea is to use the MLM to first mine for informative patterns with respect to the seed set, and then to obtain more members of the seed class by generalizing these patterns. Our method outperforms state-of-the-art TSE algorithms. Implementation is available at: https://github.com/guykush/TermSetExpansion-MPB/
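
To illustrate the pattern-generalization step, the sketch below plugs each seed term into a masked pattern and aggregates the MLM's fill-in predictions across seeds. The single hand-written pattern is an assumption for the example; the actual method first mines informative patterns from a corpus with respect to the seed set (see the repository linked above).

from collections import Counter
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

seeds = ["paris", "london", "berlin"]          # lowercase for the uncased model
pattern = "cities such as {seed} and [MASK] ."  # assumed pattern, normally mined

votes = Counter()
for seed in seeds:
    for pred in fill(pattern.format(seed=seed), top_k=20):
        token = pred["token_str"].strip()
        if token not in seeds:
            votes[token] += pred["score"]       # accumulate score across seeds

# Candidates supported by several seed slots rank highest.
print(votes.most_common(5))

Summing prediction scores across seed instantiations is what generalizes a pattern: a candidate term that fits the pattern for many seeds is likely a member of the same semantic class, while terms that fit only one seed are filtered down the ranking.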