Oren Melamud

2019

Combining Unsupervised Pre-training and Annotator Rationales to Improve Low-shot Text Classification
Oren Melamud | Mihaela Bornea | Ken Barker
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Supervised learning models often perform poorly at low-shot tasks, i.e. tasks for which little labeled data is available for training. One prominent approach for improving low-shot learning is to use unsupervised pre-trained neural models. Another approach is to obtain richer supervision by collecting annotator rationales (explanations supporting label annotations). In this work, we combine these two approaches to improve low-shot text classification with two novel methods: a simple bag-of-words embedding approach; and a more complex context-aware method, based on the BERT model. In experiments with two English text classification datasets, we demonstrate substantial performance gains from combining pre-training with rationales. Furthermore, our investigation of a range of train-set sizes reveals that the simple bag-of-words approach is the clear top performer when there are only a few dozen training instances or fewer, while more complex models, such as BERT or CNN, require more training data to shine.

Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models
Oren Melamud | Chaitanya Shivade
Proceedings of the 2nd Clinical Natural Language Processing Workshop

Large-scale clinical data is invaluable to driving many computational scientific advances today. However, understandable concerns regarding patient privacy hinder the open dissemination of such data and give rise to suboptimal siloed research. De-identification methods attempt to address these concerns but were shown to be susceptible to adversarial attacks. In this work, we focus on the vast amounts of unstructured natural language data stored in clinical notes and propose to automatically generate synthetic clinical notes that are more amenable to sharing using generative models trained on real de-identified records. To evaluate the merit of such notes, we measure both their privacy preservation properties and their utility in training clinical NLP models. Experiments using neural language models yield notes whose utility is close to that of the real ones in some clinical NLP tasks, yet leave ample room for future improvements.

2018

Self-Normalization Properties of Language Modeling
Jacob Goldberger | Oren Melamud
Proceedings of the 27th International Conference on Computational Linguistics

Self-normalizing discriminative models approximate the normalized probability of a class without having to compute the partition function. In the context of language modeling, this property is particularly appealing as it may significantly reduce run-times due to large word vocabularies. In this study, we provide a comprehensive investigation of language modeling self-normalization. First, we theoretically analyze the inherent self-normalization properties of Noise Contrastive Estimation (NCE) language models. Then, we compare them empirically to softmax-based approaches, which are self-normalized using explicit regularization, and suggest a hybrid model with compelling properties. Finally, we uncover a surprising negative correlation between self-normalization and perplexity across the board, as well as some regularity in the observed errors, which may potentially be used for improving self-normalization algorithms in the future.

2017

Information-Theory Interpretation of the Skip-Gram Negative-Sampling Objective Function
Oren Melamud | Jacob Goldberger
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper we define a measure of dependency between two random variables, based on the Jensen-Shannon (JS) divergence between their joint distribution and the product of their marginal distributions. Then, we show that word2vec’s skip-gram with negative sampling embedding algorithm finds the optimal low-dimensional approximation of this JS dependency measure between the words and their contexts. The gap between the optimal score and the low-dimensional approximation is demonstrated on a standard text corpus.
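
As a rough illustration of the dependency measure described in this abstract, the sketch below computes the Jensen-Shannon divergence between a small joint distribution and the product of its marginals. It assumes the standard equal-weight JS divergence with natural logarithms; the paper's exact weighting may differ, and the function name `js_dependency` is hypothetical.

```python
import math


def js_dependency(joint):
    """JS divergence between a joint distribution and the product of its marginals.

    `joint` is a list of rows whose entries sum to 1.
    Assumption: equal-weight (1/2, 1/2) mixture, natural log.
    """
    px = [sum(row) for row in joint]          # marginal over rows
    py = [sum(col) for col in zip(*joint)]    # marginal over columns

    # Flatten the joint and the product of marginals in the same order.
    p = [joint[i][j] for i in range(len(px)) for j in range(len(py))]
    q = [px[i] * py[j] for i in range(len(px)) for j in range(len(py))]
    m = [0.5 * (a + b) for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL divergence.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


independent = [[0.12, 0.28], [0.18, 0.42]]   # joint equals product of marginals
dependent = [[0.5, 0.0], [0.0, 0.5]]         # X fully determines Y
print(js_dependency(independent), js_dependency(dependent))
```

Under this definition the measure is zero when the two variables are independent and is bounded above by log 2.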

A Simple Language Model based on PMI Matrix Approximations
Oren Melamud | Ido Dagan | Jacob Goldberger
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this study, we introduce a new approach for learning language models by training them to estimate word-context pointwise mutual information (PMI), and then deriving the desired conditional probabilities from PMI at test time. Specifically, we show that with minor modifications to word2vec’s algorithm, we get principled language models that are closely related to the well-established Noise Contrastive Estimation (NCE) based language models. A compelling aspect of our approach is that our models are trained with the same simple negative sampling objective function that is commonly used in word2vec to learn word embeddings.
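
The step of deriving conditional probabilities from PMI follows from the definition PMI(w, c) = log p(w, c) / (p(w) p(c)), which gives p(w | c) = p(w) exp(PMI(w, c)). The sketch below applies that identity to a toy vocabulary. The explicit renormalization and the function name `conditional_from_pmi` are assumptions made for illustration, not details taken from the paper.

```python
import math


def conditional_from_pmi(pmi_scores, unigram):
    """Turn PMI(w, c) scores for one context c into p(w | c).

    Uses p(w | c) = p(w) * exp(PMI(w, c)). The result is renormalized
    (an assumption) because estimated PMI values need not sum to 1.
    """
    scores = [p * math.exp(s) for p, s in zip(unigram, pmi_scores)]
    total = sum(scores)
    return [s / total for s in scores]


# Toy example: two words with unigram probabilities 0.5 each, and exact
# PMI values built from a known conditional distribution [0.2, 0.8].
unigram = [0.5, 0.5]
true_conditional = [0.2, 0.8]
pmi = [math.log(c / u) for c, u in zip(true_conditional, unigram)]
print(conditional_from_pmi(pmi, unigram))
```

With exact PMI values the known conditional distribution is recovered, up to floating-point error.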

2016

The Negochat Corpus of Human-agent Negotiation Dialogues
Vasily Konovalov | Ron Artstein | Oren Melamud | Ido Dagan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Annotated in-domain corpora are crucial to the successful development of dialogue systems of automated agents, and in particular for developing natural language understanding (NLU) components of such systems. Unfortunately, such important resources are scarce. In this work, we introduce an annotated natural language human-agent dialogue corpus in the negotiation domain. The corpus was collected using Amazon Mechanical Turk following the ‘Wizard-Of-Oz’ approach, where a ‘wizard’ human translates the participants’ natural language utterances in real time into a semantic language. Once dialogue collection was completed, utterances were annotated with intent labels by two independent annotators, achieving high inter-annotator agreement. Our initial experiments with an SVM classifier show that automatically inferring such labels from the utterances is far from trivial. We make our corpus publicly available to serve as an aid in the development of dialogue systems for negotiation agents, and suggest that analogous corpora can be created following our methodology and using our available source code. To the best of our knowledge this is the first publicly available negotiation dialogue corpus.

The Role of Context Types and Dimensionality in Learning Word Embeddings
Oren Melamud | David McClosky | Siddharth Patwardhan | Mohit Bansal
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Bundled Gap Filling: A New Paradigm for Unambiguous Cloze Exercises
Michael Wojatzki | Oren Melamud | Torsten Zesch
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications

context2vec: Learning Generic Context Embedding with Bidirectional LSTM
Oren Melamud | Jacob Goldberger | Ido Dagan
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning