Avi Caciularu


2021

pdf bib
Denoising Word Embeddings by Averaging in a Shared Space
Avi Caciularu | Ido Dagan | Jacob Goldberger
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

We introduce a new approach for smoothing and improving the quality of word embeddings. We consider a method of fusing word embeddings that were trained on the same corpus but with different initializations. We project all the models to a shared vector space using an efficient implementation of the Generalized Procrustes Analysis (GPA) procedure, previously used in multilingual word translation. Our word representation demonstrates consistent improvements over the raw models as well as their simplistic average, on a range of tasks. As the new representations are more stable and reliable, there is a noticeable improvement in rare word evaluations.

pdf bib
iFacetSum: Coreference-based Interactive Faceted Summarization for Multi-Document Exploration
Eran Hirsch | Alon Eirew | Ori Shapira | Avi Caciularu | Arie Cattan | Ori Ernst | Ramakanth Pasunuru | Hadar Ronen | Mohit Bansal | Ido Dagan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce iFᴀᴄᴇᴛSᴜᴍ, a web application for exploring topical document collections. iFᴀᴄᴇᴛSᴜᴍ integrates interactive summarization together with faceted search, by providing a novel faceted navigation scheme that yields abstractive summaries for the user’s selections. This approach offers both a comprehensive overview as well as particular details regard-ing subtopics of choice. The facets are automatically produced based on cross-document coreference pipelines, rendering generic concepts, entities and statements surfacing in the source texts. We analyze the effectiveness of our application through small-scale user studies that suggest the usefulness of our tool.

pdf bib
Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference
Dvir Ginzburg | Itzik Malkiel | Oren Barkan | Avi Caciularu | Noam Koenigstein
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
CDLM: Cross-Document Language Modeling
Avi Caciularu | Arman Cohan | Iz Beltagy | Matthew Peters | Arie Cattan | Ido Dagan
Findings of the Association for Computational Linguistics: EMNLP 2021

We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input to predict masked tokens. We release CDLM (Cross-Document Language Model), a new general language model for multi-document setting that can be easily applied to downstream tasks. Our extensive analysis shows that both ideas are essential for the success of CDLM, and work in synergy to set new state-of-the-art results for several multi-text tasks.

pdf bib
On the Evolution of Word Order
Idan Rejwan | Avi Caciularu
Proceedings of the Student Research Workshop Associated with RANLP 2021

Most natural languages have a predominant or fixed word order. For example in English the word order is usually Subject-Verb-Object. This work attempts to explain this phenomenon as well as other typological findings regarding word order from a functional perspective. In particular, we examine whether fixed word order provides a functional advantage, explaining why these languages are prevalent. To this end, we consider an evolutionary model of language and demonstrate, both theoretically and using genetic algorithms, that a language with a fixed word order is optimal. We also show that adding information to the sentence, such as case markers and noun-verb distinction, reduces the need for fixed word order, in accordance with the typological findings.

2020

pdf bib
Bayesian Hierarchical Words Representation Learning
Oren Barkan | Idan Rejwan | Avi Caciularu | Noam Koenigstein
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper presents the Bayesian Hierarchical Words Representation (BHWR) learning algorithm. BHWR facilitates Variational Bayes word representation learning combined with semantic taxonomy modeling via hierarchical priors. By propagating relevant information between related words, BHWR utilizes the taxonomy to improve the quality of such representations. Evaluation of several linguistic datasets demonstrates the advantages of BHWR over suitable alternatives that facilitate Bayesian modeling with or without semantic priors. Finally, we further show that BHWR produces better representations for rare words.

pdf bib
Within-Between Lexical Relation Classification
Oren Barkan | Avi Caciularu | Ido Dagan
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose the novel Within-Between Relation model for recognizing lexical-semantic relations between words. Our model integrates relational and distributional signals, forming an effective sub-space representation for each relation. We show that the proposed model is competitive and outperforms other baselines, across various benchmarks.

pdf bib
RecoBERT: A Catalog Language Model for Text-Based Recommendations
Itzik Malkiel | Oren Barkan | Avi Caciularu | Noam Razin | Ori Katz | Noam Koenigstein
Findings of the Association for Computational Linguistics: EMNLP 2020

Language models that utilize extensive self-supervised pre-training from unlabeled text, have recently shown to significantly advance the state-of-the-art performance in a variety of language understanding tasks. However, it is yet unclear if and how these recent models can be harnessed for conducting text-based recommendations. In this work, we introduce RecoBERT, a BERT-based approach for learning catalog-specialized language models for text-based item recommendations. We suggest novel training and inference procedures for scoring similarities between pairs of items, that don’t require item similarity labels. Both the training and the inference techniques were designed to utilize the unlabeled structure of textual catalogs, and minimize the discrepancy between them. By incorporating four scores during inference, RecoBERT can infer text-based item-to-item similarities more accurately than other techniques. In addition, we introduce a new language understanding task for wine recommendations using similarities based on professional wine reviews. As an additional contribution, we publish annotated recommendations dataset crafted by human wine experts. Finally, we evaluate RecoBERT and compare it to various state-of-the-art NLP models on wine and fashion recommendations tasks.

pdf bib
Paraphrasing vs Coreferring: Two Sides of the Same Coin
Yehudit Meged | Avi Caciularu | Vered Shwartz | Ido Dagan
Findings of the Association for Computational Linguistics: EMNLP 2020

We study the potential synergy between two different NLP tasks, both confronting predicate lexical variability: identifying predicate paraphrases, and event coreference resolution. First, we used annotations from an event coreference dataset as distant supervision to re-score heuristically-extracted predicate paraphrases. The new scoring gained more than 18 points in average precision upon their ranking by the original scoring method. Then, we used the same re-ranking features as additional inputs to a state-of-the-art event coreference resolution model, which yielded modest but consistent improvements to the model’s performance. The results suggest a promising direction to leverage data and models for each of the tasks to the benefit of the other.