Idan Szpektor


2023

pdf bib
On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method
Zorik Gekhman | Nadav Oved | Orgad Keller | Idan Szpektor | Roi Reichart
Transactions of the Association for Computational Linguistics, Volume 11

Most work on modeling the conversation history in Conversational Question Answering (CQA) reports a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g., from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy to plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness to the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.1

pdf bib
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
Zorik Gekhman | Jonathan Herzig | Roee Aharoni | Chen Elkind | Idan Szpektor
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using a LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data, substantially outperforms both the state-of-the-art model with similar capacity, and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain-shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large scale synthetic dataset (1.4M examples), generated using TrueTeacher, and a checkpoint trained on this data.

pdf bib
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Paul Roit | Johan Ferret | Lior Shani | Roee Aharoni | Geoffrey Cideron | Robert Dadashi | Matthieu Geist | Sertan Girgin | Leonard Hussenot | Orgad Keller | Nikola Momchev | Sabela Ramos Garea | Piotr Stanczyk | Nino Vieillard | Olivier Bachem | Gal Elidan | Avinatan Hassidim | Olivier Pietquin | Idan Szpektor
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual-entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience and conciseness of the generated summaries.

pdf bib
DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering
Ella Neeman | Roee Aharoni | Or Honovich | Leshem Choshen | Idan Szpektor | Omri Abend
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Question answering models commonly have access to two sources of “knowledge” during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA models as it is unclear whether the answer stems from the given non-parametric knowledge or not. This unclarity has implications on issues of trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.

pdf bib
Multilingual Sequence-to-Sequence Models for Hebrew NLP
Matan Eyal | Hila Noga | Roee Aharoni | Idan Szpektor | Reut Tsarfaty
Findings of the Association for Computational Linguistics: ACL 2023

Recent work attributes progress in NLP to large language models (LMs) with increased model size and large quantities of pretraining data. Despite this, current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages. Additionally, previous work on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only architecture is beneficial for classification tasks, it does not cater well for sub-word prediction tasks, such as Named Entity Recognition, when considering the morphologically rich nature of Hebrew. In this paper we argue that sequence-to-sequence generative architectures are more suitable for large LMs in morphologically rich languages (MRLs) such as Hebrew. We demonstrate this by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, for which we can leverage powerful multilingual, pretrained sequence-to-sequence models as mT5, eliminating the need for a separate, specialized, morpheme-based, decoder. Using this approach, our experiments show substantial improvements over previously published results on all existing Hebrew NLP benchmarks. These results suggest that multilingual sequence-to-sequence models present a promising building block for NLP for MRLs.

pdf bib
MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo | Linting Xue | Michal Yarom | Ashish Thapliyal | Idan Szpektor | Julien Amelot | Xi Chen | Radu Soricut
Findings of the Association for Computational Linguistics: EMNLP 2023

Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both data and modeling fronts. We first propose a translation-based framework to mVQA data generation that requires much less human annotation efforts than the conventional approach of directly collection questions and answers. Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple, lightweight, and effective approach as well as benchmark state-of-the-art English and multilingual VQA models. We hope that our benchmark encourages further research on mVQA.

2022

pdf bib
All You May Need for VQA are Image Captions
Soravit Changpinyo | Doron Kukliansy | Idan Szpektor | Xi Chen | Nan Ding | Radu Soricut
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high-quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that lacks in the same model trained on human-annotated VQA data.

pdf bib
TRUE: Re-evaluating Factual Consistency Evaluation
Or Honovich | Roee Aharoni | Jonathan Herzig | Hagai Taitelbaum | Doron Kukliansy | Vered Cohen | Thomas Scialom | Idan Szpektor | Avinatan Hassidim | Yossi Matias
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.

pdf bib
TRUE: Re-evaluating Factual Consistency Evaluation
Or Honovich | Roee Aharoni | Jonathan Herzig | Hagai Taitelbaum | Doron Kukliansy | Vered Cohen | Thomas Scialom | Idan Szpektor | Avinatan Hassidim | Yossi Matias
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better methods.

2021

pdf bib
Q2: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
Or Honovich | Leshem Choshen | Roee Aharoni | Ella Neeman | Idan Szpektor | Omri Abend
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted Q2, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of Q2 against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.

2020

pdf bib
Semantically Driven Sentence Fusion: Modeling and Evaluation
Eyal Ben-David | Orgad Keller | Eric Malmi | Idan Szpektor | Roi Reichart
Findings of the Association for Computational Linguistics: EMNLP 2020

Sentence fusion is the task of joining related sentences into coherent text. Current training and evaluation schemes for this task are based on single reference ground-truths and do not account for valid fusion variants. We show that this hinders models from robustly capturing the semantic relationship between input sentences. To alleviate this, we present an approach in which ground-truth solutions are automatically expanded into multiple references via curated equivalence classes of connective phrases. We apply this method to a large-scale dataset and use the augmented dataset for both model training and evaluation. To improve the learning of semantic representation using multiple references, we enrich the model with auxiliary discourse classification tasks under a multi-tasking framework. Our experiments highlight the improvements of our approach over state-of-the-art models.

2019

pdf bib
A Joint Named-Entity Recognizer for Heterogeneous Tag-sets Using a Tag Hierarchy
Genady Beryozkin | Yoel Drori | Oren Gilon | Tzvika Hartman | Idan Szpektor
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We study a variant of domain adaptation for named-entity recognition where multiple, heterogeneously tagged training sets are available. Furthermore, the test tag-set is not identical to any individual training tag-set. Yet, the relations between all tags are provided in a tag hierarchy, covering the test tags as a combination of training tags. This setting occurs when various datasets are created using different annotation schemes. This is also the case of extending a tag-set with a new tag by annotating only the new tag in a new dataset. We propose to use the given tag hierarchy to jointly learn a neural network that shares its tagging layer among all tag-sets. We compare this model to combining independent models and to a model based on the multitasking approach. Our experiments show the benefit of the tag-hierarchy model, especially when facing non-trivial consolidation of tag-sets.

pdf bib
DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion
Mor Geva | Eric Malmi | Idan Szpektor | Jonathan Berant
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically-generating fusion examples from raw text and present DiscoFuse, a large scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text, and decomposing the text into two independent sentences. We apply our approach on two document collections: Wikipedia and Sports articles, yielding 60 million fusion examples annotated with discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic as well as human evaluation. Finally, we conduct transfer learning experiments with WebSplit, a recent dataset for text simplification. We show that pretraining on DiscoFuse substantially improves performance on WebSplit when viewed as a sentence fusion task.

pdf bib
Audio De-identification - a New Entity Recognition Task
Ido Cohn | Itay Laish | Genady Beryozkin | Gang Li | Izhak Shafran | Idan Szpektor | Tzvika Hartman | Avinatan Hassidim | Yossi Matias
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

Named Entity Recognition (NER) has been mostly studied in the context of written text. Specifically, NER is an important step in de-identification (de-ID) of medical records, many of which are recorded conversations between a patient and a doctor. In such recordings, audio spans with personal information should be redacted, similar to the redaction of sensitive character spans in de-ID for written text. The application of NER in the context of audio de-identification has yet to be fully investigated. To this end, we define the task of audio de-ID, in which audio spans with entity mentions should be detected. We then present our pipeline for this task, which involves Automatic Speech Recognition (ASR), NER on the transcript text, and text-to-audio alignment. Finally, we introduce a novel metric for audio de-ID and a new evaluation benchmark consisting of a large labeled segment of the Switchboard and Fisher audio datasets and detail our pipeline’s results on it.

2016

pdf bib
Syntactic Parsing of Web Queries with Question Intent
Yuval Pinter | Roi Reichart | Idan Szpektor
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib
Probabilistic Modeling of Joint-context in Distributional Similarity
Oren Melamud | Ido Dagan | Jacob Goldberger | Idan Szpektor | Deniz Yuret
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

2013

pdf bib
Generating Synthetic Comparable Questions for News Articles
Oleg Rokhlenko | Idan Szpektor
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Two Level Model for Context Sensitive Inference Rules
Oren Melamud | Jonathan Berant | Ido Dagan | Jacob Goldberger | Idan Szpektor
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Using Lexical Expansion to Learn Inference Rules from Sparse Data
Oren Melamud | Ido Dagan | Jacob Goldberger | Idan Szpektor
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2012

pdf bib
Learning Verb Inference Rules from Linguistically-Motivated Evidence
Hila Weisman | Jonathan Berant | Idan Szpektor | Ido Dagan
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

pdf bib
Classification-based Contextual Preferences
Shachar Mirkin | Ido Dagan | Lili Kotlerman | Idan Szpektor
Proceedings of the TextInfer 2011 Workshop on Textual Entailment

2010

pdf bib
Generating Entailment Rules from FrameNet
Roni Ben Aharon | Idan Szpektor | Ido Dagan
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
Textual Entailment
Mark Sammons | Idan Szpektor | V.G.Vinod Vydiswaran
NAACL HLT 2010 Tutorial Abstracts

2009

pdf bib
Source-Language Entailment Modeling for Translating Unknown Terms
Shachar Mirkin | Lucia Specia | Nicola Cancedda | Ido Dagan | Marc Dymetman | Idan Szpektor
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Directional Distributional Similarity for Lexical Expansion
Lili Kotlerman | Ido Dagan | Idan Szpektor | Maayan Zhitomirsky-Geffet
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf bib
Augmenting WordNet-based Inference with Argument Mapping
Idan Szpektor | Ido Dagan
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)

2008

pdf bib
Contextual Preferences
Idan Szpektor | Ido Dagan | Roy Bar-Haim | Jacob Goldberger
Proceedings of ACL-08: HLT

pdf bib
Learning Entailment Rules for Unary Templates
Idan Szpektor | Ido Dagan
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

pdf bib
Cross Lingual and Semantic Retrieval for Cultural Heritage Appreciation
Idan Szpektor | Ido Dagan | Alon Lavie | Danny Shacham | Shuly Wintner
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007).

pdf bib
Semantic Inference at the Lexical-Syntactic Level for Textual Entailment Recognition
Roy Bar-Haim | Ido Dagan | Iddo Greental | Idan Szpektor | Moshe Friedman
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing

pdf bib
Instance-based Evaluation of Entailment Rule Acquisition
Idan Szpektor | Eyal Shnarch | Ido Dagan
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

pdf bib
Investigating a Generic Paraphrase-Based Approach for Relation Extraction
Lorenza Romano | Milen Kouylekov | Idan Szpektor | Ido Dagan | Alberto Lavelli
11th Conference of the European Chapter of the Association for Computational Linguistics

2005

pdf bib
Definition and Analysis of Intermediate Entailment Levels
Roy Bar-Haim | Idan Szpektor | Oren Glickman
Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment

2004

pdf bib
Scaling Web-based Acquisition of Entailment Relations
Idan Szpektor | Hristo Tanev | Ido Dagan | Bonaventura Coppola
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing