Idan Szpektor - ACL Anthology

Idan Szpektor

2026

Localizing Factual Inconsistencies in Attributable Text Generation
Arie Cattan | Paul Roit | Shiyue Zhang | David Wan | Roee Aharoni | Idan Szpektor | Mohit Bansal | Ido Dagan
Transactions of the Association for Computational Linguistics, Volume 14

There has been an increasing interest in detecting hallucinations in model-generated texts, both manually and automatically, at varying levels of granularity. However, most existing methods fail to precisely pinpoint the errors. In this work, we introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation, at a fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics, we propose decomposing the generated text into minimal predicate-argument level propositions, expressed as simple question-answer (QA) pairs, and assess whether each individual QA pair is supported by a trusted reference text. As each QA pair corresponds to a single semantic relation between a predicate and an argument, QASemConsistency effectively localizes the unsupported information. We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation, by collecting crowdsourced annotations of granular consistency errors, while achieving a substantial inter-annotator agreement. This benchmark includes more than 3K instances spanning various tasks of attributable text generation. We also show that QASemConsistency yields factual consistency scores that correlate well with human judgments. Finally, we implement several methods for automatically detecting localized factual inconsistencies, with both supervised entailment models and LLMs.
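As a rough illustration of the localization idea in this abstract, the sketch below treats a generated sentence as a list of QA pairs that have already been judged against a reference, then reports which pairs are unsupported. The `QAPair` type, the example pairs, and the use of the supported fraction as a score are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    supported: bool  # hypothetical label: is this pair backed by the reference text?


def localize(pairs):
    """Return (score, unsupported), where score is the fraction of supported pairs."""
    if not pairs:
        return 1.0, []
    unsupported = [p for p in pairs if not p.supported]
    score = 1 - len(unsupported) / len(pairs)
    return score, unsupported


# Toy decomposition of "Alice founded the company in Paris in 1999."
pairs = [
    QAPair("Who founded something?", "Alice", True),
    QAPair("What did someone found?", "the company", True),
    QAPair("Where did someone found something?", "in Paris", False),
    QAPair("When did someone found something?", "in 1999", True),
]
score, unsupported = localize(pairs)
```

Because each pair covers one predicate-argument relation, the single unsupported pair points at the location argument rather than flagging the whole sentence.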

2025

MDCure: A Scalable Pipeline for Multi-Document Instruction-Following
Gabrielle Kaili-May Liu | Bowen Shi | Avi Caciularu | Idan Szpektor | Arman Cohan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-document (MD) processing is crucial for LLMs to handle real-world tasks such as summarization and question-answering across large sets of documents. While LLMs have improved at processing long inputs, MD contexts still present unique difficulties, including management of inter-document dependencies, redundancy, and incoherent structures. To address this challenge, we introduce MDCure, a scalable and effective instruction data generation framework to enhance the MD capabilities of LLMs without the computational cost of pre-training or reliance on human-annotated data. MDCure generates high-quality synthetic MD instruction data over sets of articles via targeted prompts. We also introduce MDCureRM, a cost-effective, MD-specific reward model to score and filter generated data based on their training utility for MD settings. MDCure is compatible with open- and closed-source models in addition to policy optimization methods such as PPO, enabling even small open-source models to surpass proprietary LLMs as strong generators of high-quality MD instruction data without further data filtering. With MDCure, we fine-tune a wide variety of LLMs up to 70B parameters in size from the FlanT5, Qwen2, and LLAMA3.1 model families. Extensive evaluations on a wide range of MD and long-context benchmarks spanning various tasks and domains show MDCure consistently improves performance over pre-trained baselines and base models by up to 75.1%.

Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs
Maxim Ifergan | Leshem Choshen | Roee Aharoni | Idan Szpektor | Omri Abend
Findings of the Association for Computational Linguistics: NAACL 2025

The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model’s ability to answer a query consistently across languages, and the ability to "store" answers in a shared representation for several languages. We propose a methodology to measure the extent of representation sharing across languages by repurposing knowledge editing methods. We examine LLMs with various multilingual configurations using a new multilingual dataset. We reveal that high consistency does not necessarily imply shared representation, particularly for languages with different scripts. Moreover, we find that script similarity is a dominant factor in representation sharing. Finally, we observe that if LLMs could fully share knowledge across languages, their accuracy in their best-performing language could increase by up to 150% on average. These findings highlight the need for improved multilingual knowledge representation in LLMs and suggest a path for the development of more robust and consistent multilingual LLMs.

DoubleDipper: Recycling Contexts for Efficient and Attributed In-Context Learning
Arie Cattan | Alon Jacovi | Alex Fabrikant | Jonathan Herzig | Roee Aharoni | Hannah Rashkin | Dror Marcus | Avinatan Hassidim | Yossi Matias | Idan Szpektor | Avi Caciularu
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

In this work, we propose DoubleDipper, a novel In-Context-Learning method that automatically generates few-shot examples for several QA tasks by _recycling_ contexts. Specifically, given an input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to _explicitly_ identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+16 absolute points on average across models) on various QA datasets. Surprisingly, despite introducing only single-hop ICL examples, LLMs successfully generalize to multi-hop QA using our approach.
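The prompt layout described in this abstract can be sketched as follows: the context appears once, the demonstrations generated from it come next (each naming its supporting paragraphs before the answer), and the target query comes last. The function name `build_prompt`, the `[i]` paragraph labels, and the `Question`/`Evidence`/`Answer` field names are hypothetical choices for this sketch, not the format used in the paper.

```python
def build_prompt(paragraphs, demos, query):
    """Assemble a prompt that states the context once, then demos, then the query.

    paragraphs: list of context paragraph strings
    demos: list of (question, evidence_paragraph_numbers, answer) tuples,
           assumed to have been generated from the same context
    query: the target question
    """
    lines = [f"[{i}] {p}" for i, p in enumerate(paragraphs, start=1)]
    lines.append("")
    for question, evidence, answer in demos:
        lines.append(f"Question: {question}")
        # Each demonstration names its supporting paragraphs before the answer.
        lines.append("Evidence: " + ", ".join(f"[{i}]" for i in evidence))
        lines.append(f"Answer: {answer}")
        lines.append("")
    lines.append(f"Question: {query}")
    lines.append("Evidence:")
    return "\n".join(lines)


paragraphs = [
    "The Eiffel Tower was completed in 1889.",
    "It stands on the Champ de Mars in Paris.",
]
demos = [("When was the Eiffel Tower completed?", [1], "1889")]
prompt = build_prompt(paragraphs, demos, "Where does the Eiffel Tower stand?")
```

Since the demonstrations reuse the context already in the prompt, each one adds only a question, an evidence line, and an answer.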

Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
Moran Yanuka | Assaf Ben-Kish | Yonatan Bitton | Idan Szpektor | Raja Giryes
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent research increasingly focuses on training vision-language models (VLMs) with long, detailed image captions. However, small-scale VLMs often struggle to balance the richness of these captions with the risk of hallucinating content during fine-tuning. In this paper, we explore how well VLMs adapt to such captions. To quantify caption quality, we propose Decomposed NLI (DNLI), an evaluation framework that breaks down generated captions into individual propositions, assessing each in isolation. This fine-grained analysis reveals a critical balance between capturing descriptive details and preventing hallucinations. Our findings show that simply reducing caption complexity or employing standard data curation techniques does not effectively resolve this issue. To tackle this challenge, we introduce Knowledge Adapted (KnowAda) fine-tuning, a data-centric approach that automatically adapts training data with the model’s existing knowledge and visual understanding. KnowAda minimizes hallucinations while preserving high descriptiveness. We validate this approach across several small-scale VLMs (up to 7B parameters) and dense caption datasets, demonstrating that KnowAda effectively balances hallucination reduction and descriptiveness. Our results show that KnowAda outperforms various baselines in both automatic metrics and human evaluations.

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance
Omer Nahum | Nitay Calderon | Orgad Keller | Idan Szpektor | Roi Reichart
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

NLP benchmarks rely on standardized datasets for training and evaluating models and are crucial for advancing the field. Traditionally, expert annotations ensure high-quality labels; however, the cost of expert annotation does not scale well with the growing demand for larger datasets required by modern models. While crowd-sourcing provides a more scalable solution, it often comes at the expense of annotation precision and consistency. Recent advancements in large language models (LLMs) offer new opportunities to enhance the annotation process, particularly for detecting label errors in existing datasets. In this work, we consider the recent approach of LLM-as-a-judge, leveraging an ensemble of LLMs to flag potentially mislabeled examples. We conduct a case study on four factual consistency datasets from the TRUE benchmark, spanning diverse NLP tasks, and on SummEval, which uses Likert-scale ratings of summary quality across multiple dimensions. We empirically analyze the labeling quality of existing datasets and compare expert, crowd-sourced, and LLM-based annotations in terms of the agreement, label quality, and efficiency, demonstrating the strengths and limitations of each annotation method. Our findings reveal a substantial number of label errors, which, when corrected, induce a significant upward shift in reported model performance. This suggests that many of the LLMs’ so-called mistakes are due to label errors rather than genuine model failures. Additionally, we discuss the implications of mislabeled data and propose methods to mitigate them in training to improve performance.

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
Aviv Slobodkin | Hagai Taitelbaum | Yonatan Bitton | Brian Gordon | Michal Sokolik | Nitzan Bitton Guetta | Almog Gueta | Royi Rassin | Dani Lischinski | Idan Szpektor
Findings of the Association for Computational Linguistics: EMNLP 2025

Subject-driven text-to-image (T2I) generation aims to produce images that align with a given textual description, while preserving the visual identity from a referenced subject image. Despite its broad downstream applicability—ranging from enhanced personalization in image generation to consistent character representation in video rendering—progress in this field is limited by the lack of reliable automatic evaluation. Existing methods either assess only one aspect of the task (i.e., textual alignment or subject preservation), misalign with human judgments, or rely on costly API-based evaluation. To address this gap, we introduce RefVNLI, a cost-effective metric that evaluates both textual alignment and subject preservation in a single run. Trained on a large-scale dataset derived from video-reasoning benchmarks and image perturbations, RefVNLI outperforms or statistically matches existing baselines across multiple benchmarks and subject categories (e.g., Animal, Object), achieving up to 6.4-point gains in textual alignment and 5.9-point gains in subject preservation.

MetaFaith: Faithful Natural Language Uncertainty Expression in LLMs
Gabrielle Kaili-May Liu | Gal Yona | Avi Caciularu | Idan Szpektor | Tim G. J. Rudner | Arman Cohan
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

A critical component in the trustworthiness of LLMs is reliable uncertainty communication, yet LLMs often use assertive language when conveying false claims, leading to over-reliance and eroded trust. We present the first systematic study of _faithful confidence calibration_ of LLMs, benchmarking models’ ability to use linguistic expressions of uncertainty that _faithfully reflect_ their intrinsic uncertainty, across a comprehensive array of models, datasets, and prompting strategies. Our results demonstrate that LLMs largely fail at this task, and that existing interventions are insufficient: standard prompt approaches provide only marginal gains, and existing, factuality-based calibration techniques can even harm faithful calibration. To address this critical gap, we introduce MetaFaith, a novel prompt-based calibration approach inspired by human metacognition. We show that MetaFaith robustly improves faithful calibration across diverse models and task domains, enabling up to 61% improvement in faithfulness and achieving an 83% win rate over original generations as judged by humans.

2024

Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
João Bordalo | Vasco Ramos | Rodrigo Valério | Diogo Glória-Silva | Yonatan Bitton | Michal Yarom | Idan Szpektor | Joao Magalhaes
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent textual steps, Large Vision/Language Models (LVLMs) are less capable of generating accompanying image sequences. The most challenging aspect is that each generated image needs to adhere to the relevant textual step instruction, as well as be visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences, which integrates a Latent Diffusion Model (LDM) with an LLM to transform the sequence into a caption to maintain the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism to initialise reverse diffusion processes with a latent vector iteration from a previously generated image from a relevant step. Both strategies will condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of the cases against 26.6% for the second best method. In addition, automatic metrics showed that the proposed method maintains semantic coherence and visual consistency across steps in both domains.

Multilingual Instruction Tuning With Just a Pinch of Multilinguality
Uri Shaham | Jonathan Herzig | Roee Aharoni | Idan Szpektor | Reut Tsarfaty | Matan Eyal
Findings of the Association for Computational Linguistics: ACL 2024

As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages from the pre-training corpus. We first show that many languages transfer some instruction-following capabilities to other languages from even monolingual tuning. Furthermore, we find that only 40 multilingual examples integrated in an English tuning set substantially improve multilingual instruction-following, both in seen and unseen languages during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in multiple languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that diversifying the instruction tuning set with even just 2-4 languages significantly improves cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-responses.

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance
Omer Goldman | Avi Caciularu | Matan Eyal | Kris Cao | Idan Szpektor | Reut Tsarfaty
Findings of the Association for Computational Linguistics: ACL 2024

Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, which can be viewed as 0-gram language modeling where equal probability is assigned to all tokens. We also demonstrate the empirical importance of compression for downstream success of pre-trained language models. We control the compression ability of several BPE tokenizers by varying the number of documents available during their training: from 1 million documents to a character-based tokenizer equivalent to no training data at all. We then pre-train English language models based on those tokenizers and fine-tune them over several tasks. We show that there is a correlation between tokenizers’ compression and models’ downstream performance, suggesting that compression is a reliable intrinsic indicator of tokenization quality. These correlations are more pronounced for generation tasks (over classification) or for smaller models (over large ones). We replicated a representative part of our experiments on Turkish and found similar results, confirming that our results hold for languages with typological characteristics dissimilar to English. We conclude that building better compressing tokenizers is a fruitful avenue for further research and for improving overall model performance.
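To make the notion of tokenizer compression concrete, here is a minimal sketch that measures it as average characters per token over a corpus. That ratio is an assumption chosen for illustration (the paper's exact metric may differ), and the whitespace splitter only stands in for a trained BPE tokenizer.

```python
def compression(texts, tokenize):
    """Average number of characters covered by each token (higher means more compression)."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_chars / n_tokens


corpus = ["the cat sat", "the dog ran"]

# Character-level tokenizer: one token per character, i.e. no compression.
char_level = compression(corpus, list)

# Whitespace words as a stand-in for a tokenizer trained on more data.
word_level = compression(corpus, str.split)
```

The character-level tokenizer gives the lowest possible ratio, which matches the abstract's framing of it as the endpoint equivalent to no training data at all.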

2023

DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering
Ella Neeman | Roee Aharoni | Or Honovich | Leshem Choshen | Idan Szpektor | Omri Abend
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Question answering models commonly have access to two sources of “knowledge” during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA models as it is unclear whether the answer stems from the given non-parametric knowledge or not. This unclarity has implications on issues of trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.

MaXM: Towards Multilingual Visual Question Answering
Soravit Changpinyo | Linting Xue | Michal Yarom | Ashish Thapliyal | Idan Szpektor | Julien Amelot | Xi Chen | Radu Soricut
Findings of the Association for Computational Linguistics: EMNLP 2023

Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA), on both data and modeling fronts. We first propose a translation-based framework to mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple, lightweight, and effective approach as well as benchmark state-of-the-art English and multilingual VQA models. We hope that our benchmark encourages further research on mVQA.

On the Robustness of Dialogue History Representation in Conversational Question Answering: A Comprehensive Study and a New Prompt-based Method
Zorik Gekhman | Nadav Oved | Orgad Keller | Idan Szpektor | Roi Reichart
Transactions of the Association for Computational Linguistics, Volume 11

Most work on modeling the conversation history in Conversational Question Answering (CQA) reports a single main result on a common CQA benchmark. While existing models show impressive results on CQA leaderboards, it remains unclear whether they are robust to shifts in setting (sometimes to more realistic ones), training data size (e.g., from large to small sets) and domain. In this work, we design and conduct the first large-scale robustness study of history modeling approaches for CQA. We find that high benchmark scores do not necessarily translate to strong robustness, and that various methods can perform extremely differently under different settings. Equipped with the insights from our study, we design a novel prompt-based history modeling approach and demonstrate its strong robustness across various settings. Our approach is inspired by existing methods that highlight historic answers in the passage. However, instead of highlighting by modifying the passage token embeddings, we add textual prompts directly in the passage text. Our approach is simple, easy to plug into practically any model, and highly effective, thus we recommend it as a starting point for future model developers. We also hope that our study and insights will raise awareness of the importance of robustness-focused evaluation, in addition to obtaining high leaderboard scores, leading to better CQA systems.
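The idea of highlighting historic answers with text rather than embeddings can be sketched as below: each earlier answer found in the passage is wrapped in a plain-text marker carrying its turn number. The `<k> ... </k>` marker format and the function name `mark_history` are hypothetical, and the sketch assumes each answer occurs verbatim in the passage.

```python
def mark_history(passage, history_answers):
    """Wrap the first occurrence of each earlier answer with a turn-numbered text marker."""
    marked = passage
    for turn, answer in enumerate(history_answers, start=1):
        # Only the first match is marked; answers missing from the passage are skipped.
        marked = marked.replace(answer, f"<{turn}> {answer} </{turn}>", 1)
    return marked


passage = "Marie Curie was born in Warsaw and later moved to Paris."
marked = mark_history(passage, ["Warsaw", "Paris"])
```

Because the markers live in the input text, the same preprocessing can be placed in front of any reader model without touching its embedding layer.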

Multilingual Sequence-to-Sequence Models for Hebrew NLP
Matan Eyal | Hila Noga | Roee Aharoni | Idan Szpektor | Reut Tsarfaty
Findings of the Association for Computational Linguistics: ACL 2023

Recent work attributes progress in NLP to large language models (LMs) with increased model size and large quantities of pretraining data. Despite this, current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages. Additionally, previous work on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only architecture is beneficial for classification tasks, it does not cater well to sub-word prediction tasks, such as Named Entity Recognition, when considering the morphologically rich nature of Hebrew. In this paper we argue that sequence-to-sequence generative architectures are more suitable for large LMs in morphologically rich languages (MRLs) such as Hebrew. We demonstrate this by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, for which we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5, eliminating the need for a separate, specialized, morpheme-based decoder. Using this approach, our experiments show substantial improvements over previously published results on all existing Hebrew NLP benchmarks. These results suggest that multilingual sequence-to-sequence models present a promising building block for NLP for MRLs.

Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual-entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience and conciseness of the generated summaries.

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
Zorik Gekhman | Jonathan Herzig | Roee Aharoni | Chen Elkind | Idan Szpektor
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity, and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain-shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large scale synthetic dataset (1.4M examples), generated using TrueTeacher, and a checkpoint trained on this data.

2022

TRUE: Re-evaluating Factual Consistency Evaluation
Or Honovich | Roee Aharoni | Jonathan Herzig | Hagai Taitelbaum | Doron Kukliansy | Vered Cohen | Thomas Scialom | Idan Szpektor | Avinatan Hassidim | Yossi Matias
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.

TRUE: Re-evaluating Factual Consistency Evaluation
Or Honovich | Roee Aharoni | Jonathan Herzig | Hagai Taitelbaum | Doron Kukliansy | Vered Cohen | Thomas Scialom | Idan Szpektor | Avinatan Hassidim | Yossi Matias
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better methods.

All You May Need for VQA are Image Captions
Soravit Changpinyo | Doron Kukliansy | Idan Szpektor | Xi Chen | Nan Ding | Radu Soricut
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.

2021

Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
Or Honovich | Leshem Choshen | Roee Aharoni | Ella Neeman | Idan Szpektor | Omri Abend
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted Q², compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of Q² against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.

2020

Semantically Driven Sentence Fusion: Modeling and Evaluation
Eyal Ben-David | Orgad Keller | Eric Malmi | Idan Szpektor | Roi Reichart
Findings of the Association for Computational Linguistics: EMNLP 2020

Sentence fusion is the task of joining related sentences into coherent text. Current training and evaluation schemes for this task are based on single reference ground-truths and do not account for valid fusion variants. We show that this hinders models from robustly capturing the semantic relationship between input sentences. To alleviate this, we present an approach in which ground-truth solutions are automatically expanded into multiple references via curated equivalence classes of connective phrases. We apply this method to a large-scale dataset and use the augmented dataset for both model training and evaluation. To improve the learning of semantic representation using multiple references, we enrich the model with auxiliary discourse classification tasks under a multi-tasking framework. Our experiments highlight the improvements of our approach over state-of-the-art models.

2019

A Joint Named-Entity Recognizer for Heterogeneous Tag-sets Using a Tag Hierarchy
Genady Beryozkin | Yoel Drori | Oren Gilon | Tzvika Hartman | Idan Szpektor
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We study a variant of domain adaptation for named-entity recognition where multiple, heterogeneously tagged training sets are available. Furthermore, the test tag-set is not identical to any individual training tag-set. Yet, the relations between all tags are provided in a tag hierarchy, covering the test tags as a combination of training tags. This setting occurs when various datasets are created using different annotation schemes. It also arises when a tag-set is extended with a new tag by annotating only the new tag in a new dataset. We propose to use the given tag hierarchy to jointly learn a neural network that shares its tagging layer among all tag-sets. We compare this model to combining independent models and to a model based on the multitasking approach. Our experiments show the benefit of the tag-hierarchy model, especially when facing non-trivial consolidation of tag-sets.

Audio De-identification - a New Entity Recognition Task
Ido Cohn | Itay Laish | Genady Beryozkin | Gang Li | Izhak Shafran | Idan Szpektor | Tzvika Hartman | Avinatan Hassidim | Yossi Matias
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

Named Entity Recognition (NER) has been mostly studied in the context of written text. Specifically, NER is an important step in de-identification (de-ID) of medical records, many of which are recorded conversations between a patient and a doctor. In such recordings, audio spans with personal information should be redacted, similar to the redaction of sensitive character spans in de-ID for written text. The application of NER in the context of audio de-identification has yet to be fully investigated. To this end, we define the task of audio de-ID, in which audio spans with entity mentions should be detected. We then present our pipeline for this task, which involves Automatic Speech Recognition (ASR), NER on the transcript text, and text-to-audio alignment. Finally, we introduce a novel metric for audio de-ID and a new evaluation benchmark consisting of a large labeled segment of the Switchboard and Fisher audio datasets, and we detail our pipeline’s results on it.

DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion
Mor Geva | Eric Malmi | Idan Szpektor | Jonathan Berant
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Sentence fusion is the task of joining several independent sentences into a single coherent text. Current datasets for sentence fusion are small and insufficient for training modern neural models. In this paper, we propose a method for automatically generating fusion examples from raw text and present DiscoFuse, a large-scale dataset for discourse-based sentence fusion. We author a set of rules for identifying a diverse set of discourse phenomena in raw text and decomposing the text into two independent sentences. We apply our approach to two document collections, Wikipedia and Sports articles, yielding 60 million fusion examples annotated with the discourse information required to reconstruct the fused text. We develop a sequence-to-sequence model on DiscoFuse and thoroughly analyze its strengths and weaknesses with respect to the various discourse phenomena, using both automatic and human evaluation. Finally, we conduct transfer learning experiments with WebSplit, a recent dataset for text simplification. We show that pretraining on DiscoFuse substantially improves performance on WebSplit when viewed as a sentence fusion task.

2016

Syntactic Parsing of Web Queries with Question Intent
Yuval Pinter | Roi Reichart | Idan Szpektor
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Probabilistic Modeling of Joint-context in Distributional Similarity
Oren Melamud | Ido Dagan | Jacob Goldberger | Idan Szpektor | Deniz Yuret
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

2013

A Two Level Model for Context Sensitive Inference Rules
Oren Melamud | Jonathan Berant | Ido Dagan | Jacob Goldberger | Idan Szpektor
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Using Lexical Expansion to Learn Inference Rules from Sparse Data
Oren Melamud | Ido Dagan | Jacob Goldberger | Idan Szpektor
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Generating Synthetic Comparable Questions for News Articles
Oleg Rokhlenko | Idan Szpektor
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Learning Verb Inference Rules from Linguistically-Motivated Evidence
Hila Weisman | Jonathan Berant | Idan Szpektor | Ido Dagan
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Classification-based Contextual Preferences
Shachar Mirkin | Ido Dagan | Lili Kotlerman | Idan Szpektor
Proceedings of the TextInfer 2011 Workshop on Textual Entailment

2010

Textual Entailment
Mark Sammons | Idan Szpektor | V. G. Vinod Vydiswaran
NAACL HLT 2010 Tutorial Abstracts

Generating Entailment Rules from FrameNet
Roni Ben Aharon | Idan Szpektor | Ido Dagan
Proceedings of the ACL 2010 Conference Short Papers

2009

Directional Distributional Similarity for Lexical Expansion
Lili Kotlerman | Ido Dagan | Idan Szpektor | Maayan Zhitomirsky-Geffet
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Source-Language Entailment Modeling for Translating Unknown Terms
Shachar Mirkin | Lucia Specia | Nicola Cancedda | Ido Dagan | Marc Dymetman | Idan Szpektor
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Augmenting WordNet-based Inference with Argument Mapping
Idan Szpektor | Ido Dagan
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)

2008

Contextual Preferences
Idan Szpektor | Ido Dagan | Roy Bar-Haim | Jacob Goldberger
Proceedings of ACL-08: HLT

Learning Entailment Rules for Unary Templates
Idan Szpektor | Ido Dagan
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

Cross Lingual and Semantic Retrieval for Cultural Heritage Appreciation
Idan Szpektor | Ido Dagan | Alon Lavie | Danny Shacham | Shuly Wintner
Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007)

Instance-based Evaluation of Entailment Rule Acquisition
Idan Szpektor | Eyal Shnarch | Ido Dagan
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

Semantic Inference at the Lexical-Syntactic Level for Textual Entailment Recognition
Roy Bar-Haim | Ido Dagan | Iddo Greental | Idan Szpektor | Moshe Friedman
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing

2006

Investigating a Generic Paraphrase-Based Approach for Relation Extraction
Lorenza Romano | Milen Kouylekov | Idan Szpektor | Ido Dagan | Alberto Lavelli
11th Conference of the European Chapter of the Association for Computational Linguistics

2005

Definition and Analysis of Intermediate Entailment Levels
Roy Bar-Haim | Idan Szpektor | Oren Glickman
Proceedings of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment

2004

Scaling Web-based Acquisition of Entailment Relations
Idan Szpektor | Hristo Tanev | Ido Dagan | Bonaventura Coppola
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

Co-authors

Jacob Goldberger 4

Jonathan Berant 3

Yonatan Bitton 3

Leshem Choshen 3

Doron Kukliansy 3

Hagai Taitelbaum 3

Reut Tsarfaty 3

Genady Beryozkin 2

Soravit Changpinyo 2

Zorik Gekhman 2

Tzvika Hartman 2

Lili Kotlerman 2

Gabrielle Kaili-May Liu 2

Shachar Mirkin 2

Thomas Scialom 2

Julien Amelot 1

Olivier Bachem 1

Roni Ben Aharon 1

Eyal Ben-David 1

Assaf Ben-Kish 1

Nitzan Bitton Guetta 1

João Bordalo 1

Nitay Calderon 1

Nicola Cancedda 1

Geoffrey Cideron 1

Bonaventura Coppola 1

Robert Dadashi 1

Marc Dymetman 1

Alex Fabrikant 1

Moshe Friedman 1

Maayan Geffet 1

Matthieu Geist 1

Sertan Girgin 1

Oren Glickman 1

Diogo Glória-Silva 1

Iddo Greental 1

Leonard Hussenot 1

Maxim Ifergan 1

Milen Kouylekov 1

Alberto Lavelli 1

Dani Lischinski 1

João Magalhães 1

Nikola Momchev 1

Olivier Pietquin 1

Sabela Ramos Garea 1

Hannah Rashkin 1

Oleg Rokhlenko 1

Lorenza Romano 1

Tim G. J. Rudner 1

Danny Shacham 1

Izhak Shafran 1

Aviv Slobodkin 1

Michal Sokolik 1

Piotr Stanczyk 1

Ashish Thapliyal 1

Rodrigo Valério 1

Nino Vieillard 1

V. G. Vinod Vydiswaran 1

Shuly Wintner 1

Venues