Leshem Choshen - ACL Anthology

Leshem Choshen

2026

Pretraining Language Models for Diachronic Linguistic Change Discovery
Elisabeth Fittschen | Sabrina Xin Li | Tom Lippincott | Leshem Choshen | Craig Messner
Findings of the Association for Computational Linguistics: EACL 2026

Large language models (LLMs) are increasingly used as knowledge discovery tools. Humanistic disciplines like historical linguistics and literary studies have shown interest in this capability. These fields often construct arguments on the basis of distinctions between phenomena like time-period or genre. Such methodological investments complicate reliance on LLMs pretrained over large sets of broadly-collected data. We show that efficient pretraining techniques produce useful models of semantic change over modest historical corpora without allowing potential contamination from anachronistic data. We verify that these trained-from-scratch models better respect historical divisions and are more computationally efficient compared to the standard approach of fine-tuning an existing LLM. We compare the trade-offs in general linguistic fluency versus detecting and characterizing various forms of linguistic change, and provide a pipeline implementation of our approach that can be readily adapted and applied to a wide range of diachronic phenomena.

Stop Guessing When to Stop Testing: Efficient Model Evaluation with Just Enough Data
Ofir Arviv | Kristjan Greenewald | Yotam Perlitz | Hadar Mulian | Michal Shmueli-Scheuer | Leshem Choshen
Findings of the Association for Computational Linguistics: ACL 2026

The inherent rigidity of fixed-size benchmarks makes them an inefficient tool for model evaluation. Diverse evaluation objectives, including model ranking, model selection and testing throughout development, demand varying levels of statistical power. The mismatch between fixed sample sizes and these diverse needs results in either excessive computational cost or compromised reliability – a critical concern for model evaluation. To overcome these limitations, we call for adoption of sequential testing in our field. We provide an adaptive evaluation framework that offers a principled way to navigate the trade-off between efficiency and reliability in model evaluation. Our framework combines the established statistical paradigm of sequential testing with stopping criteria tailored to common evaluation needs such as diminishing returns detection and minimum detectable effect size. We demonstrate its ability to adaptively manage the efficiency-reliability trade-off on the Open VLM Leaderboard, including, for example, an 80% reduction in computational cost compared to fixed-size evaluation (with a 2.5-point CI width allowance) while maintaining statistical significance.
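As a rough illustration of evaluating with "just enough data", the sketch below consumes per-example 0/1 scores and stops once a normal-approximation confidence interval for accuracy is narrower than an allowance. This is a simplified fixed-width stopping rule, not the framework from the paper: the function name, the batch size, the minimum sample count and the 1.96 critical value are all assumptions made for the example, and the rule does not correct for repeated looks at the data the way a proper sequential test would.

```python
import math


def evaluate_until_precise(outcomes, max_ci_width=0.025, z=1.96,
                           batch_size=50, min_samples=100):
    """Consume 0/1 outcomes until the CI for mean accuracy is narrow enough.

    Returns (accuracy_estimate, number_of_examples_used).
    All parameter names and defaults are hypothetical choices for this sketch.
    """
    n = 0
    correct = 0
    for outcome in outcomes:
        n += 1
        correct += outcome
        # Only check the stopping rule at batch boundaries, after a warm-up.
        if n < min_samples or n % batch_size != 0:
            continue
        p = correct / n
        # Full width of the normal-approximation interval for a proportion.
        width = 2 * z * math.sqrt(p * (1 - p) / n)
        if width <= max_ci_width:
            break
    if n == 0:
        raise ValueError("no outcomes were provided")
    return correct / n, n
```

Because `outcomes` can be any iterable, including a generator that runs the model lazily, examples after the stopping point are never evaluated.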

Will it Merge? On The Causes of Model Mergeability
Adir Rahamim | Asaf Yehudai | Boaz Carmeli | Leshem Choshen | Yosi Mass | Yonatan Belinkov
Findings of the Association for Computational Linguistics: ACL 2026

Model merging has emerged as a promising technique for combining multiple fine-tuned models into a single multitask model without retraining. However, the factors that determine whether merging will succeed or fail remain poorly understood. In this work, we investigate why specific models are merged better than others. To do so, we propose a concrete, measurable definition of mergeability. We investigate several potential causes for high or low mergeability, highlighting the base model knowledge as a dominant factor: Models fine-tuned on instances that the base model knows better are more mergeable than models fine-tuned on instances that the base model struggles with. Based on our mergeability definition, we explore a simple weighted merging technique that better preserves weak knowledge in the base model.

We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

Mediocrity is the key for LLM as a Judge Anchor Selection
Shachar Don-Yehiya | Asaf Yehudai | Leshem Choshen | Omri Abend
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The “LLM-as-a-judge” paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez | Laurie Burchell | Catherine Arnett | Rafael Mosquera | Sara Hincapié Monsalve | Thom Vaughan | Damian Stewart | Malte Ostendorff | Idris Abdulmumin | Vukosi Marivate | Shamsuddeen Hassan Muhammad | Atnafu Lambebo Tonja | Hend Al-Khalifa | Nadia Ghezaiel Hammouda | Verrah Akinyi Otiende | Tack Hwa Wong | Jakhongir Saydaliev | Melika Nobakhtian | Muhammad Ravi Shulthan Habibi | Chalamalasetti Kranti | Carol Muchemi | Khang Nguyen | Faisal Muhammad Adam | Luis Frentzen Salim | Reem Alqifari | Cynthia Jayne Amol | Joseph Marvin Imperial | Ilker Kesen | Ahmad Mustafid | Pavel Stepachev | Leshem Choshen | David Anugraha | Hamada Nayel | Seid Muhie Yimam | Vallerie Alexandra Putra | My Chiffon Nguyen | Azmine Toushik Wasi | Gouthami Vadithya | Rob Van Der Goot | Lanwenn ar C’horr | Karan Dua | Andrew Yates | Mithil Bangera | Yeshil Bangera | Hitesh Laxmichand Patel | Shu Okabe | Fenal Ashokbhai Ilasariya | Dmitry Gaynullin | Genta Indra Winata | Yiyuan Li | Juan Pablo Martínez | Amit Agarwal | Ikhlasul Akmal Hanif | Raia Abu Ahmad | Esther Adenuga | Filbert Aurelian Tjiaranata | Weerayut Buaphet | Michael Anugraha | Sowmya Vajjala | Benjamin L Rice | Azril Hafizi Amirudin | Jesujoba Oluwadara Alabi | Srikant Panda | Yassine Toughrai | Bruhan Kyomuhendo | Daniel Ruffinelli | Akshata | Manuel Goulão | Ej Zhou | Ingrid Gabriela Franco Ramirez | Cristina Aggazzotti | Konstantin Dobler | Jun Kevin | Quentin Pagès | Nicholas Andrews | Nuhu Ibrahim | Mattes Ruckdeschel | Amr Keleg | Mike Zhang | Casper Rufaro Muziri | Saron Samuel | Sotaro Takeshita | Kun Kerdthaisong | Luca Foppiano | Rasul Dent | Tommaso Green | Ahmad Mustapha Wali | Kamohelo Makaaka | Vicky Feliren | Inshirah Idris | Hande Celikkanat | Abdulhamid Abubakar | Jean Maillard | Benoît Sagot | Thibault Clérice | Kenton Murray | Sarah K. K. Luger
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID’s value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

2025

The ability to build and reason about models of the world is essential for situated language understanding. But evaluating world modeling capabilities in modern AI systems—especially those based on language models—has proven challenging, in large part because of the difficulty of disentangling conceptual knowledge about the world from knowledge of surface co-occurrence statistics. This paper presents Elements of World Knowledge (EWoK), a framework for evaluating language models’ understanding of the conceptual knowledge underlying world modeling. EWoK targets specific concepts from multiple knowledge domains known to be important for world modeling in humans, from social interactions (help, deceive) to spatial relations (left, right). Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWoK-core-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights large language models (1.3B–70B parameters) and compare them with human performance. All tested models perform worse than humans, with results varying drastically across domains. Performance on social interactions and social properties was highest and performance on physical relations and spatial relations was lowest. Overall, this dataset highlights simple cases where even large models struggle and presents rich avenues for targeted research on LLM world modeling capabilities.

Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs
Maxim Ifergan | Leshem Choshen | Roee Aharoni | Idan Szpektor | Omri Abend
Findings of the Association for Computational Linguistics: NAACL 2025

The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model’s ability to answer a query consistently across languages, and the ability to "store" answers in a shared representation for several languages. We propose a methodology to measure the extent of representation sharing across languages by repurposing knowledge editing methods. We examine LLMs with various multilingual configurations using a new multilingual dataset. We reveal that high consistency does not necessarily imply shared representation, particularly for languages with different scripts. Moreover, we find that script similarity is a dominant factor in representation sharing. Finally, we observe that if LLMs could fully share knowledge across languages, their accuracy in their best-performing language could increase by up to 150% on average. These findings highlight the need for improved multilingual knowledge representation in LLMs and suggest a path for the development of more robust and consistent multilingual LLMs.

DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Eliya Habba | Ofir Arviv | Itay Itzhak | Yotam Perlitz | Elron Bandel | Leshem Choshen | Michal Shmueli-Scheuer | Gabriel Stanovsky
Findings of the Association for Computational Linguistics: ACL 2025

Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more at: https://slab-nlp.github.io/DOVE

Findings of the Third BabyLM Challenge: Accelerating Language Modeling Research with Cognitively Plausible Data
Lucas Charpentier | Leshem Choshen | Ryan Cotterell | Mustafa Omer Gul | Michael Y. Hu | Jing Liu | Jaap Jumelet | Tal Linzen | Aaron Mueller | Candace Ross | Raj Sanjay Shah | Alex Warstadt | Ethan Gotlieb Wilcox | Adina Williams
Proceedings of the First BabyLM Workshop

This report summarizes the findings from the 3rd BabyLM Challenge. The BabyLM Challenge is a shared task aimed at closing the data efficiency gap between human and machine language learners. This year, the challenge was held as part of an expanded BabyLM Workshop that invited paper submissions on topics relevant to the BabyLM effort, including sample-efficient pretraining and cognitive modeling for LMs. For the challenge, we kept the text-only and text–image tracks from previous years, but also introduced a new interaction track, where student models are allowed to learn from feedback from larger teacher models. Furthermore, we introduce a new set of evaluation tasks to assess the “human likeness” of models on a cognitive and linguistic level, limit the total amount of training compute allowed, and measure performance on intermediate checkpoints. We observe that new training objectives and architectures tend to produce the best-performing approaches, and that interaction with teacher models can yield high-quality language models. The strict-small and interaction tracks saw submissions that outperformed the baselines. We do not observe a complete correlation between training FLOPs and performance. This year’s BabyLM Challenge shows that there is still room to innovate in a data-constrained setting, and that community-driven research can yield actionable insights for language modeling.

Proceedings of the First BabyLM Workshop
Lucas Charpentier | Leshem Choshen | Ryan Cotterell | Mustafa Omer Gul | Michael Y. Hu | Jing Liu | Jaap Jumelet | Tal Linzen | Aaron Mueller | Candace Ross | Raj Sanjay Shah | Alex Warstadt | Ethan Gotlieb Wilcox | Adina Williams
Proceedings of the First BabyLM Workshop

Reliable multilingual evaluation is difficult, and culturally appropriate evaluation is even harder to achieve. A common practice to fill this gap is to machine-translate English evaluation sets. However, translation introduces language bias and carries over cultural and regional assumptions from the original questions – often testing knowledge irrelevant to the target audience. In this work, we highlight the extent and impact of these biases and present a multilingual evaluation framework that aims to mitigate them through improved translations and annotation practices. Through a large-scale study involving professional and community translators and annotators, we show that state-of-the-art models excel primarily by learning Western-centric concepts. Notably, we find that model rankings on the full MMLU change when evaluated on a subset of questions explicitly marked as culturally sensitive. We release Global MMLU, a multilingual extension of MMLU across 42 languages, featuring improved translation quality, expanded language coverage, and designated subsets labeled as culturally sensitive and culturally agnostic to enable a more comprehensive and equitable benchmark for evaluating language models across diverse linguistic and cultural contexts.

The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Shachar Don-Yehiya | Leshem Choshen | Omri Abend
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Human-model conversations provide a window into users’ real-world scenarios, behavior, and needs, and thus are a valuable resource for model development and research. While for-profit companies collect user data through the APIs of their models, using it internally to improve their own models, the open source and research community lags behind. We introduce the ShareLM collection, a unified set of human conversations with large language models, and its accompanying plugin, a Web extension for voluntarily contributing user-model conversations. Where few platforms share their chats, the ShareLM plugin adds this functionality, thus allowing users to share conversations from most platforms. The plugin allows the user to rate their conversations, both at the conversation and the response levels, and delete conversations they prefer to keep private before they ever leave the user’s local storage.

2024

Holmes ⌕ A Benchmark to Assess the Linguistic Competence of Language Models
Andreas Waldis | Yotam Perlitz | Leshem Choshen | Yufang Hou | Iryna Gurevych
Transactions of the Association for Computational Linguistics, Volume 12

We introduce Holmes, a new benchmark designed to assess language models’ (LMs’) linguistic competence—their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs’ internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs’ linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.

Efficient Benchmarking (of Language Models)
Yotam Perlitz | Elron Bandel | Ariel Gera | Ofir Arviv | Liat Ein-Dor | Eyal Shnarch | Noam Slonim | Michal Shmueli-Scheuer | Leshem Choshen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has raised little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions by using a new measure – Decision Impact on Reliability, DIoR for short. We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples. Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by x100 or more.

Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI
Elron Bandel | Yotam Perlitz | Elad Venezian | Roni Friedman | Ofir Arviv | Matan Orbach | Shachar Don-Yehiya | Dafna Sheinwald | Ariel Gera | Leshem Choshen | Michal Shmueli-Scheuer | Yoav Katz
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)

In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt

Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs)
Leshem Choshen | Ariel Gera | Yotam Perlitz | Michal Shmueli-Scheuer | Gabriel Stanovsky
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries

General-Purpose Language Models have changed the world of Natural Language Processing, if not the world itself. The evaluation of such versatile models, while supposedly similar to evaluation of generation models before them, in fact presents a host of new evaluation challenges and opportunities. In this Tutorial, we will start from the building blocks of evaluation. The tutorial welcomes people from diverse backgrounds and assumes little familiarity with metrics, datasets, prompts and benchmarks. It will lay the foundations and explain the basics and their importance, while touching on the major points and breakthroughs of the recent era of evaluation. It will also compare traditional evaluation methods – which are still widely used – to newly developed methods. We will contrast new to old approaches, from evaluating on many-task benchmarks rather than on dedicated datasets to efficiency constraints, and from testing stability and prompts on in-context learning to using the models themselves as evaluation metrics. Finally, the tutorial will cover practical issues, ranging from reviewing widely-used benchmarks and prompt banks to efficient evaluation.

Jump to Conclusions: Short-Cutting Transformers with Linear Transformations
Alexander Yom Din | Taelin Karidi | Leshem Choshen | Mor Geva
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in-between. In this work, we suggest a simple method for such casting, using linear transformations. This approximation far exceeds the prevailing practice of inspecting hidden representations from all layers in the space of the final layer. Moreover, in the context of language modeling, our method produces more accurate predictions from hidden layers, across various model scales, architectures, and data distributions. This allows “peeking” into intermediate representations, showing that GPT-2 and BERT often predict the final output already in early layers. We then demonstrate the practicality of our method to recent early exit strategies, showing that when aiming, for example, at retention of 95% accuracy, our approach saves an additional 7.9% of layers for GPT-2 and 5.4% of layers for BERT. Last, we extend our method to linearly approximate sub-modules, finding that attention is most tolerant to this change. Our code and learned mappings are publicly available at https://github.com/sashayd/mat.

Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability
Afra Feyza Akyürek | Ekin Akyürek | Leshem Choshen | Derry Wijaya | Jacob Andreas
Findings of the Association for Computational Linguistics: ACL 2024

While language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims, these generally do not reflect a globally coherent, manipulable model of the world. As a consequence, current LMs also generate incorrect or nonsensical content, and are difficult to edit and bring up to date. We present a method called Deductive Closure Training (DCT) that uses LMs themselves to identify implications of (and contradictions within) the text that they generate, yielding an efficient self-supervised procedure for improving LM factuality. Given a collection of seed documents, DCT prompts LMs to generate additional text implied by these documents, reason globally about the correctness of this generated text, and finally fine-tune on text inferred to be correct. Given seed documents from a trusted source, DCT provides a tool for supervised model updating; if seed documents are sampled from the LM itself, DCT enables fully unsupervised fine-tuning for improved coherence and accuracy. Across the CREAK, MQuAKE, and Reversal Curse datasets, supervised DCT improves LM fact verification and text generation accuracy by 3-26%; on CREAK, fully unsupervised DCT improves verification accuracy by 12%. These results show that LMs’ reasoning capabilities during inference can be leveraged during training to improve their reliability.

NumeroLogic: Number Encoding for Enhanced LLMs’ Numerical Reasoning
Eli Schwartz | Leshem Choshen | Joseph Shtok | Sivan Doveh | Leonid Karlinsky | Assaf Arbelle
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to the non-intuitive textual representation of numbers. When a digit is read or generated by a causal language model, it does not know its place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose a simple adjustment to how numbers are represented by including the count of digits before each number. For instance, instead of “42”, we suggest using “2:42” as the new format. This approach, which we term NumeroLogic, offers an added advantage in number generation by serving as a Chain of Thought (CoT). By requiring the model to consider the number of digits first, it enhances the reasoning process before generating the actual number. We use arithmetic tasks to demonstrate the effectiveness of the NumeroLogic formatting. We further demonstrate NumeroLogic’s applicability to general natural language modeling, improving language understanding performance in the MMLU benchmark.
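The digit-count prefix described in this abstract can be sketched in a few lines. The example below is a minimal illustration that handles plain runs of digits only; the function names are hypothetical, and how the authors treat decimals, signs or thousands separators is not stated here, so none of that is attempted.

```python
import re

# A run of one or more digits (integers only; an assumption of this sketch).
_DIGITS = re.compile(r"\d+")
# An encoded number: "<count>:<digits>".
_ENCODED = re.compile(r"(\d+):(\d+)")


def numerologic_encode(text: str) -> str:
    """Prefix every run of digits with its digit count, e.g. '42' -> '2:42'."""
    return _DIGITS.sub(lambda m: f"{len(m.group())}:{m.group()}", text)


def numerologic_decode(text: str) -> str:
    """Strip the prefix wherever the count matches the number of digits."""
    def strip_prefix(m: re.Match) -> str:
        count, digits = m.group(1), m.group(2)
        # Leave the text untouched if the prefix is not a valid digit count.
        return digits if int(count) == len(digits) else m.group(0)
    return _ENCODED.sub(strip_prefix, text)


print(numerologic_encode("12 + 345 = 357"))  # 2:12 + 3:345 = 3:357
```

Note that decoding is only a best-effort inverse: source text that already contains patterns such as a clock time "2:42" would be indistinguishable from an encoded number.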

Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion
Kerem Zaman | Leshem Choshen | Shashank Srivastava
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Model fusion research aims to aggregate the knowledge of multiple individual models to enhance performance by combining their weights. In this work, we study the inverse problem: investigating whether model fusion can be used to reduce unwanted knowledge. We investigate the effects of model fusion in three scenarios: the learning of shortcuts, social biases, and memorization of training data in fine-tuned language models. Through experiments covering classification and generation tasks, our analysis highlights that shared knowledge among models is enhanced during model fusion, while unshared knowledge is usually forgotten. Based on this observation, we demonstrate the potential of model fusion as a debiasing tool and showcase its efficacy in addressing privacy concerns associated with language models.

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Michael Y. Hu | Aaron Mueller | Candace Ross | Adina Williams | Tal Linzen | Chengxu Zhuang | Ryan Cotterell | Leshem Choshen | Alex Warstadt | Ethan Gotlieb Wilcox
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among other abilities. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track. From 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submissions outperformed the baselines in the multimodal track. In follow-up analyses, we found a strong relationship between training FLOPs and average performance across tasks, and that the best-performing submissions proposed changes to the training data, training objective, and model architecture. This year’s BabyLM Challenge shows that there is still significant room for innovation in this setting, in particular for image-text modeling, but community-driven research can yield actionable insights about effective strategies for small-scale language modeling.

The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Michael Y. Hu | Aaron Mueller | Candace Ross | Adina Williams | Tal Linzen | Chengxu Zhuang | Leshem Choshen | Ryan Cotterell | Alex Warstadt | Ethan Gotlieb Wilcox
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in currently available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available in the platform. The platform continues to be online, open to contributions from the community.

Label-Efficient Model Selection for Text Generation
Shir Ashury Tahan | Ariel Gera | Benjamin Sznajder | Leshem Choshen | Liat Ein-Dor | Eyal Shnarch
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model for selecting between models, prompts and configurations. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations – by up to 75% – while maintaining high evaluation reliability.

2023

Knowledge is a Region in Weight Space for Fine-tuned Language Models
Almog Gueta | Elad Venezian | Colin Raffel | Noam Slonim | Yoav Katz | Leshem Choshen
Findings of the Association for Computational Linguistics: EMNLP 2023

Research on neural networks has focused on understanding a single model trained on a single dataset. However, relatively little is known about the relationships between different models, particularly those trained or tested on different datasets. We address this by studying how the weight space and the underlying loss landscape of different models are interconnected. Specifically, we demonstrate that finetuned models that were optimized for high performance reside in well-defined regions in weight space, and vice versa – that any model that resides anywhere in those regions also exhibits high performance. Notably, we show that language models that have been finetuned on the same dataset form a tight cluster in the weight space, while models finetuned on different datasets from the same underlying task form a looser cluster. Moreover, traversing around the region between the models leads to new models that perform comparably or even better than models obtained via finetuning, even on tasks that the original models were not finetuned on. Our findings provide insight into the relationships between models, demonstrating that a model positioned between two similar models can acquire the knowledge of both. We leverage this and design a method for selecting a better model for efficient finetuning. Specifically, we show that starting from the center of the region is as effective as, if not more effective than, using the pretrained model in 11 out of 12 datasets, resulting in an average accuracy improvement of 3.06.

Where to start? Analyzing the potential value of intermediate models
Leshem Choshen | Elad Venezian | Shachar Don-Yehiya | Noam Slonim | Yoav Katz
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme, over a wide range of English classification tasks. Surprisingly, our analysis suggests that the potential intertraining gain can be analyzed independently for the target dataset under consideration, and for a base model being considered as a starting point. Hence, a performant model is generally strong, even if its training data was not aligned with the target dataset. Furthermore, we leverage our analysis to propose a practical and efficient approach to determine if and how to select a base model in real-world settings. Finally, we release a continually updated ranking of the best models in the HuggingFace hub per architecture.

Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney
Shachar Don-Yehiya | Leshem Choshen | Omri Abend
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Generating images with a Text-to-Image model often requires multiple trials, where human users iteratively update their prompt based on feedback, namely the output image. Taking inspiration from cognitive work on reference games and dialogue alignment, this paper analyzes the dynamics of the user prompts along such iterations. We compile a dataset of iterative interactions of human users with Midjourney. Our analysis then reveals that prompts predictably converge toward specific traits along these iterations. We further study whether this convergence is due to human users realizing they missed important details, or due to adaptation to the model’s “preferences”, producing better images for a specific language style. We show initial evidence that both possibilities are at play. The possibility that users adapt to the model’s preference raises concerns about reusing user data for further training. The prompts may be biased towards the preferences of a specific model, rather than aligned with human intentions and natural manner of expression.

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt | Aaron Mueller | Leshem Choshen | Ethan Wilcox | Chengxu Zhuang | Juan Ciro | Rafael Mosquera | Bhargavi Paranjabe | Adina Williams | Tal Linzen | Ryan Cotterell
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning
Alex Warstadt | Aaron Mueller | Leshem Choshen | Ethan Wilcox | Chengxu Zhuang | Juan Ciro | Rafael Mosquera | Bhargavi Paranjabe | Adina Williams | Tal Linzen | Ryan Cotterell
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

MuLER: Detailed and Scalable Reference-based Evaluation
Taelin Karidi | Leshem Choshen | Gal Patel | Omri Abend
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis which can lead to targeted improvement efforts for specific phenomena. We perform experiments in both synthetic and naturalistic settings to support MuLER’s validity and showcase its usability in MT evaluation, and other tasks, such as summarization. Analyzing all submissions to WMT in 2014-2020, we find consistent trends. For example, nouns and verbs are among the most frequent POS tags. However, they are among the hardest to translate. Performance on most POS tags improves with overall system performance, but a few are not thus correlated (their identity changes from language to language). Preliminary experiments with summarization reveal similar trends.

DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering
Ella Neeman | Roee Aharoni | Or Honovich | Leshem Choshen | Idan Szpektor | Omri Abend
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Question answering models commonly have access to two sources of “knowledge” during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA models as it is unclear whether the answer stems from the given non-parametric knowledge or not. This lack of clarity has implications for issues of trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.

ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning
Shachar Don-Yehiya | Elad Venezian | Colin Raffel | Noam Slonim | Leshem Choshen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Pretraining has been shown to scale well with compute, data size and data diversity. Multitask learning trains on a mixture of supervised datasets and produces improved performance compared to self-supervised pretraining. Until now, massively multitask learning required simultaneous access to all datasets in the mixture and heavy compute resources that are only available to well-resourced teams. In this paper, we propose ColD Fusion, a method that provides the benefits of multitask learning but leverages distributed computation and requires limited communication and no sharing of data. Consequently, ColD Fusion can create a synergistic loop, where finetuned models can be recycled to continually improve the pretrained model they are based on. We show that ColD Fusion yields comparable benefits to multitask training by producing a model that (a) attains strong performance on all of the datasets it was multitask trained on and (b) is a better starting point for finetuning on unseen datasets. We find ColD Fusion outperforms RoBERTa and even previous multitask models. Specifically, when training and testing on 35 diverse datasets, a ColD Fusion-based model outperforms RoBERTa by 2.19 points on average without any changes to the architecture.

2022

Semantics-aware Attention Improves Neural Machine Translation
Aviv Slobodkin | Leshem Choshen | Omri Abend
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

The integration of syntactic structures into Transformer machine translation has shown positive results, but to our knowledge, no work has attempted to do so with semantic structures. In this work we propose two novel parameter-free methods for injecting semantic information into Transformers; both rely on semantics-aware masking of (some of) the attention heads. One method operates on the encoder, through a Scene-Aware Self-Attention (SASA) head; the other operates on the decoder, through a Scene-Aware Cross-Attention (SACrA) head. We show a consistent improvement over the vanilla Transformer and syntax-aware models for four language pairs. We further show an additional gain when using both semantic and syntactic structures in some language pairs.

GrASP: A Library for Extracting and Exploring Human-Interpretable Textual Patterns
Piyawat Lertvittayakumjorn | Leshem Choshen | Eyal Shnarch | Francesca Toni
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Data exploration is an important step of every data science and machine learning project, including those involving textual data. We provide a novel language tool, in the form of a publicly available Python library for extracting patterns from textual data. The library integrates a first public implementation of the existing GrASP algorithm. It allows users to extract patterns using a number of general-purpose built-in linguistic attributes (such as hypernyms, part-of-speech tags, and syntactic dependency tags), as envisaged for the original algorithm, as well as domain-specific custom attributes which can be incorporated into the library by implementing two functions. The library is equipped with a web-based interface empowering human users to conveniently explore data via the extracted patterns, using complementary pattern-centric and example-centric views: the former includes a reading in natural language and statistics of each extracted pattern; the latter shows applications of each extracted pattern to training examples. We demonstrate the usefulness of the library in classification (spam detection and argument mining), model analysis (machine translation), and artifact discovery in datasets (SNLI and 20Newsgroups).

PreQuEL: Quality Estimation of Machine Translation Outputs in Advance
Shachar Don-Yehiya | Leshem Choshen | Omri Abend
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation, thus eschewing unnecessary resource allocation when translation quality is bound to be low. PreQuEL can be defined relative to a given MT system (e.g., some industry service) or generally relative to the state-of-the-art. From a theoretical perspective, PreQuEL places the focus on the source text, tracing properties, possibly linguistic features, that make a sentence harder to machine translate. We develop a baseline model for the task and analyze its performance. We also develop a data augmentation method (from parallel corpora), that improves results substantially. We show that this augmentation method can improve the performance of the Quality-Estimation task as well. We investigate the properties of the input text that our model is sensitive to, by testing it on challenge sets and different languages. We conclude that it is aware of syntactic and semantic distinctions, and correlates and even over-emphasizes the importance of standard NLP features.

Label Sleuth is an open source platform for building text classifiers which does not require coding skills or machine learning knowledge (project website: https://www.label-sleuth.org/; screencast video: https://vimeo.com/735675461). Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a classifier generally requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for: (1) being a no-code system, making NLP accessible for non-experts; (2) guiding its users throughout the entire labeling process until they obtain their desired classifier, making the process efficient, from cold start to a classifier in a few hours; and (3) being open for configuration and extension by developers. By open sourcing Label Sleuth we hope to build a community of users and developers that will widen the utilization of NLP models.

Enhancing the Transformer Decoder with Transition-based Syntax
Leshem Choshen | Omri Abend
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Notwithstanding recent advances, syntactic generalization remains a challenge for text decoders. While some studies showed gains from incorporating source-side symbolic syntactic and semantic structure into text generation Transformers, very little work addressed the decoding of such structure. We propose a general, transition-based approach for tree decoding. Examining the challenging test case of incorporating Universal Dependencies syntax into machine translation, we present substantial improvements on test sets that focus on syntactic generalization, while presenting improved or comparable performance on standard MT benchmarks. Further qualitative analysis addresses cases where syntactic generalization in the vanilla Transformer decoder is inadequate and demonstrates the advantages afforded by integrating syntactic information.

On Neurons Invariant to Sentence Structural Changes in Neural Machine Translation
Gal Patel | Leshem Choshen | Omri Abend
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

We present a methodology that explores how sentence structure is reflected in neural representations of machine translation systems. We demonstrate our model-agnostic approach with the Transformer English-German translation model. We analyze neuron-level correlation of activations between paraphrases while discussing the methodology challenges and the need for confound analysis to isolate the effects of shallow cues. We find that similarity between activation patterns can be mostly accounted for by similarity in word choice and sentence length. Following that, we manipulate neuron activations to control the syntactic form of the output. We show this intervention to be somewhat successful, indicating that deep models capture sentence-structure distinctions, despite finding no such indication at the neuron level. To conduct our experiments, we develop a semi-automatic method to generate meaning-preserving minimal pair paraphrases (active-passive voice and adverbial clause-noun phrase) and compile a corpus of such pairs.

Reinforcement Learning with Large Action Spaces for Neural Machine Translation
Asaf Yehudai | Leshem Choshen | Lior Fox | Omri Abend
Proceedings of the 29th International Conference on Computational Linguistics

Applying reinforcement learning (RL) following maximum likelihood estimation (MLE) pre-training is a versatile method for enhancing neural machine translation (NMT) performance. However, recent work has argued that the gains produced by RL for NMT are mostly due to promoting tokens that have already received a fairly high probability in pre-training. We hypothesize that the large action space is a main obstacle to RL’s effectiveness in MT, and conduct two sets of experiments that lend support to our hypothesis. First, we find that reducing the size of the vocabulary improves RL’s effectiveness. Second, we find that effectively reducing the dimension of the action space without changing the vocabulary also yields notable improvement as evaluated by BLEU, semantic similarity, and human evaluation. Indeed, by initializing the network’s final fully connected layer (which maps the network’s internal dimension to the vocabulary dimension) with a layer that generalizes over similar actions, we obtain a substantial improvement in RL performance: 1.5 BLEU points on average.

The Grammar-Learning Trajectories of Neural Language Models
Leshem Choshen | Guy Hacohen | Daphna Weinshall | Omri Abend
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The learning trajectories of linguistic phenomena in humans provide insight into linguistic representation, beyond what can be gleaned from inspecting the behavior of an adult speaker. To apply a similar approach to analyze neural language models (NLM), it is first necessary to establish that different models are similar enough in the generalizations they make. In this paper, we show that NLMs with different initialization, architecture, and training data acquire linguistic phenomena in a similar order, despite their different end performance. These findings suggest that there is some mutual inductive bias that underlies these models’ learning of linguistic phenomena. Taking inspiration from psycholinguistics, we argue that studying this inductive bias is an opportunity to study the linguistic representation implicit in NLMs. Leveraging these findings, we compare the relative performance on different phenomena at varying learning stages with simpler reference models. Results suggest that NLMs exhibit consistent “developmental” stages. Moreover, we find the learning trajectory to be approximately one-dimensional: given an NLM with a certain overall performance, it is possible to predict what linguistic generalizations it has already acquired. Initial analysis of these stages presents phenomena clusters (notably morphological ones), whose performance progresses in unison, suggesting a potential link between the generalizations behind them.

Cluster & Tune: Boost Cold Start Performance in Text Classification
Eyal Shnarch | Ariel Gera | Alon Halfon | Lena Dankin | Leshem Choshen | Ranit Aharonov | Noam Slonim
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone to producing poor performance. We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task, between the pre-training and fine-tuning phases. As such an intermediate task, we perform clustering and train the pre-trained model on predicting the cluster labels. We test this hypothesis on various data sets, and show that this additional classification phase can significantly improve performance, mainly for topical classification tasks, when the number of labeled instances available for fine-tuning is only a couple of dozen to a few hundred.

2021

Mediators in Determining what Processing BERT Performs First
Aviv Slobodkin | Leshem Choshen | Omri Abend
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Probing neural models for the ability to perform downstream tasks using their activation patterns is often used to localize what parts of the network specialize in performing what tasks. However, little work addressed potential mediating factors in such comparisons. As a test-case mediating factor, we consider the prediction’s context length, namely the length of the span whose processing is minimally required to perform the prediction. We show that not controlling for context length may lead to contradictory conclusions as to the localization patterns of the network, depending on the distribution of the probing dataset. Indeed, when probing BERT with seven tasks, we find that it is possible to get 196 different rankings between them when manipulating the distribution of context lengths in the probing dataset. We conclude by presenting best practices for conducting such comparisons in the future.

Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
Or Honovich | Leshem Choshen | Roee Aharoni | Ella Neeman | Idan Szpektor | Omri Abend
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted Q², compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of Q² against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.

2020

Unsupervised Expressive Rules Provide Explainability and Assist Human Experts Grasping New Domains
Eyal Shnarch | Leshem Choshen | Guy Moshkowich | Ranit Aharonov | Noam Slonim
Findings of the Association for Computational Linguistics: EMNLP 2020

Approaching new data can be quite daunting: you do not know how your categories of interest are realized in it, there is commonly no labeled data at hand, and the performance of domain adaptation methods is unsatisfactory. Aiming to assist domain experts in their first steps into a new task over a new corpus, we present an unsupervised approach to reveal complex rules which cluster the unexplored corpus by its prominent categories (or facets). These rules are human-readable, thus providing an important ingredient which has become in short supply lately - explainability. Each rule provides an explanation for the commonality of all the texts it clusters together. The experts can then identify which rules best capture texts of their categories of interest, and utilize them to deepen their understanding of these categories. These rules can also bootstrap the process of data labeling by pointing at a subset of the corpus which is enriched with texts demonstrating the target categories. We present an extensive evaluation of the usefulness of these rules in identifying target categories, as well as a user study which assesses their interpretability.

Active Learning for BERT: An Empirical Study
Liat Ein-Dor | Alon Halfon | Ariel Gera | Eyal Shnarch | Lena Dankin | Leshem Choshen | Marina Danilevsky | Ranit Aharonov | Yoav Katz | Noam Slonim
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Real world scenarios present a challenge for text classification, since labels are usually expensive and the data is often characterized by class imbalance. Active Learning (AL) is a ubiquitous paradigm to cope with data scarcity. Recently, pre-trained NLP models, and BERT in particular, are receiving massive attention due to their outstanding performance in various NLP tasks. However, the use of AL with deep pre-trained models has so far received little consideration. Here, we present a large-scale empirical study on active learning techniques for BERT-based classification, addressing a diverse set of AL strategies and datasets. We focus on practical scenarios of binary text classification, where the annotation budget is very small, and the data is often skewed. Our results demonstrate that AL can boost BERT performance, especially in the most realistic scenario in which the initial set of labeled examples is created using keyword-based queries, resulting in a biased sample of the minority class. We release our research framework, aiming to facilitate future research along the lines explored here.

Classifying Syntactic Errors in Learner Language
Leshem Choshen | Dmitry Nikolaev | Yevgeni Berzak | Omri Abend
Proceedings of the 24th Conference on Computational Natural Language Learning

We present a method for classifying syntactic errors in learner language, namely errors whose correction alters the morphosyntactic structure of a sentence. The methodology builds on the established Universal Dependencies syntactic representation scheme, and provides complementary information to other error-classification systems. Unlike existing error classification methods, our method is applicable across languages, which we showcase by producing a detailed picture of syntactic errors in learner English and learner Russian. We further demonstrate the utility of the methodology for analyzing the outputs of leading Grammatical Error Correction (GEC) systems.

2019

Learning to combine Grammatical Error Corrections
Yoav Kantor | Yoav Katz | Leshem Choshen | Edo Cohen-Karlik | Naftali Liberman | Assaf Toledo | Amir Menczel | Noam Slonim
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

The field of Grammatical Error Correction (GEC) has produced various systems to deal with focused phenomena or general text editing. We propose an automatic way to combine black-box systems. Our method automatically detects the strength of a system or the combination of several systems per error type, improving precision and recall while optimizing F-score directly. We show consistent improvement over the best standalone system in all the configurations tested. This approach also outperforms average ensembling of different RNN models with random initializations. In addition, we analyze the use of BERT for GEC, reporting promising results on this end. We also present a spellchecker created for this task which outperforms standard spellcheckers tested on the task of spellchecking. This paper describes a system submission to Building Educational Applications 2019 Shared Task: Grammatical Error Correction. Combining the output of top BEA 2019 shared task systems using our approach currently holds the highest reported score in the open phase of the BEA 2019 shared task, improving F-0.5 score by 3.7 points over the best result reported.

SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA
Daniel Hershcovich | Zohar Aizenbud | Leshem Choshen | Elior Sulem | Ari Rappoport | Omri Abend
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the SemEval 2019 shared task on Universal Conceptual Cognitive Annotation (UCCA) parsing in English, German and French, and discuss the participating systems and results. UCCA is a cross-linguistically applicable framework for semantic representation, which builds on extensive typological work and supports rapid annotation. UCCA poses a challenge for existing parsing techniques, as it exhibits reentrancy (resulting in DAG structures), discontinuous structures and non-terminal nodes corresponding to complex semantic units. The shared task has yielded improvements over the state-of-the-art baseline in all languages and settings. Full results can be found on the task’s website: https://competitions.codalab.org/competitions/19160.

The Language of Legal and Illegal Activity on the Darknet
Leshem Choshen | Dan Eldad | Daniel Hershcovich | Elior Sulem | Omri Abend
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The non-indexed parts of the Internet (the Darknet) have become a haven for both legal and illegal anonymous activity. Given the magnitude of these networks, scalably monitoring their activity necessarily relies on automated tools, and notably on NLP tools. However, little is known about what characteristics texts communicated through the Darknet have, and how well off-the-shelf NLP tools perform on this domain. This paper addresses this gap and performs an in-depth investigation of the characteristics of legal and illegal text in the Darknet, comparing it to a clear net website with similar content as a control condition. Taking drugs-related websites as a test case, we find that texts for selling legal and illegal drugs have several linguistic characteristics that distinguish them from one another, as well as from the control condition, among them the distribution of POS tags, and the coverage of their named entities in Wikipedia.

Are You Convinced? Choosing the More Convincing Evidence with a Siamese Network
Martin Gleize | Eyal Shnarch | Leshem Choshen | Lena Dankin | Guy Moshkowich | Ranit Aharonov | Noam Slonim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

With the advancement in argument detection, we suggest paying more attention to the challenging task of identifying the more convincing arguments. Machines capable of responding and interacting with humans in helpful ways have become ubiquitous. We now expect them to discuss with us the more delicate questions in our world, and they should do so armed with effective arguments. But what makes an argument more persuasive? What will convince you? In this paper, we present a new data set, IBM-EviConv, of pairs of evidence labeled for convincingness, designed to be more challenging than existing alternatives. We also propose a Siamese neural network architecture shown to outperform several baselines on both a prior convincingness data set and our own. Finally, we provide insights into our experimental results and the various kinds of argumentative value our method is capable of detecting.

Automatically Extracting Challenge Sets for Non-Local Phenomena in Neural Machine Translation
Leshem Choshen | Omri Abend
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We show that the state-of-the-art Transformer MT model is not biased towards monotonic reordering (unlike previous recurrent neural network models), but that nevertheless, long-distance dependencies remain a challenge for the model. Since most dependencies are short-distance, common evaluation metrics will be little influenced by how well systems perform on them. We therefore propose an automatic approach for extracting challenge sets rich with long-distance dependencies, and argue that evaluation using this methodology provides a complementary perspective on system performance. To support our claim, we compile challenge sets for English-German and German-English, which are much larger than any previously released challenge set for MT. The extracted sets are large enough to allow reliable automatic evaluation, which makes the proposed approach a scalable and practical solution for evaluating MT performance on the long-tail of syntactic phenomena.

2018

Will it Blend? Blending Weak and Strong Labeled Data in a Neural Network for Argumentation Mining
Eyal Shnarch | Carlos Alzate | Lena Dankin | Martin Gleize | Yufang Hou | Leshem Choshen | Ranit Aharonov | Noam Slonim
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

The process of obtaining high quality labeled data for natural language understanding tasks is often slow, error-prone, complicated and expensive. With the widespread use of neural networks, this issue becomes more acute since these networks require a large amount of labeled data to produce satisfactory results. We propose a methodology to blend high quality but scarce strong labeled data with noisy but abundant weak labeled data during the training of neural networks. Experiments in the context of topic-dependent evidence detection with two forms of weak labeled data show the advantages of the blending scheme. In addition, we provide a manually annotated data set for the task of topic-dependent evidence detection. We believe that blending weak and strong labeled data is a general notion that may be applicable to many language understanding tasks, and can especially assist researchers who wish to train a network but have a small amount of high quality labeled data for their task of interest.

Automatic Metric Validation for Grammatical Error Correction
Leshem Choshen | Omri Abend
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Metric validation in Grammatical Error Correction (GEC) is currently done by observing the correlation between human and metric-induced rankings. However, such correlation studies are costly, methodologically troublesome, and suffer from low inter-rater agreement. We propose MAEGE, an automatic methodology for GEC metric validation that overcomes many of the difficulties in the existing methodology. Experiments with MAEGE shed new light on metric quality, showing for example that the standard M² metric fares poorly on corpus-level ranking. Moreover, we use MAEGE to perform a detailed analysis of metric behavior, showing that some types of valid edits are consistently penalized by existing metrics.

Inherent Biases in Reference-based Evaluation for Grammatical Error Correction
Leshem Choshen | Omri Abend
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The prevalent use of too few references for evaluating text-to-text generation is known to bias estimates of their quality (henceforth, low coverage bias or LCB). This paper shows that overcoming LCB in Grammatical Error Correction (GEC) evaluation cannot be attained by re-scaling or by increasing the number of references in any feasible range, contrary to previous suggestions. This is due to the long-tailed distribution of valid corrections for a sentence. Concretely, we show that LCB incentivizes GEC systems to avoid correcting even when they can generate a valid correction. Consequently, existing systems obtain comparable or superior performance compared to humans, by making few but targeted changes to the input. Similar effects on Text Simplification further support our claims.

Reference-less Measure of Faithfulness for Grammatical Error Correction
Leshem Choshen | Omri Abend
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose USim, a semantic measure for Grammatical Error Correction (GEC) that measures the semantic faithfulness of the output to the source, thereby complementing existing reference-less measures (RLMs) for measuring the output’s grammaticality. USim operates by comparing the semantic symbolic structure of the source and the correction, without relying on manually-curated references. Our experiments establish the validity of USim, by showing that the semantic structures can be consistently applied to ungrammatical text, that valid corrections obtain a high USim similarity score to the source, and that invalid corrections obtain a lower score.

Co-authors

Alex Warstadt 7

Ranit Aharonov 6

Ryan Cotterell 6

Aaron Mueller 6

Yotam Perlitz 6

Adina Williams 6

Michal Shmueli-Scheuer 5

Michael Y. Hu 4

Elad Venezian 4

Chengxu Zhuang 4

Rafael Mosquera 3

Idan Szpektor 3

Ekin Akyürek 2

Jacob Andreas 2

Marina Danilevsky 2

Lucas Georges Gabriel Charpentier 2

Martin Gleize 2

Mustafa Omer Gul 2

Daniel Hershcovich 2

Taelin Karidi 2

Naftali Liberman 2

Jing Liu (刘晶, 刘璟) 2

Guy Moshkowich 2

Bhargavi Paranjabe 2

Raj Sanjay Shah 2

Aviv Slobodkin 2

Gabriel Stanovsky 2

Idris Abdulmumin 1

Abdulhamid Abubakar 1

Faisal Muhammad Adam 1

David Ifeoluwa Adelani 1

Esther Adenuga 1

Cristina Aggazzotti 1

Raia Abu Ahmad 1

Zohar Aizenbud 1

Afra Feyza Akyürek 1

Hend Al-Khalifa 1

Jesujoba Alabi 1

Vallerie Alexandra Putra 1

Reem Alqifari 1

Carlos Alzate 1

Azril Hafizi Amirudin 1

Cynthia Jayne Amol 1

Nicholas Andrews 1

David Anugraha 1

Michael Anugraha 1

Assaf Arbelle 1

Catherine Arnett 1

Shir Ashury Tahan 1

Mithil Bangera 1

Yeshil Bangera 1

Yonatan Belinkov 1

Yevgeni Berzak 1

Arianna Bisazza 1

Antoine Bosselut 1

Weerayut Buaphet 1

Bastian Bunzeck 1

Laurie Burchell 1

Jon Ander Campos 1

Hande Celikkanat 1

Thomas H. Clark 1

Thibault Clérice 1

Edo Cohen-Karlik 1

Lanwenn ar C’horr 1

Konstantin Dobler 1

Luca D’Amico-Wong 1

Dina Epelboim 1

Julen Etxaniz 1

Marzieh Fadaee 1

Evelina Fedorenko 1

Vicky Feliren 1

Enzo Ferrante 1

Elisabeth Fittschen 1

Luca Foppiano 1

Negar Foroutan 1

Clémentine Fourrier 1

Abdellah Fourtassi 1

Roni Friedman 1

Diana Galván-Sosa 1

Iker García-Ferrero 1

Dmitry Gaynullin 1

Shahriar Golchin 1

Yoav Goldberg 1

Manuel Goulão 1

Gabriel Grand 1

María Grandury 1

Tommaso Green 1

Kristjan Greenewald 1

Iryna Gurevych 1

Muhammad Ravi Shulthan Habibi 1

Nadia Ghezaiel Hammouda 1

Ikhlasul Akmal Hanif 1

Faiz Ghifari Haznitrama 1

Inshirah Idris 1

Maxim Ifergan 1

Fenal Ashokbhai Ilasariya 1

Joseph Marvin Imperial 1

Daphne Ippolito 1

Anna A. Ivanova 1

Leonid Karlinsky 1

Yannis Katsis 1

Kun Kerdthaisong 1

Chalamalasetti Kranti 1

Unnathi U. Kumar 1

Bruhan Kyomuhendo 1

Piyawat Lertvittayakumjorn 1

Philip Levin Slesarev 1

Sabrina Xin Li 1

Peerat Limkonchotiwat 1

Benjamin Lipkin 1

Tom Lippincott 1

Shayne Longpre 1

Sarah K. K. Luger 1

Jean Maillard 1

Kamohelo Makaaka 1

Mila Marcheva 1

Kelly Marchisio 1

Vukosi Marivate 1

André F. T. Martins 1

Juan Pablo Martínez 1

Craig Messner 1

Francois Meyer 1

Sara Hincapié Monsalve 1

Carol Muchemi 1

Shamsuddeen Hassan Muhammad 1

Kenton Murray 1

Ahmad Mustafid 1

Casper Rufaro Muziri 1

Gwilym Newton 1

Jian Gang Ngui 1

My Chiffon Nguyen 1

Dmitry Nikolaev 1

Melika Nobakhtian 1

Shila Ofek-Koifman 1

Pedro Ortiz Suarez 1

Malte Ostendorff 1

Verrah Akinyi Otiende 1

Francesca Padovani 1

Quentin Pagès 1

Bhavish Pahwa 1

Srikant Panda 1

Hitesh Laxmichand Patel 1

Vivian C. Paulun 1

Laurent Prévot 1

Setayesh Radkani 1

Ingrid Gabriela Franco Ramirez 1

Ari Rappoport 1

Nafisa Rashid 1

Benjamin L Rice 1

Angelika Romanou 1

Mattes Ruckdeschel 1

Sebastian Ruder 1

Daniel Ruffinelli 1

Maria Ryskina 1

Pouya Sadeghi 1

Benoît Sagot 1

Suchir Salhan 1

Luis Frentzen Salim 1

Martín Santillán Cooper 1

Jakhongir Saydaliev 1

Bhargav Shandilya 1

Suryansh Sharma 1

Dafna Sheinwald 1

Ilya Shnayderman 1

Emily Silcock 1

Shivalika Singh 1

Kateryna Solonko 1

Shashank Srivastava 1

Pavel Stepachev 1

Damian Stewart 1

Mihai Surdeanu 1

Yosephine Susanto 1

Benjamin Sznajder 1

Sotaro Takeshita 1

Joshua Tenenbaum 1

Nikitas Theodoropoulos 1

Filbert Aurelian Tjiaranata 1

Francesca Toni 1

Atnafu Lambebo Tonja 1

Yassine Toughrai 1

Vishaal Udandarao 1

Gouthami Vadithya 1

Sowmya Vajjala 1

Rob Van Der Goot 1

Daniel Vila-Suero 1

Andreas Waldis 1

Ahmad Mustapha Wali 1

Azmine Toushik Wasi 1

Daphna Weinshall 1

Derry Tanti Wijaya 1

Ethan G. Wilcox 1

Genta Indra Winata 1

Tack Hwa Wong 1

Seid Muhie Yimam 1

Alexander Yom Din 1

Venues