Shachar Mirkin


2022

Emergent Structures and Training Dynamics in Large Language Models
Ryan Teehan | Miruna Clinciu | Oleg Serikov | Eliza Szczechla | Natasha Seelam | Shachar Mirkin | Aaron Gokaslan
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

Large language models have achieved success on a number of downstream tasks, particularly in a few- and zero-shot manner. As a consequence, researchers have been investigating both the kind of information these networks learn and how such information can be encoded in the parameters of the model. We survey the literature on changes in the network during training, drawing from work outside of NLP when necessary, and on learned representations of linguistic features in large language models. We note in particular the lack of sufficient research on the emergence of functional units, subsections of the network where related functions are grouped or organised, within large language models and motivate future work that grounds the study of language models in an analysis of their changing internal structure during training time.

2019

A Dataset of General-Purpose Rebuttal
Matan Orbach | Yonatan Bilu | Ariel Gera | Yoav Kantor | Lena Dankin | Tamar Lavee | Lili Kotlerman | Shachar Mirkin | Michal Jacovi | Ranit Aharonov | Noam Slonim
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In Natural Language Understanding, the task of response generation is usually focused on responses to short texts, such as tweets or a turn in a dialog. Here we present a novel task of producing a critical response to a long argumentative text, and suggest a method based on general rebuttal arguments to address it. We do this in the context of the recently-suggested task of listening comprehension over argumentative content: given a speech on some specified topic, and a list of relevant arguments, the goal is to determine which of the arguments appear in the speech. The general rebuttals we describe here (in English) overcome the need for topic-specific arguments to be provided, by proving to be applicable for a large set of topics. This allows creating responses beyond the scope of topics for which specific arguments are available. All data collected during this work is freely available for research.

2018

A Recorded Debating Dataset
Shachar Mirkin | Michal Jacovi | Tamar Lavee | Hong-Kwang Kuo | Samuel Thomas | Leslie Sager | Lili Kotlerman | Elad Venezian | Noam Slonim
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Listening Comprehension over Argumentative Content
Shachar Mirkin | Guy Moshkowich | Matan Orbach | Lili Kotlerman | Yoav Kantor | Tamar Lavee | Michal Jacovi | Yonatan Bilu | Ranit Aharonov | Noam Slonim
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper presents a task for machine listening comprehension in the argumentation domain and a corresponding dataset in English. We recorded 200 spontaneous speeches arguing for or against 50 controversial topics. For each speech, we formulated a question, aimed at confirming or rejecting the occurrence of potential arguments in the speech. Labels were collected by listening to the speech and marking which arguments were mentioned by the speaker. We applied baseline methods addressing the task, to be used as a benchmark for future work over this dataset. All data used in this work is freely available for research.

2017

Personalized Machine Translation: Preserving Original Author Traits
Ella Rabinovich | Raj Nath Patel | Shachar Mirkin | Lucia Specia | Shuly Wintner
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that author’s gender has a powerful, clear signal in original texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming the quality of the translation, thereby creating more personalized machine translation systems.

2015

Motivating Personality-aware Machine Translation
Shachar Mirkin | Scott Nowson | Caroline Brun | Julien Perez
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Personalized Machine Translation: Predicting Translational Preferences
Shachar Mirkin | Jean-Luc Meunier
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Incrementally Updating the SMT Reordering Model
Shachar Mirkin
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

Comparison of data selection techniques for the translation of video lectures
Joern Wuebker | Hermann Ney | Adrià Martínez-Villaronga | Adrià Giménez | Alfons Juan | Christophe Servan | Marc Dymetman | Shachar Mirkin
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

For the task of online translation of scientific video lectures, using huge models is not possible. In order to get smaller and efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.

Data selection for compact adapted SMT models
Shachar Mirkin | Laurent Besacier
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

Data selection is a common technique for adapting statistical translation models for a specific domain, which has been shown to both improve translation quality and to reduce model size. Selection relies on some in-domain data, of the same domain of the texts expected to be translated. Selecting the sentence-pairs that are most similar to the in-domain data from a pool of parallel texts has been shown to be effective; yet, this approach holds the risk of resulting in a limited coverage, when necessary n-grams that do appear in the pool are less similar to in-domain data that is available in advance. Some methods select additional data based on the actual text that needs to be translated. While useful, this is not always a practical scenario. In this work we describe an extensive exploration of data selection techniques over Arabic to French datasets, and propose methods to address both similarity and coverage considerations while maintaining a limited model size.

Text Summarization through Entailment-based Minimum Vertex Cover
Anand Gupta | Manpreet Kaur | Shachar Mirkin | Adarsh Singh | Aseem Goyal
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

2013

Confidence-driven Rewriting for Improved Translation
Shachar Mirkin | Sriram Venkatapathy | Marc Dymetman
Proceedings of Machine Translation Summit XIV: Posters

SORT: An Interactive Source-Rewriting Tool for Improved Translation
Shachar Mirkin | Sriram Venkatapathy | Marc Dymetman | Ioan Calapodescu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Assessing quick update methods of statistical translation models
Shachar Mirkin | Nicola Cancedda
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

The ability to quickly incorporate incoming training data into a running translation system is critical in a number of applications. Mechanisms based on incremental model update and the online EM algorithm hold the promise of achieving this objective in a principled way. Still, efficient tools for incremental training are yet to be available. In this paper we experiment with simple alternative solutions for interim model updates, within the popular Moses system. Short of updating the model in real time, such updates can execute in short timeframes even when operating on large models, and achieve a performance level close to, and in some cases exceeding, that of batch retraining.

2012

An SMT-driven Authoring Tool
Sriram Venkatapathy | Shachar Mirkin
Proceedings of COLING 2012: Demonstration Papers

2011

Classification-based Contextual Preferences
Shachar Mirkin | Ido Dagan | Lili Kotlerman | Idan Szpektor
Proceedings of the TextInfer 2011 Workshop on Textual Entailment

2010

Learning an Expert from Human Annotations in Statistical Machine Translation: the Case of Out-of-Vocabulary Words
Wilker Aziz | Marc Dymetman | Lucia Specia | Shachar Mirkin
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

A Resource for Investigating the Impact of Anaphora and Coreference on Inference.
Azad Abad | Luisa Bentivogli | Ido Dagan | Danilo Giampiccolo | Shachar Mirkin | Emanuele Pianta | Asher Stern
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Discourse phenomena play a major role in text processing tasks. However, so far relatively little study has been devoted to the relevance of discourse phenomena for inference. Therefore, an experimental study was carried out to assess the relevance of anaphora and coreference for Textual Entailment (TE), a prominent inference framework. First, the annotation of anaphoric and coreferential links in the RTE-5 Search data set was performed according to a specifically designed annotation scheme. As a result, a new data set was created where all anaphora and coreference instances in the entailing sentences which are relevant to the entailment judgment are solved and annotated. A by-product of the annotation is a new “augmented” data set, where all the referring expressions which need to be resolved in the entailing sentences are replaced by explicit expressions. Starting from the final output of the annotation, the actual impact of discourse phenomena on inference engines was investigated, identifying the kind of operations that the systems need to apply to address discourse phenomena and trying to find direct mappings between these operations and annotation types.

Assessing the Role of Discourse References in Entailment Inference
Shachar Mirkin | Ido Dagan | Sebastian Padó
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Recognising Entailment within Discourse
Shachar Mirkin | Jonathan Berant | Ido Dagan | Eyal Shnarch
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

Evaluating the Inferential Utility of Lexical-Semantic Resources
Shachar Mirkin | Ido Dagan | Eyal Shnarch
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Source-Language Entailment Modeling for Translating Unknown Terms
Shachar Mirkin | Lucia Specia | Nicola Cancedda | Ido Dagan | Marc Dymetman | Idan Szpektor
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2006

Integrating Pattern-Based and Distributional Similarity Methods for Lexical Entailment Acquisition
Shachar Mirkin | Ido Dagan | Maayan Geffet
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions