Shachar Mirkin


2022

Emergent Structures and Training Dynamics in Large Language Models
Ryan Teehan | Miruna Clinciu | Oleg Serikov | Eliza Szczechla | Natasha Seelam | Shachar Mirkin | Aaron Gokaslan
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models

Large language models have achieved success on a number of downstream tasks, particularly in few-shot and zero-shot settings. As a consequence, researchers have been investigating both the kind of information these networks learn and how such information can be encoded in the parameters of the model. We survey the literature on changes in the network during training, drawing from work outside of NLP when necessary, and on learned representations of linguistic features in large language models. We note in particular the lack of sufficient research on the emergence of functional units (subsections of the network where related functions are grouped or organised) within large language models, and motivate future work that grounds the study of language models in an analysis of their changing internal structure during training.

2019

A Dataset of General-Purpose Rebuttal
Matan Orbach | Yonatan Bilu | Ariel Gera | Yoav Kantor | Lena Dankin | Tamar Lavee | Lili Kotlerman | Shachar Mirkin | Michal Jacovi | Ranit Aharonov | Noam Slonim
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In Natural Language Understanding, the task of response generation is usually focused on responses to short texts, such as tweets or a turn in a dialog. Here we present a novel task of producing a critical response to a long argumentative text, and suggest a method based on general rebuttal arguments to address it. We do this in the context of the recently suggested task of listening comprehension over argumentative content: given a speech on some specified topic, and a list of relevant arguments, the goal is to determine which of the arguments appear in the speech. The general rebuttals we describe here (in English) remove the need for topic-specific arguments to be provided, by proving applicable to a large set of topics. This makes it possible to create responses beyond the scope of topics for which specific arguments are available. All data collected during this work is freely available for research.

2018

A Recorded Debating Dataset
Shachar Mirkin | Michal Jacovi | Tamar Lavee | Hong-Kwang Kuo | Samuel Thomas | Leslie Sager | Lili Kotlerman | Elad Venezian | Noam Slonim
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Listening Comprehension over Argumentative Content
Shachar Mirkin | Guy Moshkowich | Matan Orbach | Lili Kotlerman | Yoav Kantor | Tamar Lavee | Michal Jacovi | Yonatan Bilu | Ranit Aharonov | Noam Slonim
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper presents a task for machine listening comprehension in the argumentation domain and a corresponding dataset in English. We recorded 200 spontaneous speeches arguing for or against 50 controversial topics. For each speech, we formulated a question aimed at confirming or rejecting the occurrence of potential arguments in the speech. Labels were collected by listening to the speech and marking which arguments were mentioned by the speaker. We applied baseline methods to the task, to serve as a benchmark for future work on this dataset. All data used in this work is freely available for research.
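
The abstract does not spell out the baselines; purely as an illustration of the task's input-output contract, here is a minimal sketch that scores each candidate argument against the speech transcript by content-word overlap (the threshold and stop-word list are arbitrary assumptions, not the paper's method):

```python
def score_argument(argument: str, transcript: str) -> float:
    """Fraction of the argument's content words that also appear in the transcript."""
    stop = {"the", "a", "an", "of", "to", "is", "are", "and", "or", "in", "that"}
    arg_tokens = {w for w in argument.lower().split() if w not in stop}
    transcript_tokens = set(transcript.lower().split())
    if not arg_tokens:
        return 0.0
    return len(arg_tokens & transcript_tokens) / len(arg_tokens)

def mentioned_arguments(arguments, transcript, threshold=0.5):
    """Return the arguments judged to have been mentioned in the speech."""
    return [a for a in arguments if score_argument(a, transcript) >= threshold]
```

A real baseline would operate over ASR output of the recorded speech and would likely require semantic matching rather than exact token overlap.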

2017

Personalized Machine Translation: Preserving Original Author Traits
Ella Rabinovich | Raj Nath Patel | Shachar Mirkin | Lucia Specia | Shuly Wintner
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that the author’s gender has a powerful, clear signal in original texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming translation quality, thereby creating more personalized machine translation systems.
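
As an illustration of how such a signal can be probed (a generic setup, not necessarily the authors' experimental design), one can compare the cross-validated accuracy of a bag-of-words gender classifier on original texts versus their translations; a drop toward chance accuracy on translations indicates the signal is being obfuscated:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def gender_signal_strength(texts, labels, folds=5):
    """Mean cross-validated accuracy of a bag-of-words author-gender classifier.

    Accuracy well above chance on originals, but near chance on their
    translations, suggests translation obfuscates the gender signal.
    """
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    return cross_val_score(clf, texts, labels, cv=folds).mean()
```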

2015

Motivating Personality-aware Machine Translation
Shachar Mirkin | Scott Nowson | Caroline Brun | Julien Perez
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Personalized Machine Translation: Predicting Translational Preferences
Shachar Mirkin | Jean-Luc Meunier
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Incrementally Updating the SMT Reordering Model
Shachar Mirkin
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

Text Summarization through Entailment-based Minimum Vertex Cover
Anand Gupta | Manpreet Kaur | Shachar Mirkin | Adarsh Singh | Aseem Goyal
Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014)

Comparison of data selection techniques for the translation of video lectures
Joern Wuebker | Hermann Ney | Adrià Martínez-Villaronga | Adrià Giménez | Alfons Juan | Christophe Servan | Marc Dymetman | Shachar Mirkin
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

For the task of online translation of scientific video lectures, using huge models is not feasible. To obtain smaller, more efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from the strengths of both.
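
The cross-entropy criterion is in the spirit of cross-entropy-difference (Moore-Lewis style) selection; a minimal sketch under that assumption, with add-one-smoothed unigram language models standing in for the real n-gram models used in practice:

```python
import math
from collections import Counter

def cross_entropy(sentence, model, total, vocab):
    """Per-token cross-entropy under an add-one smoothed unigram model."""
    tokens = sentence.split()
    logprob = sum(math.log((model[t] + 1) / (total + vocab)) for t in tokens)
    return -logprob / max(len(tokens), 1)

def select_by_ce_difference(pool, in_domain, general, k):
    """Keep the k pool sentences with the lowest H_in(s) - H_gen(s),
    i.e. those that look most in-domain and least general-domain."""
    vocab = len({w for s in in_domain + general for w in s.split()})
    m_in = Counter(w for s in in_domain for w in s.split())
    m_gen = Counter(w for s in general for w in s.split())
    t_in, t_gen = sum(m_in.values()), sum(m_gen.values())
    ranked = sorted(pool, key=lambda s: cross_entropy(s, m_in, t_in, vocab)
                                      - cross_entropy(s, m_gen, t_gen, vocab))
    return ranked[:k]
```

For bilingual selection, the same difference can be summed over source-side and target-side models, which is one way to combine translation- and language-model cross-entropy scores.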

Data selection for compact adapted SMT models
Shachar Mirkin | Laurent Besacier
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

Data selection is a common technique for adapting statistical translation models to a specific domain, which has been shown both to improve translation quality and to reduce model size. Selection relies on in-domain data from the same domain as the texts expected to be translated. Selecting the sentence pairs that are most similar to the in-domain data from a pool of parallel texts has been shown to be effective; yet, this approach risks limited coverage when necessary n-grams that do appear in the pool are less similar to the in-domain data available in advance. Some methods select additional data based on the actual text that needs to be translated. While useful, this is not always a practical scenario. In this work we describe an extensive exploration of data selection techniques on Arabic-to-French datasets, and propose methods to address both similarity and coverage considerations while maintaining a limited model size.
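
Similarity-based ranking (as in the cross-entropy sketch above) handles the first consideration; coverage can be addressed by greedily adding pool sentences that recover n-grams still missing from the selection. A minimal sketch of this idea (not the paper's exact algorithm, with arbitrary defaults):

```python
from collections import Counter

def ngrams(sentence, n=2):
    toks = sentence.split()
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def greedy_coverage_selection(pool, needed_counts, k, n=2):
    """Greedily pick up to k sentences that recover the most still-needed n-grams.

    `needed_counts` maps n-grams (e.g. from the in-domain data) to how many
    more occurrences we would like to see in the selected data.
    """
    needed = Counter(needed_counts)
    selected, candidates = [], set(range(len(pool)))
    for _ in range(k):
        best, best_gain = None, 0
        for i in candidates:
            gain = sum(min(c, needed[g])
                       for g, c in Counter(ngrams(pool[i], n)).items())
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # no remaining sentence recovers any needed n-gram
            break
        selected.append(pool[best])
        candidates.remove(best)
        for g in ngrams(pool[best], n):
            if needed[g] > 0:
                needed[g] -= 1
    return selected
```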

2013

SORT: An Interactive Source-Rewriting Tool for Improved Translation
Shachar Mirkin | Sriram Venkatapathy | Marc Dymetman | Ioan Calapodescu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Assessing quick update methods of statistical translation models
Shachar Mirkin | Nicola Cancedda
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

The ability to quickly incorporate incoming training data into a running translation system is critical in a number of applications. Mechanisms based on incremental model update and the online EM algorithm hold the promise of achieving this objective in a principled way. Still, efficient tools for incremental training are not yet available. In this paper we experiment with simple alternative solutions for interim model updates, within the popular Moses system. Short of updating the model in real time, such updates can execute in short timeframes even when operating on large models, and achieve a performance level close to, and in some cases exceeding, that of batch retraining.
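
As an illustration of the general idea behind interim updates (a hypothetical sketch, not Moses' actual data structures or file formats), one can merge phrase-pair counts extracted from newly aligned data into the existing counts and recompute relative-frequency translation probabilities, avoiding a full retraining run:

```python
from collections import defaultdict

def merge_counts(base_counts, new_counts):
    """Add phrase-pair counts from newly aligned data into the base model's counts.

    Counts are dicts mapping (source_phrase, target_phrase) -> count.
    """
    merged = defaultdict(float, base_counts)
    for pair, count in new_counts.items():
        merged[pair] += count
    return dict(merged)

def translation_probs(counts):
    """Recompute relative-frequency estimates p(target | source) from merged counts."""
    source_totals = defaultdict(float)
    for (src, _), count in counts.items():
        source_totals[src] += count
    return {(src, tgt): count / source_totals[src]
            for (src, tgt), count in counts.items()}
```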

Confidence-driven Rewriting for Improved Translation
Shachar Mirkin | Sriram Venkatapathy | Marc Dymetman
Proceedings of Machine Translation Summit XIV: Posters

2012

An SMT-driven Authoring Tool
Sriram Venkatapathy | Shachar Mirkin
Proceedings of COLING 2012: Demonstration Papers

2011

Classification-based Contextual Preferences
Shachar Mirkin | Ido Dagan | Lili Kotlerman | Idan Szpektor
Proceedings of the TextInfer 2011 Workshop on Textual Entailment

2010

Assessing the Role of Discourse References in Entailment Inference
Shachar Mirkin | Ido Dagan | Sebastian Padó
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

Learning an Expert from Human Annotations in Statistical Machine Translation: the Case of Out-of-Vocabulary Words
Wilker Aziz | Marc Dymetman | Lucia Specia | Shachar Mirkin
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

Recognising Entailment within Discourse
Shachar Mirkin | Jonathan Berant | Ido Dagan | Eyal Shnarch
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

A Resource for Investigating the Impact of Anaphora and Coreference on Inference
Azad Abad | Luisa Bentivogli | Ido Dagan | Danilo Giampiccolo | Shachar Mirkin | Emanuele Pianta | Asher Stern
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Discourse phenomena play a major role in text processing tasks. However, relatively little study has so far been devoted to the relevance of discourse phenomena for inference. Therefore, an experimental study was carried out to assess the relevance of anaphora and coreference for Textual Entailment (TE), a prominent inference framework. First, the annotation of anaphoric and coreferential links in the RTE-5 Search data set was performed according to a specifically designed annotation scheme. As a result, a new data set was created where all anaphora and coreference instances in the entailing sentences which are relevant to the entailment judgment are resolved and annotated. A by-product of the annotation is a new “augmented” data set, where all the referring expressions which need to be resolved in the entailing sentences are replaced by explicit expressions. Starting from the final output of the annotation, the actual impact of discourse phenomena on inference engines was investigated, identifying the kinds of operations that systems need to apply to address discourse phenomena and trying to find direct mappings between these operations and annotation types.

2009

Evaluating the Inferential Utility of Lexical-Semantic Resources
Shachar Mirkin | Ido Dagan | Eyal Shnarch
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Source-Language Entailment Modeling for Translating Unknown Terms
Shachar Mirkin | Lucia Specia | Nicola Cancedda | Ido Dagan | Marc Dymetman | Idan Szpektor
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2006

Integrating Pattern-Based and Distributional Similarity Methods for Lexical Entailment Acquisition
Shachar Mirkin | Ido Dagan | Maayan Geffet
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions