Marc Dymetman


2023

Should you marginalize over possible tokenizations?
Nadezhda Chirkova | Germán Kruszewski | Jos Rozen | Marc Dymetman
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of any character string (e.g. English sentences) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
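
As a rough illustration of what marginalizing over tokenizations means, the sketch below enumerates every segmentation of a short string under a hypothetical toy vocabulary and compares the summed probability with the probability of a single default tokenization. Everything here is an assumption made for illustration: the vocabulary, the unnormalized unigram scorer standing in for an autoregressive LM, and the longest-match tokenizer. Brute-force enumeration is only feasible for tiny strings; it is not the importance-sampling algorithm the paper proposes.

```python
# Hypothetical toy vocabulary with unigram token scores (illustrative only,
# not taken from the paper; a real LM would condition on previous tokens).
VOCAB = {"a": 0.2, "b": 0.2, "ab": 0.3, "ba": 0.1, "aba": 0.2}


def tokenizations(s):
    """Yield every way to split s into tokens that all belong to VOCAB."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in VOCAB:
            for rest in tokenizations(s[i:]):
                yield [head] + rest


def seq_prob(tokens):
    """Score a token sequence as a product of per-token scores."""
    p = 1.0
    for t in tokens:
        p *= VOCAB[t]
    return p


def marginal_prob(s):
    """Sum the scores of all tokenizations of s (exponential in len(s))."""
    return sum(seq_prob(toks) for toks in tokenizations(s))


def greedy_tokenize(s):
    """Longest-match-first split, standing in for a default tokenizer."""
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in VOCAB:
                tokens.append(s[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {s!r} at position {i}")
    return tokens


default_p = seq_prob(greedy_tokenize("aba"))
marginal_p = marginal_prob("aba")
print(default_p, marginal_p)
```

In this toy setting the default tokenization accounts for only part of the total mass assigned to the string; the abstract's question is how large that gap is for real models and data.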

disco: a toolkit for Distributional Control of Generative Models
Germán Kruszewski | Jos Rozen | Marc Dymetman
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Pre-trained language models and other generative models have revolutionized NLP and beyond. However, these models tend to reproduce undesirable biases present in their training data. Also, they may overlook patterns that are important but challenging to capture. To address these limitations, researchers have introduced distributional control techniques. These techniques, not limited to language, allow controlling the prevalence (i.e. expectations) of any features of interest in the model’s outputs. Despite their potential, the widespread adoption of these techniques has been hindered by the difficulty in adapting the complex, disconnected code. Here, we present disco, an open-source Python library that brings these techniques to the broader public.

2019

Machine Translation of Restaurant Reviews: New Corpus for Domain Adaptation and Robustness
Alexandre Berard | Ioan Calapodescu | Marc Dymetman | Claude Roux | Jean-Luc Meunier | Vassilina Nikoulina
Proceedings of the 3rd Workshop on Neural Generation and Translation

We share a French-English parallel corpus of Foursquare restaurant reviews, and define a new task to encourage research on Neural Machine Translation robustness and domain adaptation, in a real-world scenario where better-quality MT would be greatly beneficial. We discuss the challenges of such user-generated content, and train good baseline models that build upon the latest techniques for MT robustness. We also perform an extensive evaluation (automatic and human) that shows significant improvements over existing online systems. Finally, we propose task-specific metrics based on sentiment analysis or translation accuracy of domain-specific polysemous words.

Global Autoregressive Models for Data-Efficient Sequence Learning
Tetiana Parshakova | Jean-Marc Andreoli | Marc Dymetman
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Standard autoregressive seq2seq models are easily trained by max-likelihood, but tend to show poor results under small-data conditions. We introduce a class of seq2seq models, GAMs (Global Autoregressive Models), which combine an autoregressive component with a log-linear component, allowing the use of global a priori features to compensate for lack of data. We train these models in two steps. In the first step, we obtain an unnormalized GAM that maximizes the likelihood of the data, but is improper for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the normalized distribution associated with the GAM, and can be used for fast inference and evaluation. Our experiments focus on language modelling under synthetic conditions and show that the second autoregressive model strongly reduces perplexity relative to the standard one.

2018

Char2char Generation with Reranking for the E2E NLG Challenge
Shubham Agarwal | Marc Dymetman | Éric Gaussier
Proceedings of the 11th International Conference on Natural Language Generation

This paper describes our submission to the E2E NLG Challenge. Recently, neural seq2seq approaches have become mainstream in NLG, often resorting to pre-processing (delexicalization) and post-processing (relexicalization) steps at the word level to handle rare words. By contrast, we train a simple character-level seq2seq model, which requires no pre- or post-processing (delexicalization, tokenization or even lowercasing), with surprisingly good results. For further improvement, we explore two re-ranking approaches for scoring candidates. We also introduce a synthetic dataset creation procedure, which opens up a new way of creating artificial datasets for Natural Language Generation.

2017

A surprisingly effective out-of-the-box char2char model on the E2E NLG Challenge dataset
Shubham Agarwal | Marc Dymetman
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

We train a char2char model on the E2E NLG Challenge data by using the recently released tfseq2seq framework “out-of-the-box”, with some of the standard options offered by this tool. With minimal effort, and in particular without delexicalization, tokenization or lowercasing, the raw predictions obtained are, according to a small-scale human evaluation, excellent on the linguistic side and quite reasonable on the adequacy side, the primary downside being the possible omission of semantic material. However, in a significant number of cases (more than 70%), a perfect solution can be found in the top-20 predictions, indicating promising directions for solving the remaining issues.

2016

Natural Language Generation through Character-based RNNs with Finite-state Prior Knowledge
Raghav Goyal | Marc Dymetman | Eric Gaussier
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Recently, Wen et al. (2015) proposed a Recurrent Neural Network (RNN) approach to the generation of utterances from dialog acts, and showed that although their model requires less effort to develop than a rule-based system, it is able to improve certain aspects of the utterances, in particular their naturalness. However, their system generates at the word level, which requires one to pre-process the data by substituting named entities with placeholders. This pre-processing prevents the model from handling some contextual effects and from managing multiple occurrences of the same attribute. Our approach uses a character-level model, which, unlike the word-level model, makes it possible to learn to “copy” information from the dialog act to the target without having to pre-process the input. In order to avoid generating non-words and inventing information not present in the input, we propose a method for incorporating prior knowledge into the RNN in the form of a weighted finite-state automaton over character sequences. Automatic and human evaluations show improved performance over baselines on several evaluation criteria.

Sequence-based Structured Prediction for Semantic Parsing
Chunyang Xiao | Marc Dymetman | Claire Gardent
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Orthogonality regularizer for question answering
Chunyang Xiao | Guillaume Bouchard | Marc Dymetman | Claire Gardent
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

LSTM-Based Mixture-of-Experts for Knowledge-Aware Dialogues
Phong Le | Marc Dymetman | Jean-Michel Renders
Proceedings of the 1st Workshop on Representation Learning for NLP

2015

Adaptation par enrichissement terminologique en traduction automatique statistique fondée sur la génération et le filtrage de bi-segments virtuels
Christophe Servan | Marc Dymetman
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

We present preliminary work on an approach for adding bilingual terms to a phrase-based Statistical Machine Translation (SMT) system. The terms are included not only individually but also together with contexts that embed them. First, we generate these contexts by generalizing patterns observed for words of the same syntactic category in a bilingual corpus. Then, we filter out the contexts that fall below a given confidence threshold, using a bi-phrase selection method inspired by a data selection approach previously applied to aligned bilingual texts.

Reversibility reconsidered: finite-state factors for efficient probabilistic sampling in parsing and generation
Marc Dymetman | Sriram Venkatapathy | Chunyang Xiao
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Comparison of data selection techniques for the translation of video lectures
Joern Wuebker | Hermann Ney | Adrià Martínez-Villaronga | Adrià Giménez | Alfons Juan | Christophe Servan | Marc Dymetman | Shachar Mirkin
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

For the task of online translation of scientific video lectures, using huge models is not possible. To obtain smaller, more efficient models, we perform data selection. In this paper, we perform a qualitative and quantitative comparison of several data selection techniques, based on cross-entropy and infrequent n-gram criteria. In terms of BLEU, a combination of translation and language model cross-entropy achieves the most stable results. As another important criterion for measuring translation quality in our application, we identify the number of out-of-vocabulary words. Here, infrequent n-gram recovery shows superior performance. Finally, we combine the two selection techniques in order to benefit from both their strengths.

Exact Decoding for Phrase-Based Statistical Machine Translation
Wilker Aziz | Marc Dymetman | Lucia Specia
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

A Lightweight Terminology Verification Service for External Machine Translation Engines
Alessio Bosca | Vassilina Nikoulina | Marc Dymetman
Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics

2013

Confidence-driven Rewriting for Improved Translation
Shachar Mirkin | Sriram Venkatapathy | Marc Dymetman
Proceedings of Machine Translation Summit XIV: Posters

SORT: An Interactive Source-Rewriting Tool for Improved Translation
Shachar Mirkin | Sriram Venkatapathy | Marc Dymetman | Ioan Calapodescu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

A Convexity-based Generalization of Viterbi for Non-Deterministic Weighted Automata
Marc Dymetman
Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing

Investigations in Exact Inference for Hierarchical Translation
Wilker Aziz | Marc Dymetman | Sriram Venkatapathy
Proceedings of the Eighth Workshop on Statistical Machine Translation

2012

Exact Sampling and Decoding in High-Order Hidden Markov Models
Simon Carter | Marc Dymetman | Guillaume Bouchard
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Prediction of Learning Curves in Machine Translation
Prasanth Kolachina | Nicola Cancedda | Marc Dymetman | Sriram Venkatapathy
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Hybrid Adaptation of Named Entity Recognition for Statistical Machine Translation
Vassilina Nikoulina | Agnes Sandor | Marc Dymetman
Proceedings of the Second Workshop on Applying Machine Learning Techniques to Optimise the Division of Labour in Hybrid MT

Optimization and Sampling for NLP from a Unified Viewpoint
Marc Dymetman | Guillaume Bouchard | Simon Carter
Proceedings of the First International Workshop on Optimization Techniques for Human Language Technology

2010

Machine Translation Using Overlapping Alignments and SampleRank
Benjamin Roth | Andrew McCallum | Marc Dymetman | Nicola Cancedda
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers

We present a conditional-random-field approach to discriminatively-trained phrase-based machine translation in which training and decoding are both cast in a sampling framework and are implemented uniformly in a new probabilistic programming language for factor graphs. In traditional phrase-based translation, decoding infers both a "Viterbi" alignment and the target sentence. In contrast, in our approach, a rich overlapping-phrase alignment is produced by a fast deterministic method, while probabilistic decoding infers only the target sentence, which is then able to leverage arbitrary features of the entire source sentence, target sentence and alignment. By using SampleRank for learning we could in principle efficiently estimate hundreds of thousands of parameters. Test-time decoding is done by MCMC sampling with annealing. To demonstrate the potential of our approach we show preliminary experiments leveraging alignments that may contain overlapping bi-phrases.

Learning an Expert from Human Annotations in Statistical Machine Translation: the Case of Out-of-Vocabulary Words
Wilker Aziz | Marc Dymetman | Lucia Specia | Shachar Mirkin
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

A Dataset for Assessing Machine Translation Evaluation Metrics
Lucia Specia | Nicola Cancedda | Marc Dymetman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We describe a dataset containing 16,000 translations produced by four machine translation systems and manually annotated for quality by professional translators. This dataset can be used in a range of tasks assessing machine translation evaluation metrics, from basic correlation analysis to training and testing of machine learning-based metrics. By providing a standard dataset for such tasks, we hope to encourage the development of better MT evaluation metrics.

Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms
Marc Dymetman | Nicola Cancedda
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation

2009

Estimating the Sentence-Level Quality of Machine Translation Systems
Lucia Specia | Marco Turchi | Nicola Cancedda | Nello Cristianini | Marc Dymetman
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Sentence-level confidence estimation for MT
Lucia Specia | Nicola Cancedda | Marc Dymetman | Craig Saunders | Marco Turchi | Nello Cristianini | Zhuoran Wang | John Shawe-Taylor
Proceedings of the 13th Annual Conference of the European Association for Machine Translation

Complexity-Based Phrase-Table Filtering for Statistical Machine Translation
Nadi Tomeh | Nicola Cancedda | Marc Dymetman
Proceedings of Machine Translation Summit XII: Papers

Phrase-Based Statistical Machine Translation as a Traveling Salesman Problem
Mikhail Zaslavskiy | Marc Dymetman | Nicola Cancedda
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Source-Language Entailment Modeling for Translating Unknown Terms
Shachar Mirkin | Lucia Specia | Nicola Cancedda | Ido Dagan | Marc Dymetman | Idan Szpektor
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Using Syntactic Coupling Features for Discriminating Phrase-Based Translations (WMT-08 Shared Translation Task)
Vassilina Nikoulina | Marc Dymetman
Proceedings of the Third Workshop on Statistical Machine Translation

Experiments in Discriminating Phrase-Based Translations on the Basis of Syntactic Coupling Features
Vassilina Nikoulina | Marc Dymetman
Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2)

2005

Une approche à la traduction automatique statistique par segments discontinus
Michel Simard | Nicola Cancedda | Bruno Cavestro | Marc Dymetman | Eric Gaussier | Cyril Goutte | Philippe Langlais | Arne Mauser | Kenji Yamada
Actes de la 12ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

This article presents a statistical machine translation method based on non-contiguous phrases, that is, phrases made up of words that do not necessarily appear contiguously in the text. We propose a method for producing such phrases from word-aligned corpora. We also present a statistical translation model capable of handling such phrases, together with a parameter training method that aims to maximize the accuracy of the translations produced, as measured by the NIST metric. Optimal translations are produced by beam search. Finally, we present experimental results showing how the proposed method allows better generalization from the training data.

Translating with Non-contiguous Phrases
Michel Simard | Nicola Cancedda | Bruno Cavestro | Marc Dymetman | Eric Gaussier | Cyril Goutte | Kenji Yamada | Philippe Langlais | Arne Mauser
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2003

MDA-XML : une expérience de rédaction contrôlée multilingue basée sur XML
Guy Lapalme | Caroline Brun | Marc Dymetman
Actes de la 10ème conférence sur le Traitement Automatique des Langues Naturelles. Posters

In this article we describe the implementation of a multilingual controlled authoring system in an XML environment. With this system, an author interactively writes a text that conforms to well-formedness rules, at the levels of semantic content and linguistic realization, described by an XML schema. We discuss the advantages of this approach as well as the difficulties encountered while developing the system. We conclude with an example application to a class of pharmaceutical documents.

Controlled Authoring of Biological Experiment Reports
Caroline Brun | Marc Dymetman | Eric Fanchon | Stanislas Lhomme
Demonstrations

Towards Interactive Text Understanding
Marc Dymetman | Aurélien Max | Kenji Yamada
The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics

2002

Text Authoring, Knowledge Acquisition and Description Logics
Marc Dymetman
COLING 2002: The 19th International Conference on Computational Linguistics

2000

XML and Multilingual Document Authoring: Convergent Trends
Marc Dymetman | Veronika Lux | Aarne Ranta
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

Context-Free Grammar Rewriting and the Transfer of Packed Linguistic Representations
Marc Dymetman | Frederic Tendeau
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

Document structure and multilingual authoring
Caroline Brun | Marc Dymetman | Veronika Lux
INLG’2000 Proceedings of the First International Conference on Natural Language Generation

1998

Group Theory and Linguistic Processing
Marc Dymetman
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics

Group Theory and Linguistic Processing
Marc Dymetman
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1

1996

Extended Dependency Structures and their Formal Interpretation
Marc Dymetman | Max Copperman
COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics

1994

A Simple Transformation for Offline-Parsable Grammars and its Termination Properties
Marc Dymetman
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1993

Translation Analysis and Translation Automation
Pierre Isabelle | Marc Dymetman | George Foster | Jean-Marc Jutras | Elliott
Proceedings of the Fifth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

1992

A Generalized Greibach Normal Form for Definite Clause Grammars
Marc Dymetman
COLING 1992 Volume 1: The 14th International Conference on Computational Linguistics

1991

Inherently Reversible Grammars, Logic Programming and Computability
Marc Dymetman
Reversible Grammar in Natural Language Processing

1990

A Symmetrical Approach to Parsing and Generation
Marc Dymetman | Pierre Isabelle | Francois Perrault
COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics

1988

Reversible logic grammars for machine translation
Marc Dymetman | Pierre Isabelle
Proceedings of the Second Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

CRITTER: a translation system for agricultural market reports
Pierre Isabelle | Marc Dymetman | Elliott Macklovitch
Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics

1986

Two Approaches to Commonsense Inferencing for Discourse Analysis
Marc Dymetman
Coling 1986 Volume 1: The 11th International Conference on Computational Linguistics