Zdeněk Kasner


2022

pdf bib
Neural Pipeline for Zero-Shot Data-to-Text Generation
Zdeněk Kasner | Ondrej Dusek
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In data-to-text (D2T) generation, training on in-domain data leads to overfitting to the data representation and repeating training data noise. We examine how to avoid finetuning pretrained language models (PLMs) on D2T generation datasets while still taking advantage of surface realization capabilities of PLMs. Inspired by pipeline approaches, we propose to generate text by transforming single-item descriptions with a sequence of modules trained on general-domain text-based operations: ordering, aggregation, and paragraph compression. We train PLMs for performing these operations on a synthetic corpus WikiFluent which we build from English Wikipedia. Our experiments on two major triple-to-text datasets—WebNLG and E2E—show that our approach enables D2T generation from RDF triples in zero-shot settings.

2021

pdf bib
Text-in-Context: Token-Level Error Detection for Table-to-Text Generation
Zdeněk Kasner | Simon Mille | Ondřej Dušek
Proceedings of the 14th International Conference on Natural Language Generation

We present our Charles-UPF submission for the Shared Task on Evaluating Accuracy in Generated Texts at INLG 2021. Our system can detect the errors automatically using a combination of a rule-based natural language generation (NLG) system and pretrained language models (LMs). We first utilize a rule-based NLG system to generate sentences with facts that can be derived from the input. For each sentence we evaluate, we select a subset of facts which are relevant by measuring semantic similarity to the sentence in question. Finally, we finetune a pretrained language model on annotated data along with the relevant facts for fine-grained error detection. On the test set, we achieve 69% recall and 75% precision with a model trained on a mixture of human-annotated and synthetic data.

2020

pdf bib
Data-to-Text Generation with Iterative Text Editing
Zdeněk Kasner | Ondřej Dušek
Proceedings of the 13th International Conference on Natural Language Generation

We present a novel approach to data-to-text generation based on iterative text editing. Our approach maximizes the completeness and semantic accuracy of the output text while leveraging the abilities of recent pre-trained models for text editing (LaserTagger) and language modeling (GPT-2) to improve the text fluency. To this end, we first transform data items to text using trivial templates, and then we iteratively improve the resulting text by a neural model trained for the sentence fusion task. The output of the model is filtered by a simple heuristic and reranked with an off-the-shelf pre-trained language model. We evaluate our approach on two major data-to-text datasets (WebNLG, Cleaned E2E) and analyze its caveats and benefits. Furthermore, we show that our formulation of data-to-text generation opens up the possibility for zero-shot domain adaptation using a general-domain dataset for sentence fusion.

pdf bib
Evaluating Semantic Accuracy of Data-to-Text Generation with Natural Language Inference
Ondřej Dušek | Zdeněk Kasner
Proceedings of the 13th International Conference on Natural Language Generation

A major challenge in evaluating data-to-text (D2T) generation is measuring the semantic accuracy of the generated text, i.e. checking if the output text contains all and only facts supported by the input data. We propose a new metric for evaluating the semantic accuracy of D2T generation based on a neural model pretrained for natural language inference (NLI). We use the NLI model to check textual entailment between the input data and the output text in both directions, allowing us to reveal omissions or hallucinations. Input data are converted to text for NLI using trivial templates. Our experiments on two recent D2T datasets show that our metric can achieve high accuracy in identifying erroneous system outputs.

pdf bib
Train Hard, Finetune Easy: Multilingual Denoising for RDF-to-Text Generation
Zdeněk Kasner | Ondřej Dušek
Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+)

We describe our system for the RDF-to-text generation task of the WebNLG Challenge 2020. We base our approach on the mBART model, which is pre-trained for multilingual denoising. This allows us to use a simple, identical, end-to-end setup for both English and Russian. Requiring minimal taskor languagespecific effort, our model placed in the first third of the leaderboard for English and first or second for Russian on automatic metrics, and it made it into the best or second-best system cluster on human evaluation.

pdf bib
Expand and Filter: CUNI and LMU Systems for the WNGT 2020 Duolingo Shared Task
Jindřich Libovický | Zdeněk Kasner | Jindřich Helcl | Ondřej Dušek
Proceedings of the Fourth Workshop on Neural Generation and Translation

We present our submission to the Simultaneous Translation And Paraphrase for Language Education (STAPLE) challenge. We used a standard Transformer model for translation, with a crosslingual classifier predicting correct translations on the output n-best list. To increase the diversity of the outputs, we used additional data to train the translation model, and we trained a paraphrasing model based on the Levenshtein Transformer architecture to generate further synonymous translations. The paraphrasing results were again filtered using our classifier. While the use of additional data and our classifier filter were able to improve results, the paraphrasing model produced too many invalid outputs to further improve the output quality. Our model without the paraphrasing component finished in the middle of the field for the shared task, improving over the best baseline by a margin of 10-22 % weighted F1 absolute.