Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges

Samira Shaikh, Thiago Ferreira, Amanda Stent (Editors)

Anthology ID:
Waterville, Maine, USA and virtual meeting
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges
Samira Shaikh | Thiago Ferreira | Amanda Stent

pdf bib
The Second Automatic Minuting (AutoMin) Challenge: Generating and Evaluating Minutes from Multi-Party Meetings
Tirthankar Ghosal | Marie Hledíková | Muskaan Singh | Anna Nedoluzhko | Ondřej Bojar

We would host the AutoMin generation chal- lenge at INLG 2023 as a follow-up of the first AutoMin shared task at Interspeech 2021. Our shared task primarily concerns the automated generation of meeting minutes from multi-party meeting transcripts. In our first venture, we ob- served the difficulty of the task and highlighted a number of open problems for the community to discuss, attempt, and solve. Hence, we invite the Natural Language Generation (NLG) com- munity to take part in the second iteration of AutoMin. Like the first, the second AutoMin will feature both English and Czech meetings and the core task of summarizing the manually- revised transcripts into bulleted minutes. A new challenge we are introducing this year is to devise efficient metrics for evaluating the quality of minutes. We will also host an optional track to generate minutes for European parliamentary sessions. We carefully curated the datasets for the above tasks. Our ELITR Minuting Corpus has been recently accepted to LREC 2022 and publicly released. We are already preparing a new test set for evaluating the new shared tasks. We hope to carry forward the learning from the first AutoMin and instigate more community attention and interest in this timely yet chal- lenging problem. INLG, the premier forum for the NLG community, would be an appropriate venue to discuss the challenges and future of Automatic Minuting. The main objective of the AutoMin GenChal at INLG 2023 would be to come up with efficient methods to auto- matically generate meeting minutes and design evaluation metrics to measure the quality of the minutes.

pdf bib
The Cross-lingual Conversation Summarization Challenge
Yulong Chen | Ming Zhong | Xuefeng Bai | Naihao Deng | Jing Li | Xianchao Zhu | Yue Zhang

We propose the shared task of cross-lingual conversation summarization, ConvSumX Challenge, opening new avenues for researchers to investigate solutions that integrate conversation summarization and machine translation. This task can be particularly useful due to the emergence of online meetings and conferences. We use a new benchmark, covering 2 real-world scenarios and 3 language directions, including a low-resource language, for evaluation. We hope that ConvSumX can motivate research to go beyond English and break the barrier for non-English speakers to benefit from recent advances of conversation summarization.

pdf bib
HinglishEval Generation Challenge on Quality Estimation of Synthetic Code-Mixed Text: Overview and Results
Vivek Srivastava | Mayank Singh

We hosted a shared task to investigate the factors influencing the quality of the code- mixed text generation systems. The teams experimented with two systems that gener- ate synthetic code-mixed Hinglish sentences. They also experimented with human ratings that evaluate the generation quality of the two systems. The first-of-its-kind, proposed sub- tasks, (i) quality rating prediction and (ii) an- notators’ disagreement prediction of the syn- thetic Hinglish dataset made the shared task quite popular among the multilingual research community. A total of 46 participants com- prising 23 teams from 18 institutions reg- istered for this shared task. The detailed description of the task and the leaderboard is available at

pdf bib
PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics & Language Model Embeddings To Estimate Code-Mix Quality
Prashant Kodali | Tanmay Sachan | Akshay Goindani | Anmol Goel | Naman Ahuja | Manish Shrivastava | Ponnurangam Kumaraguru

Code-Mixing is a phenomenon of mixing two or more languages in a speech event and is prevalent in multilingual societies. Given the low-resource nature of Code-Mixing, machine generation of code-mixed text is a prevalent approach for data augmentation. However, evaluating the quality of such machine gen- erated code-mixed text is an open problem. In our submission to HinglishEval, a shared- task collocated with INLG2022, we attempt to build models factors that impact the quality of synthetically generated code-mix text by pre- dicting ratings for code-mix quality. Hingli- shEval Shared Task consists of two sub-tasks - a) Quality rating prediction); b) Disagree- ment prediction. We leverage popular code- mixed metrics and embeddings of multilin- gual large language models (MLLMs) as fea- tures, and train task specific MLP regression models. Our approach could not beat the baseline results. However, for Subtask-A our team ranked a close second on F-1 and Co- hen’s Kappa Score measures and first for Mean Squared Error measure. For Subtask-B our ap- proach ranked third for F1 score, and first for Mean Squared Error measure. Code of our submission can be accessed here.

pdf bib
niksss at HinglishEval: Language-agnostic BERT-based Contextual Embeddings with Catboost for Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text
Nikhil Singh

This paper describes the system description for the HinglishEval challenge at INLG 2022. The goal of this task was to investigate the factors influencing the quality of the code- mixed text generation system. The task was divided into two subtasks, quality rating prediction and annotators’ disagreement prediction of the synthetic Hinglish dataset. We attempted to solve these tasks using sentence-level embeddings, which are obtained from mean pooling the contextualized word embeddings for all input tokens in our text. We experimented with various classifiers on top of the embeddings produced for respective tasks. Our best-performing system ranked 1st on subtask B and 3rd on subtask A. We make our code available here:

pdf bib
BITS Pilani at HinglishEval: Quality Evaluation for Code-Mixed Hinglish Text Using Transformers
Shaz Furniturewala | Vijay Kumari | Amulya Ratna Dash | Hriday Kedia | Yashvardhan Sharma

Code-Mixed text data consists of sentences having words or phrases from more than one language. Most multi-lingual communities worldwide communicate using multiple languages, with English usually one of them. Hinglish is a Code-Mixed text composed of Hindi and English but written in Roman script. This paper aims to determine the factors influencing the quality of Code-Mixed text data generated by the system. For the HinglishEval task, the proposed model uses multilingual BERT to find the similarity between synthetically generated and human-generated sentences to predict the quality of synthetically generated Hinglish sentences.

pdf bib
JU_NLP at HinglishEval: Quality Evaluation of the Low-Resource Code-Mixed Hinglish Text
Prantik Guha | Rudra Dhar | Dipankar Das

In this paper we describe a system submitted to the INLG 2022 Generation Challenge (GenChal) on Quality Evaluation of the Low-Resource Synthetically Generated Code-Mixed Hinglish Text. We implement a Bi-LSTM-based neural network model to predict the Average rating score and Disagreement score of the synthetic Hinglish dataset. In our models, we used word embeddings for English and Hindi data, and one hot encodings for Hinglish data. We achieved a F1 score of 0.11, and mean squared error of 6.0 in the average rating score prediction task. In the task of Disagreement score prediction, we achieve a F1 score of 0.18, and mean squared error of 5.0.

pdf bib
The 2022 ReproGen Shared Task on Reproducibility of Evaluations in NLG: Overview and Results
Anya Belz | Anastasia Shimorina | Maja Popović | Ehud Reiter

Against a background of growing interest in reproducibility in NLP and ML, and as part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP, we organised the second shared task on reproducibility of evaluations in NLG, ReproGen 2022. This paper describes the shared task, summarises results from the reproduction studies submitted, and provides further comparative analysis of the results. Out of six initial team registrations, we received submissions from five teams. Meta-analysis of the five reproduction studies revealed varying degrees of reproducibility, and allowed further tentative conclusions about what types of evaluation tend to have better reproducibility.

pdf bib
Two Reproductions of a Human-Assessed Comparative Evaluation of a Semantic Error Detection System
Rudali Huidrom | Ondřej Dušek | Zdeněk Kasner | Thiago Castro Ferreira | Anya Belz

In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020) in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first reproduction, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse results, finding different degrees of reproducibility depending on result type, data and labelling task. Our resources are available and open-sourced.

pdf bib
Reproducibility of Exploring Neural Text Simplification Models: A Review
Mohammad Arvan | Luís Pina | Natalie Parde

The reproducibility of NLP research has drawn increased attention over the last few years. Several tools, guidelines, and metrics have been introduced to address concerns in regard to this problem; however, much work still remains to ensure widespread adoption of effective reproducibility standards. In this work, we review the reproducibility of Exploring Neural Text Simplification Models by Nisioi et al. (2017), evaluating it from three main aspects: data, software artifacts, and automatic evaluations. We discuss the challenges and issues we faced during this process. Furthermore, we explore the adequacy of current reproducibility standards. Our code, trained models, and a docker container of the environment used for training and evaluation are made publicly available.

pdf bib
The Accuracy Evaluation Shared Task as a Retrospective Reproduction Study
Craig Thomson | Ehud Reiter

We investigate the data collected for the Accuracy Evaluation Shared Task as a retrospective reproduction study. The shared task was based upon errors found by human annotation of computer generated summaries of basketball games. Annotation was performed in three separate stages, with texts taken from the same three systems and checked for errors by the same three annotators. We show that the mean count of errors was consistent at the highest level for each experiment, with increased variance when looking at per-system and/or per-error- type breakdowns.

pdf bib
Reproducing a Manual Evaluation of the Simplicity of Text Simplification System Outputs
Maja Popović | Sheila Castilho | Rudali Huidrom | Anya Belz

In this paper we describe our reproduction study of the human evaluation of text simplic- ity reported by Nisioi et al. (2017). The work was carried out as part of the ReproGen Shared Task 2022 on Reproducibility of Evaluations in NLG. Our aim was to repeat the evaluation of simplicity for nine automatic text simplification systems with a different set of evaluators. We describe our experimental design together with the known aspects of the original experimental design and present the results from both studies. Pearson correlation between the original and reproduction scores is moderate to high (0.776). Inter-annotator agreement in the reproduction study is lower (0.40) than in the original study (0.66). We discuss challenges arising from the unavailability of certain aspects of the origi- nal set-up, and make several suggestions as to how reproduction of similar evaluations can be made easier in future.

pdf bib
A reproduction study of methods for evaluating dialogue system output: Replicating Santhanam and Shaikh (2019)
Anouck Braggaar | Frédéric Tomas | Peter Blomsma | Saar Hommes | Nadine Braun | Emiel van Miltenburg | Chris van der Lee | Martijn Goudbeek | Emiel Krahmer

In this paper, we describe our reproduction ef- fort of the paper: Towards Best Experiment Design for Evaluating Dialogue System Output by Santhanam and Shaikh (2019) for the 2022 ReproGen shared task. We aim to produce the same results, using different human evaluators, and a different implementation of the automatic metrics used in the original paper. Although overall the study posed some challenges to re- produce (e.g. difficulties with reproduction of automatic metrics and statistics), in the end we did find that the results generally replicate the findings of Santhanam and Shaikh (2019) and seem to follow similar trends.

pdf bib
DialogSum Challenge: Results of the Dialogue Summarization Shared Task
Yulong Chen | Naihao Deng | Yang Liu | Yue Zhang

We report the results of DialogSum Challenge, the shared task on summarizing real-life sce- nario dialogues at INLG 2022. Four teams participate in this shared task and three submit their system reports, exploring different meth- ods to improve the performance of dialogue summarization. Although there is a great im- provement over the baseline models regarding automatic evaluation metrics, such as ROUGE scores, we find that there is a salient gap be- tween model generated outputs and human an- notated summaries by human evaluation from multiple aspects. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluatuion met- rics are in need.

pdf bib
TCS_WITM_2022 @ DialogSum : Topic oriented Summarization using Transformer based Encoder Decoder Model
Vipul Chauhan | Prasenjeet Roy | Lipika Dey | Tushar Goel

In this paper, we present our approach to the DialogSum challenge, which was proposed as a shared task aimed to summarize dialogues from real-life scenarios. The challenge was to design a system that can generate fluent and salient summaries of a multi-turn dialogue text. Dialogue summarization has many commercial applications as it can be used to summarize conversations between customers and service agents, meeting notes, conference proceedings etc. Appropriate dialogue summarization can enhance the experience of conversing with chat- bots or personal digital assistants. We have pro- posed a topic-based abstractive summarization method, which is generated by fine-tuning PE- GASUS1, which is the state of the art abstrac- tive summary generation model. We have com- pared different types of fine-tuning approaches that can lead to different types of summaries. We found that since conversations usually veer around a topic, using topics along with the di- aloagues, helps to generate more human-like summaries. The topics in this case resemble user perspective, around which summaries are usually sought. The generated summary has been evaluated with ground truth summaries provided by the challenge owners. We use the py-rouge score and BERT-Score metrics to compare the results.

pdf bib
A Multi-Task Learning Approach for Summarization of Dialogues
Saprativa Bhattacharjee | Kartik Shinde | Tirthankar Ghosal | Asif Ekbal

We describe our multi-task learning based ap- proach for summarization of real-life dialogues as part of the DialogSum Challenge shared task at INLG 2022. Our approach intends to im- prove the main task of abstractive summariza- tion of dialogues through the auxiliary tasks of extractive summarization, novelty detection and language modeling. We conduct extensive experimentation with different combinations of tasks and compare the results. In addition, we also incorporate the topic information provided with the dataset to perform topic-aware sum- marization. We report the results of automatic evaluation of the generated summaries in terms of ROUGE and BERTScore.

pdf bib
Dialogue Summarization using BART
Conrad Lundberg | Leyre Sánchez Viñuela | Siena Biales

This paper introduces the model and settings submitted to the INLG 2022 DialogSum Chal- lenge, a shared task to generate summaries of real-life scenario dialogues between two peo- ple. In this paper, we explored using interme- diate task transfer learning, reported speech, and the use of a supplementary dataset in addi- tion to our base fine-tuned BART model. How- ever, we did not use such a method in our final model, as none improved our results. Our final model for this dialogue task achieved scores only slightly below the top submission, with hidden test set scores of 49.62, 24.98, 46.25 and 91.54 for ROUGE-1, ROUGE-2, ROUGE-L and BERTSCORE respectively. The top submitted models will also receive human evaluation.