Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

Simon Mille (Editor)

Anthology ID:
Prague, Czechia
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges
Simon Mille

pdf bib
LOWRECORP: the Low-Resource NLG Corpus Building Challenge
Khyathi Raghavi Chandu | David M. Howcroft | Dimitra Gkatzia | Yi-Ling Chung | Yufang Hou | Chris Chinenye Emezue | Pawan Rajpoot | Tosin Adewumi

Most languages in the world do not have sufficient data available to develop neural-network-based natural language generation (NLG) systems. To alleviate this resource scarcity, we propose a novel challenge for the NLG community: low-resource language corpus development (LOWRECORP). We present an innovative framework to collect a single dataset with dual tasks to maximize the efficiency of data collection efforts and respect language consultant time. Specifically, we focus on a text-chat-based interface for two generation tasks – conversational response generation grounded in a source document and/or image and dialogue summarization (from the former task). The goal of this shared task is to collectively develop grounded datasets for local and low-resourced languages. To enable data collection, we make available web-based software that can be used to collect these grounded conversations and summaries. Submissions will be assessed for the size, complexity, and diversity of the corpora to ensure quality control of the datasets as well as any enhancements to the interface or novel approaches to grounding conversations.

pdf bib
Long Story Generation Challenge
Nikolay Mikhaylovskiy

We propose a shared task of human-like long story generation, LSG Challenge, that asks models to output a consistent human-like long story (a Harry Potter generic audience fanfic in English), given a prompt of about 1K tokens. We suggest a novel statistical metric of the text structuredness, GloVe Autocorrelations Power/ Exponential Law Mean Absolute Percentage Error Ratio (GAPELMAPER) and the use of previously-known UNION metric and a human evaluation protocol. We hope that LSG can open new avenues for researchers to investigate sampling approaches, prompting strategies, autoregressive and non-autoregressive text generation architectures and break the barrier to generate consistent long (40K+ word) texts.

pdf bib
Visually Grounded Story Generation Challenge
Xudong Hong | Khushboo Mehra | Asad Sayeed | Vera Demberg

Recent large pre-trained models have achieved strong performance in multimodal language generation, which requires a joint effort of vision and language modeling. However, most previous generation tasks are based on single image input and produce short text descriptions that are not grounded on the input images. In this work, we propose a shared task on visually grounded story generation. The input is an image sequence, and the output is a story that is conditioned on the input images. This task is particularly challenging because: 1) the protagonists in the generated stories need to be grounded in the images and 2) the output story should be a coherent long-form text. We aim to advance the study of vision-based story generation by accepting submissions that propose new methods as well as new evaluation measures.

pdf bib
The VDG Challenge: Response Generation and Evaluation in Collaborative Visual Dialogue
Nikolai Ilinykh | Simon Dobnik

We propose the VDG Challenge: a shared task that addresses and benchmarks the task of utterance generation in collaborative visual dialogue. The task features two challenging datasets, an evaluation protocol and a tentative schedule. Our shared task will allow researchers to unravel problems of modelling multi-modal interaction and fit of the existing approaches in the NLP and NLG communities.

pdf bib
Identifying Feedback Types to Augment Feedback Comment Generation
Maja Stahl | Henning Wachsmuth

In the context of language learning, feedback comment generation is the task of generating hints or explanatory notes for learner texts that help understand why a part of text is erroneous. This paper presents our approach to the Feedback Comment Generation Shared Task, collocated with the 16th International Natural Language Generation Conference (INLG 2023). The approach augments the generation of feedback comments by a self-supervised identification of feedback types in a multitasklearning setting. Within the shared task, other approaches performed more effective, yet the combined modeling of feedback type classification and feedback comment generation is superior to performing eedback generation only.

pdf bib
Error syntax aware augmentation of feedback comment generation dataset
Nikolay Babakov | Maria Lysyuk | Alexander Shvets | Lilya Kazakova | Alexander Panchenko

This paper presents a solution to the GenChal 2022 shared task dedicated to feedback comment generation for writing learning. In terms of this task given a text with an error and a span of the error, a system generates an explanatory note that helps the writer (language learner) to improve their writing skills. Our solution is based on fine-tuning the T5 model on the initial dataset augmented according to syntactical dependencies of the words located within indicated error span. The solution of our team ‘nigula’ obtained second place according to manual evaluation by the organizers.

pdf bib
A Report on FCG GenChal 2022: Shared Task on Feedback Comment Generation for Language Learners
Ryo Nagata | Masato Hagiwara | Kazuaki Hanawa | Masato Mita

We report on the results of the first ever shared task on feedback comment generation for language learners held as Generation Challenge (GenChal) in INLG 2022, which we call FCG GenChal. Feedback comment generation for language learners is a task where, given a text and a span, a system generates, for the span, an explanatory note that helps the writer (language learner) improve their writing skills. We show how well we can generate feedback comments with present techniques. We also shed light on the task properties and the difficulties in this task, with insights into the task including data development, evaluation, and comparisons of generation systems.

pdf bib
Sentence-level Feedback Generation for English Language Learners: Does Data Augmentation Help?
Shabnam Behzad | Amir Zeldes | Nathan Schneider

In this paper, we present strong baselines for the task of Feedback Comment Generation for Writing Learning. Given a sentence and an error span, the task is to generate a feedback comment explaining the error. Sentences and feedback comments are both in English. We experiment with LLMs and also create multiple pseudo datasets for the task, investigating how it affects the performance of our system. We present our results for the task along with extensive analysis of the generated comments with the aim of aiding future studies in feedback comment generation for English language learners.

pdf bib
Retrieval, Masking, and Generation: Feedback Comment Generation using Masked Comment Examples
Mana Ihori | Hiroshi Sato | Tomohiro Tanaka | Ryo Masumura

In this paper, we propose a novel method, retrieval, masking, and generation, for feedback comment generation. Feedback comment generation is a task in which a system generates feedback comments such as hints or explanatory notes for language learners, given input text and position showing where to comment. In the conventional study, the retrieve-and-edit method for retrieving feedback comments in the data pool and editing the comments has been thought effective for this task. However, the performance of this method does not perform as well as other conventional methods because its model learns to edit tokens that do not need to be rewritten in the retrieved comments. To mitigate this problem, we propose a method for combining retrieval, masking, and generation based on the retrieve-and-edit method. Specifically, tokens of feedback comments retrieved from the data pool are masked, and this masked feedback comment is used as a template to generate feedback comments. The proposed method should prevent unnecessary conversion by using not retrieved feedback comments directly but masking them. Our experiments on feedback comment generation demonstrate that the proposed method outperforms conventional methods.

pdf bib
TMU Feedback Comment Generation System Using Pretrained Sequence-to-Sequence Language Models
Naoya Ueda | Mamoru Komachi

In this paper, we introduce our Tokyo Metropolitan University Feedback Comment Generation system submitted to the feedback comment generation task for INLG 2023 Generation Challenge. In this task, a source sentence and offset range of preposition uses are given as the input. Then, a system generates hints or explanatory notes about preposition uses as the output. To tackle this generation task, we finetuned pretrained sequence-to-sequence language models. The models using BART and T5 showed significant improvement in BLEU score, demonstrating the effectiveness of the pretrained sequence-to-sequence language models in this task. We found that using part-of-speech tag information as an auxiliary input improves the generation quality of feedback comments. Furthermore, we adopt a simple postprocessing method that can enhance the reliability of the generation. As a result, our system achieved the F1 score of 47.4 points in BLEU-based evaluation and 60.9 points in manual evaluation, which ranked second and third on the leaderboard.

pdf bib
The Tokyo Tech and AIST System at the GenChal 2022 Shared Task on Feedback Comment Generation
Shota Koyama | Hiroya Takamura | Naoaki Okazaki

This paper describes the Tokyo Tech and AIST system in the GenChal 2022 shared task, which is the first shared task of feedback comment generation. We adopted five methods: data cleaning, fine-tuning pre-trained models, correcting errors in learners’ sentences, appending a correcting operation, and filtering out irrelevant outputs. Our system achieved F1 = 43.4 on the test dataset.

pdf bib
Feedback comment generation using predicted grammatical terms
Kunitaka Jimichi | Kotaro Funakoshi | Manabu Okumura

The purpose of feedback comment generation is to provide useful feedback comments for a wide range of errors in learners’ essays from a language learning perspective. Since it is difficult to obtain appropriate comments at a practical level with rule-based or retrieval- based methods, we explore neural-based gen- erative methods with pre-trained models. We further assume the effectiveness of consider- ing grammatical terms in generating feedback comments. Specifically, this paper proposes T5-based models using predicted grammati- cal terms, submitted to FCG GenChal, and presents their results. By using correct gram- matical terms, our model could improve the BLEU score by 19.0 points, compared with the baseline T5 without grammatical terms on the development dataset. Furthermore, by using predicted grammatical terms, our model could improve the manual evaluation score by 2.33 points, compared with the baseline T5 without grammatical terms on the test dataset.

pdf bib
AIWolfDial 2023: Summary of Natural Language Division of 5th International AIWolf Contest
Yoshinobu Kano | Neo Watanabe | Kaito Kagaminuma | Claus Aranha | Jaewon Lee | Benedek Hauer | Hisaichi Shibata | Soichiro Miki | Yuta Nakamura | Takuya Okubo | Soga Shigemura | Rei Ito | Kazuki Takashima | Tomoki Fukuda | Masahiro Wakutani | Tomoya Hatanaka | Mami Uchida | Mikio Abe | Akihiro Mikami | Takashi Otsuki | Zhiyang Qi | Kei Harada | Michimasa Inaba | Daisuke Katagami | Hirotaka Osawa | Fujio Toriumi

We held our 5th annual AIWolf international contest to automatically play the Werewolf game “Mafia”, where players try finding liars via conversations, aiming at promoting developments in creating agents of more natural conversations in higher level, such as longer contexts, personal relationships, semantics, pragmatics, and logics, revealing the capabilities and limits of the generative AIs. In our Natural Language Division of the contest, we had six Japanese speaking agents from five teams, and three English speaking agents, to mutually run games. By using the game logs, We performed human subjective evaluations and detailed log analysis. We found that the entire system performance has largely improved over the previous year, due to the recent advantages of the LLMs. However, it is not perfect at all yet; the generated talks are sometimes inconsistent with the game actions, it is still doubtful that the agents could infer roles by logics rather than superficial utterance generations. It is not explicitly observed in this log but it would be still difficult to make an agent telling a lie, pretend as a villager but it has an opposite goal inside. Our future work includes to reveal the capability of the LLMs, whether they can make the duality of the “liar”, in other words, holding a “true” and a “false” circumstances of the agent at the same time, even holding what these circumstances look like from other agents.

pdf bib
Team Zoom @ AutoMin 2023: Utilizing Topic Segmentation And LLM Data Augmentation For Long-Form Meeting Summarization
Felix Schneider | Marco Turchi

This paper describes Zoom’s submission to the Second Shared Task on Automatic Minuting at INLG 2023. We participated in Task A: generating abstractive summaries of meetings. Our final submission was a transformer model utilizing data from a similar domain and data augmentation by large language models, as well as content-based segmentation. The model produces summaries covering meeting topics and next steps and performs comparably to a large language model at a fraction of the cost. We also find that re-summarizing the summaries with the same model allows for an alternative, shorter summary.

pdf bib
Team Synapse @ AutoMin 2023: Leveraging BART-Based Models for Automatic Meeting Minuting
Kristýna Klesnilová | Michelle Elizabeth

This paper describes the approach we followed for our submission to the Second Run of the Automatic Minuting Shared Task. Our methodology centers around employing BART-based models fine-tuned on diverse summarization corpora. The segmented meeting transcripts are fed into the models, generating summaries that are subsequently combined and formatted into the final meeting minutes.

pdf bib
Team Iterate @ AutoMin 2023 - Experiments with Iterative Minuting
František Kmječ | Ondřej Bojar

This report describes the development of our system for automatic minuting created for the AutoMin 2023 Task A. As a baseline, we utilize a system based on the BART encoder-decoder model paired with a preprocessing pipeline similar to the one introduced by the winning solutions at AutoMin 2021. We then further explore the possibilities for iterative summarization by constructing an iterative minuting dataset from the provided data, finetuning on it and feeding the model previously generated minutes. We also experiment with adding more context by utilizing the Longformer encoder-decoder model and finetuning it on the SAMSum dataset. Our submitted solution is of the baseline approach, since we were unable to match its performance with our iterative variants. With the baseline, we achieve a ROUGE-1 score of 0.368 on the ELITR minuting corpus development set. We finally explore the performance of Vicuna 13B quantized language model for summarization.

pdf bib
Darbarer @ AutoMin2023: Transcription simplification for concise minute generation from multi-party conversations
Ismaël Rousseau | Loïc Fosse | Youness Dkhissi | Geraldine Damnati | Gwénolé Lecorvé

This document reports the approach of our team Darbarer for the main task (Task A) of the AutoMin 2023 challenge. Our system is composed of four main modules. The first module relies on a text simplification model aiming at standardizing the utterances of the conversation and compressing the input in order to focus on informative content. The second module handles summarization by employing a straightforward segmentation strategy and a fine-tuned BART-based generative model. Then a titling module has been trained in order to propose a short description of each summarized block. Lastly, we apply a post-processing step aimed at enhancing readability through specific formatting rules. Our contributions lie in the first, third and last steps. Our system generates precise and concise minutes. We provide a detailed description of our modules, discuss the difficulty of evaluating their impact and propose an analysis of observed errors in our generated minutes.

pdf bib
Team NTR @ AutoMin 2023: Dolly LLM Improves Minuting Performance, Semantic Segmentation Doesn’t
Eugene Borisov | Nikolay Mikhaylovskiy

This paper documents the approach of Team NTR for the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023. The goal of this work is to develop a module for automatic generation of meeting minutes based on a meeting transcript text produced by an Automated Speech Recognition (ASR) system (Task A). We consider minuting as a supervised machine learning task on pairs of texts: the transcript of the meeting and its minutes. We use a two-staged minuting pipeline that consists of segmentation and summarization. We experiment with semantic segmentation and multi-language approaches and Large Language Model Dolly, and achieve Rouge1-F of 0.2455 and BERT-Score of 0.8063 on the English part of ELITR test set and Rouge1-F of 0.2430 and BERT-Score of 0.8332 on the EuroParl dev set with the submitted Naive Segmentation + Dolly7b pipeline.

pdf bib
Overview of the Second Shared Task on Automatic Minuting (AutoMin) at INLG 2023
Tirthankar Ghosal | Ondřej Bojar | Marie Hledíková | Tom Kocmi | Anna Nedoluzhko

In this article, we report the findings of the second shared task on Automatic Minuting (AutoMin) held as a Generation Challenge at the 16th International Natural Language Generation (INLG) Conference 2023. The second Automatic Minuting shared task is a successor to the first AutoMin which took place in 2021. The primary objective of the AutoMin shared task is to garner participation of the speech and natural language processing and generation community to create automatic methods for generating minutes from multi-party meetings. Five teams from diverse backgrounds participated in the shared task this year. A lot has changed in the Generative AI landscape since the last AutoMin especially with the emergence and wide adoption of Large Language Models (LLMs) to different downstream tasks. Most of the contributions are based on some form of an LLM and we are also adding current outputs of GPT4 as a benchmark. Furthermore, we examine the applicability of GPT-4 for automatic scoring of minutes. Compared to the previous instance of AutoMin, we also add another domain, the minutes for EU Parliament sessions, and we experiment with a more fine-grained manual evaluation. More details on the event can be found at