PGTask: Introducing the Task of Profile Generation from Dialogues

Recent approaches have attempted to personalize dialogue systems by incorporating profile information into models. However, this knowledge is scarce and difficult to obtain, which makes the extraction/generation of profile information from dialogues a fundamental asset. To overcome this limitation, we introduce the Profile Generation Task (PGTask). We contribute a new dataset for this problem, comprising profile sentences aligned with related utterances extracted from a corpus of dialogues. Furthermore, using state-of-the-art methods, we provide a benchmark for profile generation on this novel dataset. Our experiments disclose the challenges of profile generation, and we hope this work opens a new research direction.


Introduction
Building conversational systems that mimic human attributes has long been a goal in Natural Language Processing (NLP). Various works have attempted to leverage speaker profile information to improve the consistency of dialogue generation models (Wu et al., 2020; Xu et al., 2022; Cao et al., 2022). However, for dialogue systems, this sort of information is scarce and requires annotation efforts that are expensive to obtain, so there is a need for methods that automatically gather this knowledge from dialogues. Zhang et al. (2018) introduced PersonaChat, a dataset comprising a collection of profile sentences (persona) that reflect each speaker's individual characteristics and personal facts. These profiles serve as a knowledge base for promoting consistency between utterances from speakers, and various recent dialogue models have incorporated this information using diverse techniques (Song et al., 2020, 2021; Cao et al., 2022).
Few works have attempted to infer profile information from PersonaChat dialogues. Gu et al. (2021) restructured PersonaChat and built the Persona Detection Task, where the goal is to retrieve the correct persona amongst a set of distractor personas. Although it introduces an interesting research path, this task is limited to a set of pre-defined personas, which is not suitable for extracting profile sentences from unseen conversational data. Cao et al. (2022) also manipulate PersonaChat to incorporate model-agnostic personas into the dialogue generation task. Nevertheless, for the profile generation task, PersonaChat is structured in a profile-to-dialogue manner and lacks information about the corresponding profile sentence per turn, which becomes a challenge when the task is to extract profile information from utterances.
In this work, we introduce the PGTask, where the goal is to generate profile sentences given speaker utterances. For this, we create a new dataset, the Profile Generation Dataset (PGDataset) 1 , which relates utterances with the respective profile sentences upon the existing PersonaChat corpus. In Figure 1, we can observe several examples of relations between profile sentences and the corresponding speaker's utterance. Notice, however, that the task is more challenging than just finding, within the dialogues, utterances that highly relate to each profile sentence. For instance, the profile sentence "I like all genres of music." is probably at the origin of the utterance "Yes, sometimes I also listen to classical music.", but we cannot extract that profile sentence from that single utterance (the goal of PGTask).
We framed our problem as an entailment classification task and, after human feedback, we reached the final PGDataset. Finally, we provide results from three state-of-the-art models trained and evaluated on the proposed dataset.

Modeling Entailment Relations
In the Natural Language Inference (NLI) task, the goal is to classify the relationship between a pair of premise and hypothesis sentences into three classes: entailment (E), neutral (N), and contradiction (C). Welleck et al. (2019) extended the NLI task to the dialogue setting and introduced the Dialogue Natural Language Inference (DNLI) dataset, where the input sentences consist of dialogue utterances from PersonaChat. We adopt this procedure and train a model M_NLI to identify the correct profile sentences for each utterance in a dialogue.
Consider two sentences s_i and s_j that are concatenated into the input x = {s_i, s_j}. First, we utilize RoBERTa to obtain a hidden representation h from the input x. Then, we include a softmax classifier on top of RoBERTa to obtain the probability distribution over the set of possible classes. Formally, we obtain the probability of label y ∈ {C, N, E} with:

P(y | x) = softmax(W h),    (1)

where W is the learnable parameter matrix of the classification layer. We fine-tune both the RoBERTa and W parameters by maximizing the log-probability of the correct label.
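As a minimal sketch of this classification head, the snippet below applies a softmax over the three NLI labels to a linear projection of a hidden vector. The random vector h is only a stand-in for RoBERTa's pooled representation, and the dimensions are illustrative assumptions:

```python
import numpy as np

LABELS = ["C", "N", "E"]  # contradiction, neutral, entailment

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h, W):
    """Return the label distribution softmax(W h) over {C, N, E}."""
    return dict(zip(LABELS, softmax(W @ h)))

# Toy dimensions: hidden size 4, three output classes.
rng = np.random.default_rng(0)
h = rng.normal(size=4)       # stand-in for RoBERTa's sentence representation
W = rng.normal(size=(3, 4))  # learnable classification matrix
dist = classify(h, W)
```

In the paper's setting, h would come from the fine-tuned encoder and W would be trained jointly with it by minimizing cross entropy on the gold labels.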
We experiment with two different settings, where we fine-tune RoBERTa only on DNLI, and on both MNLI (Williams et al., 2018), a benchmark multi-genre NLI dataset, and DNLI for better generalization. Details are provided in Appendix A. Table 1 shows the results on the test set, where the latter achieves higher accuracy and is selected as the annotation model.

Dataset          Accuracy (%)
DNLI             91.24
MNLI + DNLI      91.75

Table 1: Test-set accuracy of the NLI annotation model.

Dataset Annotation
In PersonaChat (Zhang et al., 2018), each dialogue carries a set of profile sentences for both speakers. Consider a set of n utterances from a speaker, U = {u_1, u_2, ..., u_n}, a set of k profile sentences P = {p_1, p_2, ..., p_k} from the same speaker, and the dialogue NLI model M_NLI from Section 2.1. Then, at time step t, we can determine one or more profile sentences s_t related to utterance u_t using:

s_t = {p ∈ P : M_NLI(u_t, p) = E}.    (2)

In Equation 2, the profile sentences are gathered by considering the entailed cases between the utterances and the profile sentences, where each utterance can be associated with more than one profile sentence. In Table 2, we provide an extract from the PGDataset.

Utterance                                                      Profile Sentences
I enjoy hanging with my mother she is my best friend.          My mom is my best friend.
I am almost done, I only have two years left in law school.    I have got two more years in college. I study law.

Table 2: Extract from the PGDataset.
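The annotation step, keeping only profile sentences the NLI model classifies as entailed for a given utterance, can be sketched as below. The keyword-overlap model is only a toy stand-in for the fine-tuned RoBERTa classifier:

```python
def entailed_profiles(utterance, profiles, nli_model):
    """Keep the profile sentences the NLI model classifies as entailed (E)."""
    return [p for p in profiles if nli_model(utterance, p) == "E"]

def toy_nli(u, p):
    """Toy stand-in for M_NLI: 'entailed' iff a content word is shared."""
    shared = (set(u.lower().split()) & set(p.lower().split())) - {"i", "my"}
    return "E" if shared else "N"

profiles = ["My mom is my best friend.", "I study law."]
selected = entailed_profiles(
    "I enjoy hanging with my mother she is my best friend.",
    profiles, toy_nli)
```

Here `selected` keeps only the first profile sentence, mirroring how each utterance in the PGDataset is paired with the subset of its speaker's persona that it entails.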

Human Annotations
In the profile generation task, the profile must represent information that can actually be extracted from the dialogue utterance, so the direction of the correlation between the utterance and the profile sentence must be valid.
To assess the quality of the automatic annotations from our model, we resort to human evaluation. For all the pairs classified as entailed in Equation 2, we measure the confidence by inspecting the softmax probability assigned to the entailment class. Our intuition is that a weak confidence when classifying a profile sentence as entailed corresponds to a weak or incorrect correlation and can be removed from the dataset. In Figure 2, we plot the distribution of the scores from the entailment class for all points obtained from Equation 2.
To determine whether a higher confidence value corresponds to a correct example, we randomly select 100 samples from 3 intervals: [50, 70], ]70, 90], and ]90, 100]. We asked 3 expert annotators from our department to "mark with an X if the profile sentence could be extracted from the given utterance". The quality of the samples is measured by the fraction of samples marked by the annotators (accuracy). The agreement rate between annotators was 86.66% and the average accuracy for each interval was 8.33%, 12.33%, and 51.67%, respectively. The results show that as the confidence of the model grows, the correlation between the profile sentence and the utterance also increases.
After inspecting the results from the annotators, we observed that most of the marked samples had more than 99% confidence. We asked for a second round of annotations with 100 samples, but now only for samples with more than 99% confidence. The agreement rate between annotators was 91% and the average accuracy was 87.33%, a significantly higher score compared to the ]90, 100] interval. We decided, thus, that the PGDataset only considers the samples that the model classified with more than 99% confidence.
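The resulting filtering step is straightforward; the sketch below keeps only candidate (utterance, profile) pairs whose entailment probability exceeds the 0.99 cut-off (the example pairs and scores are illustrative, not drawn from the dataset):

```python
def filter_by_confidence(candidates, threshold=0.99):
    """Keep (utterance, profile) pairs whose entailment probability
    exceeds the threshold chosen after human annotation."""
    return [(u, p) for u, p, conf in candidates if conf > threshold]

# Hypothetical candidates: (utterance, profile sentence, P(entailment)).
candidates = [
    ("I only have two years left in law school.", "I study law.", 0.997),
    ("Yes, sometimes I also listen to classical music.",
     "I like all genres of music.", 0.62),
]
kept = filter_by_confidence(candidates)
```

The second pair illustrates the failure mode discussed in the Introduction: the utterance is plausibly caused by the profile sentence, but the profile cannot be extracted from that single utterance, and its lower confidence removes it.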

PGDataset Statistics
In Table 3, we provide the dataset statistics for the gathered samples.

Benchmarking the PGTask
In this task, the goal is to generate a profile sentence conditioned on an utterance. Transformer-based decoders have achieved substantial progress in various NLP tasks (Radford et al., 2019). We leverage these models and rely on a causal language modeling (CLM) objective for our profile generation task. More precisely, considering a sentence s = {w_1, ..., w_n} composed of n words, in CLM, the maximum likelihood objective over s is:

L(s) = Σ_{i=1}^{n} log P(w_i | w_1, ..., w_{i-1}).    (3)

For our task, we are only interested in calculating the loss for the words of the profile sentence conditioned on the utterance. Considering an utterance u = {w^u_1, ..., w^u_m} and a profile sentence p = {w^p_1, ..., w^p_k}, we redefine the objective from Equation 3:

L(p | u) = Σ_{i=1}^{k} log P(w^p_i | w^u_1, ..., w^u_m, w^p_1, ..., w^p_{i-1}).    (4)

As seen in Equation 4, the loss is only calculated for the generation of the profile sentences. In the model's input, we separate the utterance and profile sentences using a special token <gen> and, as there can be more than one profile sentence, we add <sep> between the profile sentences.

Table 4: Generation results for models with and without fine-tuning (FT) on the PGDataset. The results presented are the average score of 5 runs. The scores range between 0 and 1.
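This input construction and loss masking can be sketched as below. The token ids are toy values, and the choice to also supervise the <sep> token is an assumption of this sketch; masking with -100 follows the common PyTorch/HuggingFace ignore-index convention:

```python
IGNORE = -100  # conventional ignore index for cross-entropy loss

def build_example(utt_ids, profile_ids_list, gen_id, sep_id):
    """Concatenate utterance and profile token ids for CLM training.
    Utterance tokens (and <gen>) get label -100 so the loss is computed
    only over the profile tokens, as in the masked CLM objective."""
    input_ids = list(utt_ids) + [gen_id]
    labels = [IGNORE] * len(input_ids)          # no loss on the utterance
    for i, pids in enumerate(profile_ids_list):
        if i > 0:                               # <sep> between profile sentences
            input_ids.append(sep_id)
            labels.append(sep_id)               # assumption: <sep> is supervised
        input_ids.extend(pids)
        labels.extend(pids)                     # loss only on profile tokens
    return input_ids, labels

# Toy ids: utterance = [5, 6, 7], two profile sentences, <gen>=1, <sep>=2.
inp, lab = build_example([5, 6, 7], [[8, 9], [10]], gen_id=1, sep_id=2)
```

After this step, `inp` is [5, 6, 7, 1, 8, 9, 2, 10] while `lab` masks the first four positions, so a standard cross-entropy loss with ignore index -100 implements the objective of restricting the loss to profile tokens.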

Experiments
In this section, we evaluate Transformer decoders on the novel dataset and provide benchmark results for future research. Additional experimental details are provided in Appendix B.1.

Models
GPT2 This model has achieved state-of-the-art results in various generation tasks (Radford et al., 2019). We consider two different pre-trained versions that differ in size, the gpt2-small and gpt2-medium (details in Appendix B.2).
DistilGPT2 This is a distilled version of GPT2, trained under the supervision of GPT2 (Hinton et al., 2015). The distilgpt2 model is about half the size of GPT2 while still achieving competitive performance on various NLP tasks.

Metrics
We follow common practices for text generation and report BLEU (Papineni et al., 2002) and ROUGE (Lin and Hovy, 2002), metrics that, respectively, measure the precision and recall between the generated and the golden text. Additionally, we employ BERTScore (Zhang et al., 2019), an automatic metric that leverages BERT's (Kenton and Toutanova, 2019) contextual embeddings and matches words in candidate and golden sentences using cosine similarity.
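The precision/recall distinction between the two n-gram metrics can be illustrated at the unigram level. The sketch below is a simplified BLEU-1/ROUGE-1 (no brevity penalty, no higher-order n-grams), using one of the generated examples from the appendix:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """BLEU-style clipped unigram precision: overlap / candidate length."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, r[w]) for w, n in c.items())
    return overlap / max(sum(c.values()), 1)

def unigram_recall(candidate, reference):
    """ROUGE-1-style recall: fraction of reference words recovered."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(n, c[w]) for w, n in r.items())
    return overlap / max(sum(r.values()), 1)

gold = "i enjoy reading mysteries"
gen = "i love reading mysteries in my free time"
p = unigram_precision(gen, gold)  # 3 shared words / 8 generated words
r = unigram_recall(gen, gold)     # 3 shared words / 4 reference words
```

The full metrics reported in Table 4 additionally account for higher-order n-grams (BLEU), longest common subsequences (ROUGE), and contextual embeddings (BERTScore), but the precision-versus-recall intuition is the same.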

Results
In Table 4, we provide benchmark results for the PGTask. We observe that fine-tuning the models has a great impact on the overall performance, where gpt2-small achieves the highest scores on all metrics except BERTScore (by a minimal margin). In Appendix B.3, we provide some generated examples from the evaluated models.

Related Work
Recent research has focused on building personalized dialogue systems using profile information. Li et al. (2016) proposed a neural conversational model to capture background information and speaking style from interlocutors in dialogue. Zhang et al. (2018) introduced a dataset composed of personas, which are essentially 3 to 5 profile sentences describing the speaker's profile. Zheng et al. (2019) studied how to include profile information such as age, location, and interests by explicitly incorporating this knowledge into the sequence-to-sequence framework. Few works have attempted to identify profile knowledge from conversational data. Gu et al. (2021) introduced a framework for detecting the correct profile amongst a set of distractor profiles. Nevertheless, the authors do not consider the correlation between utterances and profile sentences. Cao et al. (2022) proposed a data manipulation method to construct distilled and diversified dialogue data containing profile information and leverage it in the dialogue generation task.

Conclusion
We propose the PGTask and contribute the PGDataset, a dataset with more than 30,000 pairs of utterances and profile sentences built with the feedback of human annotators. In addition, we train state-of-the-art models and achieve promising results on the proposed task. We hope this new line of research will help the task of personalizing dialogues, although automatically extracting profiles from dialogues is valuable in itself.

A Fine-Tuning RoBERTa
We fine-tune a pre-trained roberta-base model with 12 layers, 768 hidden units, 12 attention heads, and 125M parameters on 1 NVIDIA GeForce RTX 3080 to minimize the cross entropy. We use the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 5e-5. The batch size was 32; we train for 20 epochs and stop early after 5 epochs without an increase in the validation accuracy.

B.1 Experimental Details
We perform 5 runs for each model on 1 NVIDIA GeForce RTX 3080 using different seed values and calculate the average score for all metrics. Models are trained to minimize the cross entropy using the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 5e-5. For gpt2-small and distilgpt2, the batch size was 16, while for gpt2-medium the batch size was 4 with 4 gradient accumulation steps. We train for 20 epochs with early stopping, where training is stopped after 5 epochs without a decrease in the validation loss. We generate the profile sentences with a maximum length of 50 and perform greedy sampling, i.e., selecting the next word with the highest probability. All experiments are implemented using the HuggingFace and PyTorch libraries.
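The greedy sampling used at decoding time can be sketched as a simple loop that repeatedly takes the argmax of the next-token distribution until an end-of-sequence token or the length limit is reached. The step function below is a toy stand-in for the fine-tuned language model:

```python
def greedy_decode(step_fn, max_len=50, eos=0):
    """Greedy sampling: at each step, pick the highest-probability token."""
    out = []
    for _ in range(max_len):
        probs = step_fn(out)  # next-token distribution given the prefix
        tok = max(range(len(probs)), key=probs.__getitem__)
        if tok == eos:        # stop at the end-of-sequence token
            break
        out.append(tok)
    return out

# Toy model over 3 tokens: favours token 2, then emits EOS after 3 steps.
def toy_step(prefix):
    return [0.9, 0.05, 0.05] if len(prefix) >= 3 else [0.1, 0.2, 0.7]

decoded = greedy_decode(toy_step)
```

In the actual experiments, the prefix starts from the utterance followed by the <gen> token, and the decoder's softmax output plays the role of `step_fn`.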
B.3 Generated Examples

However, the evaluation scores also show that the systems are still far from always extracting the correct profiles. In Table 6, we show an example where all models generated unrelated profile information and failed to recognize that the speaker dropped out of high school.
Dialogue Utterance: sorry, i do not like music, i like reading mystery books.
Golden Profile Sentences: i enjoy reading mysteries.
distilgpt2: i love to read mystery novels.
gpt2-small: i love reading mysteries in my free time.
gpt2-medium: i read mystery novels.

Table 5: Generated example #1 from fine-tuned models.

Dialogue Utterance: dropping out of high school was a bad idea. the landlord just called.
Golden Profile Sentences: i dropped out of high school.
distilgpt2: i just got a job at the elementary school in new england.
gpt2-small: my parents got a new job.
gpt2-medium: i just graduated high school.

Table 6: Generated example #2 from fine-tuned models.