GupShup: Summarizing Open-Domain Code-Switched Conversations

Code-switching is the communication phenomenon where the speakers switch between different languages during a conversation. With the widespread adoption of conversational agents and chat platforms, code-switching has become an integral part of written conversations in many multi-lingual communities worldwide. Therefore, it is essential to develop techniques for understanding and summarizing these conversations. Towards this objective, we introduce the task of abstractive summarization of Hindi-English (Hi-En) code-switched conversations. We also develop the first code-switched conversation summarization dataset - GupShup, which contains over 6,800 Hi-En conversations and their corresponding human-annotated summaries in English (En) and Hi-En. We present a detailed account of the entire data collection and annotation process. We analyze the dataset using various code-switching statistics. We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation. Our results show that multi-lingual mBART and multi-view seq2seq models obtain the best performances on this new dataset. We also conduct an extensive qualitative analysis to provide insight into the models and some of their shortcomings.


Introduction
Conversation summarization is the process of generating a condensed version of a given conversation while preserving its most salient aspects. With the extensive use of various chat applications such as messaging apps and virtual assistants (Klopfenstein et al., 2017), there has been a growing interest in the abstractive summarization of written conversations (Mehdad et al., 2014; Goo and Chen, 2018; Zhao et al., 2019). Automatic conversation summarization has potential applications in various fields such as healthcare (Song et al., 2020), call centers (Alam et al., 2016), education (Joshi and Rosé, 2007), and many other areas (Feng et al., 2021).

* Email ids of the corresponding authors: lm4428@nyu.edu, Debanjan.Mahata@moodys.com, rajivratn@iiitd.ac.in. Authors 1, 2, 3, and 4 have equal contributions. Debanjan Mahata participated in this work as an Adjunct Faculty at IIIT-Delhi.
One of the biggest challenges in conversation summarization has been the lack of large datasets with human-annotated summaries. Most researchers evaluate their summarization techniques on transcriptions of the AMI (Carletta et al., 2005) or ICSI (Janin et al., 2003) meeting corpora, using the meeting topics as summaries. These corpora are very useful for various speech-related research problems, but they do not represent written conversations in chat applications. Recently, (Gliwa et al., 2019) published the SAMSum corpus, which contains over 16,000 written English conversations and their corresponding manually annotated summaries. Though these conversations were not extracted from actual chat applications, they were created by linguists to replicate natural conversations. To our knowledge, this is the largest summarization dataset for written conversations.

Table 1: Example of a code-switched Hi-En conversation and the corresponding En summary. In the original, color-coding distinguishes En words, transliterated Hi words, and language-agnostic tokens such as named entities and punctuation marks.

The SAMSum corpus is monolingual (English); therefore, any model trained on this dataset may not adapt effectively to multi-lingual, and especially code-switched, conversations, where speakers alternate between different languages within the scope of a conversation or even an utterance (Gumperz, 1977; Muysken et al., 2000; Myers-Scotton, 1997). Code-switching is commonly observed during interactions between peers who are fluent in multiple languages. For example, in the Indian subcontinent, it is common for people to alternate between English and other regional languages like Hindi over the course of a single conversation. Code-switching is an integral part of both written and spoken conversations for various multi-lingual communities across the world (Auer, 2013).
Developing models that can accurately process code-switched text is essential to the proliferation of NLP technologies in these communities and contributes towards the diversity and inclusivity of language resources (Joshi et al., 2020). However, building such models requires high-quality human-curated datasets. This paper introduces the task of abstractive summarization of open-domain code-switched written conversations. Namely, given a multi-party conversation in Hi-En, the objective is to generate a summary in English, as shown in Table 1. A multi-party code-switched conversation C is a sequence of n utterances {u_1, u_2, ..., u_n}, where the i-th utterance u_i is written by p_j, one of the k participants. The utterance u_i is a sequence of m tokens {x_1, x_2, ..., x_m}, where each token is either in English or transliterated from Hindi. The goal of the code-switched conversation summarization task is to generate an English summary that captures the most salient aspects of the conversation. These English summaries could serve as input to other downstream NLP models, often trained on English data, to perform various tasks such as intent classification, question answering, and item recommendation.

To facilitate this task, we present a new corpus named GupShup¹, which contains over 6,800 Hi-En² code-switched conversations and corresponding human-annotated summaries in En and Hi-En. We build this dataset by manually translating a subset of conversations and summaries from the SAMSum corpus (Gliwa et al., 2019) from En to Hi-En. This effort has produced not only the first code-switched conversation summarization corpus but also a parallel corpus of En and Hi-En conversations with 76,330 utterances.

¹ The word 'gupshup' means light-hearted conversation in Hindi and Urdu. The dataset is named 'GupShup' to reflect the nature of the content of the conversations and has no connection with the messaging platform gupshup.io.

² Throughout the paper we denote code-switched Hindi-English conversations by Hi-En; this is also popularly known as Hinglish.

Following are the main contributions of this work:

• We present the first open-domain code-switched conversation summarization dataset - GupShup - with over 6,800 Hi-En conversations and their corresponding annotated summaries in En and Hi-En.
• We characterize the complexity of this dataset through various code-switching statistics.
• Through rigorous experimental work, we benchmark the performances of various state-of-the-art abstractive summarization models.
• We perform a thorough human evaluation and qualitative analysis to provide insight into the strengths and shortcomings of different language models when dealing with code-switched data.
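The conversation representation formalized above (a sequence of utterances, each attributed to one of k participants) maps directly onto a simple data structure. The following is an illustrative sketch; the class and field names are our own and are not part of the dataset release:

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    speaker: str    # one of the k participants p_1 ... p_k
    tokens: list    # mix of English and transliterated Hindi tokens

@dataclass
class Conversation:
    utterances: list = field(default_factory=list)  # u_1 ... u_n in order

    def participants(self):
        # The set of distinct speakers appearing in the conversation.
        return {u.speaker for u in self.utterances}

# A toy two-turn Hi-En conversation in this representation.
conv = Conversation([
    Utterance("John", "main office mein hounga".split()),
    Utterance("Mary", "theek hai see you".split()),
])
```

A summarization model would consume such a structure (usually flattened to "speaker: utterance" lines) and emit an English summary string.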

Background
In the linguistics community, code-switching typically refers to the change of language or grammatical systems from one utterance to another within the same conversation (Gumperz, 1982). On the other hand, code-mixing refers to the use of linguistic units such as phrases, words, or morphemes of one language in an utterance of another language (Myers-Scotton, 1997; Myers-Scotton et al., 2002). In other words, code-switching is an inter-utterance phenomenon, and code-mixing is an intra-utterance phenomenon. In this paper, however, we use the term code-switching to refer to both concepts.
Code-switching has started to gain traction among computational linguists over the last few years (Barman et al., 2014b; Bali et al., 2014), who have developed datasets for many interesting problems such as language identification (Das and Gambäck, 2014), part-of-speech tagging (Barman et al., 2014a), question answering (Chandu et al., 2015), and named entity recognition (Singh et al., 2018). Additionally, researchers have started to develop objective metrics that characterize the complexity of code-switching in a given corpus (Gambäck and Das, 2016; Guzmán et al., 2017). For a thorough review of datasets and other developments in this space, we recommend the review paper of (Sitaram et al., 2019).
Most of the code-switched datasets contain individual posts or comments from social media applications like Twitter and Facebook, annotated for various tasks, as listed in Table 2.

Table 2: Code-switched datasets and the tasks for which they were developed.
(Das and Gambäck, 2014): Language identification and POS tagging
(Barman et al., 2014a): POS tagging
(Chandu et al., 2015): Question answering
(Jamatia et al., 2015): Language identification
(Jamatia et al., 2016): Language identification
(Banerjee et al., 2016): Question answering
(Chakma and Das, 2016): Information retrieval
(Patro et al., 2017): Language identification
(Bohra et al., 2018): Hate-speech text classification
(Gupta et al., 2018): Question answering
(Banerjee et al., 2018): Closed-domain conversation system
(Chandu et al., 2018): Question answering
(Patra et al., 2018): Sentiment analysis
(Singh et al., 2018): Named entity recognition
(Bhat et al., 2018): Dependency parsing
(Bawa et al., 2018): Accommodation quantification
(Khanuja et al., 2020): Natural language inference
GupShup (this work): Open-domain conversation summarization

The few code-switched conversational corpora are: (1) the corpus of (Margaret et al., 2014), which contains audio recordings and transcripts of informal Spanish-English (Sp-En) multi-party conversations; (2) the COMMONAMIGOS corpus (Ahn et al., 2020), which contains 587 Sp-En conversations between human users and a dialogue system; and (3) the DSTC2 corpus (Banerjee et al., 2018), which contains Hi-En translations of a restaurant reservation dataset. As shown in Table 2, there are only three datasets with Hindi-English code-switched conversations, but none of them contain summaries.

A large majority of research in abstractive summarization focuses on news articles (Hermann et al., 2015; Grusky et al., 2018a; Narayan et al., 2018a) and scientific papers (Cohan et al., 2018), because of the availability of large benchmark datasets. The task of summarizing open-domain multi-party conversations was not investigated until recently, with the introduction of SAMSum (Gliwa et al., 2019).
(Chen and Yang, 2020) obtained state-of-the-art results on this corpus with their multi-view sequence-to-sequence model, which extracts different views of the conversation, encodes them in an LSTM-based model, and then uses an attention-based mechanism in the decoding phase to generate the summary.
Conversation summarization is a challenging research problem because conversations are filled with complex linguistic phenomena such as informality, verbosity, interruptions, backchanneling, reconfirmations, hesitations, and many implicit connotations (Sacks et al., 1978). Therefore, it is difficult for current summarization approaches to identify the most relevant and salient aspects of a conversation (Chen and Yang, 2020). The code-switched nature of our data poses additional challenges. We believe the task's challenging nature will encourage the development of better multi-lingual language models and facilitate a new direction of research in summarization, leading to innovations in modeling architectures, especially for code-switched text.

Data Collection
The goal of our annotation process is to build a Hi-En code-switched conversation summarization dataset. We first considered the option of creating summaries for an existing code-switched conversational dataset like DSTC2 (Banerjee et al., 2018). Though this dataset is substantially large, with over 50,000 utterances, it is not open-domain and only contains human-computer conversations focused on restaurant reservations, and thus lacks linguistic diversity. We therefore chose the option of manually translating the SAMSum corpus (Gliwa et al., 2019) from En to Hi-En.

Annotation Process - We hired eight annotators who are fluent in both Hindi and English. We first explained the concept of code-switching and provided them with a few reference examples annotated by the authors. Based on our interactions with the annotators, we observed that Hi-En code-switching was an integral part of their vernacular. They also frequently used code-switching on social media and in chat applications.
We first provided each annotator with a random sample of ten conversations. We instructed them to first go through the conversation and the corresponding summary in English and then translate the content to Hi-En, assuming it was an interaction between themselves and their friends. However, they could not introduce or remove any utterances during translation; rather, they had to perform an utterance-by-utterance translation. We asked the annotators to transcribe the resulting conversations only in Romanized text, i.e., to transliterate the Hindi words. They used the same process for translating the summaries.
After the annotators completed translating the initial random samples, we provided feedback in terms of the format and organization of the data. Once they were comfortable with the process, we assigned them random batches of conversations, and they worked independently based on their schedules. The contributions of each annotator were mainly driven by their availability. Due to time and resource constraints, we only had one translation for a given source conversation. The entire annotation process lasted around three months, at the end of which we had translated 6,831 conversations containing 76,330 utterances. To our knowledge, this is also the largest parallel corpus for the Hi-En and En language pair, containing 109,346 sentences, of which 48,578 are code-switched.

Table 3 shows a sample conversation and summary in English and the corresponding code-switched translations. As demonstrated in this example, the annotators preserved the use of punctuation and emojis in the original text. The sample also demonstrates different types of code-switching patterns. For example, in the first utterance, the English term office is inserted in the middle of a transliterated Hindi sentence; the same applies to the phrase 1 hour in the second utterance. In the fourth utterance, the speaker Mary switches from Hindi to English in the middle of an utterance. In the last utterance, the speaker Karen switches entirely to English even though she initiated the conversation in Hi-En.

Hindi and many other Indian languages inflect verbs based on the gender of the speaker (Kumar et al., 2019). For example, the sentence I read, when said by a speaker of masculine gender in Hindi, would be: main padhtaa hoon, but when said by a speaker of feminine gender would be: main padhtee hoon. The verb padh is inflected by the gender of the speaker.
During the annotation process, we did not provide any explicit instructions about this inflection, but the annotators used the speakers' names or conversational context to infer gender and used that information for translation. For example, the term hounga in the second utterance of the conversation in Table 1 reflects John's perceived masculine gender; likewise, the term sakti in the sixth utterance reflects Mary's perceived gender.

Corpus Analysis

Figure 1 shows the distribution of the number of utterances per conversation. GupShup has 76,330 utterances; the shortest conversation has one utterance, the longest has 31 utterances, and the average length of a conversation is 11.17 utterances. The English version of the corpus has 28.6 words per utterance and 19,252 unique words, whereas GupShup has 31.1 words per utterance and 25,865 unique words. A large portion of the data (73.6% of conversations) had only two participants, 18.7% of the conversations had three participants, 5.2% had four, and the remaining conversations had more than four participants.

Language Tagging - Since GupShup is a code-switched corpus, it is essential to analyze and quantify the complexity of code-switching in it, which requires us first to identify the language associated with each token. Since this is a parallel corpus, we used information from the source utterances to determine non-English tokens in the translated utterances. More specifically, for a given pair of utterances, we first removed all punctuation marks, emojis, and numerical tokens. We then identified named entities in the English utterance, and if these entities also appeared in the translated utterance, the corresponding tokens were excluded from language tagging. Of the remaining tokens, any that appeared in the English utterance were tagged as English.

We repeated the same process on the remaining tokens but with their lemmatized versions, because we observed that the annotators sometimes used the English word from a source utterance in a different part of speech. For example, the phrase "a few years after we graduated" was translated to "graduation ke kuch saal baad"; here graduation can be accurately tagged as English using lemmatization. Lastly, the remaining tokens in the Hi-En utterance were tagged as Hindi.
We observed that this process captured most English tokens in the translated utterances, except when the annotators introduced a new English token. For example, one annotator translated the phrase "I have been thinking for a while" to "Main kaafi time se soch rahi hu", where they introduced the token time. However, we observed that this was a rare phenomenon and did not sway the analysis significantly.
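The tagging procedure can be sketched as follows. This is a simplified illustration: a toy suffix-stripping rule stands in for a real lemmatizer, and named entities are assumed to be supplied externally (in the actual pipeline they come from an NER model run on the English utterance):

```python
import string

def simple_lemma(tok):
    # Toy suffix-stripping lemmatizer; a stand-in for a real one (assumption).
    for suffix in ("ed", "ing", "ion", "s"):
        if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
            return tok[: -len(suffix)]
    return tok

def tag_tokens(en_utt, cs_utt, entities=frozenset()):
    """Tag each token of a code-switched utterance as 'en', 'hi', or None (untagged)."""
    def norm(tok):
        return tok.strip(string.punctuation).lower()

    en_tokens = {norm(t) for t in en_utt.split()}
    en_lemmas = {simple_lemma(t) for t in en_tokens}
    tags = []
    for tok in cs_utt.split():
        t = norm(tok)
        if not t or t.isnumeric() or t in entities:
            tags.append(None)   # punctuation, numbers, named entities: excluded
        elif t in en_tokens or simple_lemma(t) in en_lemmas:
            tags.append("en")   # matches the source English utterance (possibly via lemma)
        else:
            tags.append("hi")   # remaining tokens default to Hindi
    return tags
```

On the example above, tag_tokens("a few years after we graduated", "graduation ke kuch saal baad") tags graduation as English via the shared lemma and the rest as Hindi.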
Code-switching statistics - Per the language tagging approach described above, 18.15% (13,760) of the utterances were monolingual, while the remaining utterances were code-switched, i.e., contained a combination of Hindi and English tokens. Table 4 has more detailed code-switching statistics of the corpus. We determine the matrix language (Myers-Scotton et al., 2002) of a sentence using the heuristics proposed in (Dhar et al., 2018). Namely, we define a sentence as Hindi if (a) the majority of its tokens are Hindi, (b) we detect the use of any Romanized Hindi verbs, or (c) we detect the use of Romanized Hindi bi-grams. Per this definition, of the 48,578 code-switched sentences, the matrix language of 45,644 was Hindi, and 2,934 were English.
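The first two conditions of this heuristic can be sketched as below. The verb-cue list is purely illustrative (the heuristic of Dhar et al. (2018) relies on proper lexicons), and the bi-gram check is omitted for brevity:

```python
from collections import Counter

# Illustrative Romanized Hindi verb cues; a stand-in for a real lexicon (assumption).
HI_VERB_CUES = {"hai", "hoon", "tha", "thi", "karo", "raha", "rahi"}

def matrix_language(tags, tokens):
    """Heuristic matrix language of a code-switched sentence:
    Hindi if Hindi tokens are the majority, or if a Hindi verb cue appears."""
    counts = Counter(t for t in tags if t)   # ignore untagged (language-agnostic) tokens
    if counts["hi"] > counts["en"]:
        return "hi"
    if any(tok.lower() in HI_VERB_CUES for tok in tokens):
        return "hi"
    return "en"
```

For instance, a sentence whose tokens are mostly English but which contains the verb hai would still be classified as having Hindi as its matrix language.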
Based on the matrix language and token-level language tags, we further analyzed the code-switching complexity using the metrics C_avg (Gambäck and Das, 2016), C_c (Banerjee et al., 2018), and the I-index (Guzman et al., 2016). These metrics quantify complexity in terms of the number of foreign-language tokens and switch points. On our corpus, we estimated C_c = 63.25, C_avg = 13.57, and an I-index of 0.14. For reference, (Gambäck and Das, 2016) applied these metrics to the different code-switching datasets from (Solorio et al., 2014) and observed that the English-Nepalese corpus had the highest level of switching, with a C_c value of 49.06 and a C_avg value of 7.98. These metrics show that GupShup has a high level of code-switching complexity.
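Of these metrics, the I-index is the simplest to illustrate. Under one common reading (treat this as our assumption, not the reference implementation), it is the fraction of consecutive language-tagged token pairs that constitute switch points:

```python
def switch_points(tags):
    """Number of positions where the language changes between consecutive tagged tokens."""
    tagged = [t for t in tags if t]   # drop language-agnostic (untagged) tokens
    return sum(a != b for a, b in zip(tagged, tagged[1:]))

def i_index(tags):
    """I-index sketch: switch points divided by the number of consecutive token pairs."""
    tagged = [t for t in tags if t]
    pairs = len(tagged) - 1
    return switch_points(tags) / pairs if pairs > 0 else 0.0
```

For the tag sequence hi hi en hi there are two switch points over three pairs, giving an I-index of 2/3.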

Empirical Benchmarks
In our experimental work, we employ the following abstractive summarization models: GPT-2, BART, PEGASUS, T5 (in both fine-tuned and multitask variants), mBART, and the multi-view seq2seq model (Chen and Yang, 2020). Most of these transformer models were pre-trained on English corpora, except for mBART, which was trained on multilingual data. We expect our results to serve as empirical benchmarks for future researchers.

We used 5,831/500/500 conversations as our train, dev, and test splits, respectively. We trained all models for three epochs, evaluating on the dev set after each epoch. Due to constraints on computing resources, we used a smaller mBART model with a 12-layer encoder and a 6-layer decoder. For the T5 model⁵, we tried both a fine-tuning and a multitask learning approach (T5 MTL); in the latter, we trained the model for both summarization and translation. We ran this entire process for three experimental setups: (1) En summaries from Hi-En conversations, (2) En summaries from En conversations, and (3) Hi-En summaries from Hi-En conversations. All models were trained using Hugging Face's transformers library on the GPU-enabled Google Colab platform. Model performances are reported in terms of the following automatic evaluation metrics: ROUGE (R1, R2, RL) (Lin, 2004), BLEURT (Sellam et al., 2020), BERTScore (Zhang et al., 2020), BLEU (Papineni et al., 2002), and METEOR (Banerjee and Lavie, 2005).

Results - The results are summarized in Table 5. In generating English summaries from Hi-En conversations, the mBART model obtained the best R1 and R2 scores, likely because it is the only model to have been trained on multiple languages and can therefore better handle code-switching points in the conversation. The multi-view model obtained the best RL, BLEURT, and BLEU scores. Though this model was pre-trained only on English data, it explicitly extracts conversational structure to help with the summary generation.
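As a point of reference for how the ROUGE scores above are computed, ROUGE-N can be sketched as clipped n-gram overlap between a candidate and a reference summary. This is a minimal F1 version; the official implementation adds stemming and other preprocessing options:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap F1 between a candidate and a reference summary."""
    c = ngrams(candidate.lower().split(), n)
    r = ngrams(reference.lower().split(), n)
    if not c or not r:
        return 0.0
    overlap = sum((c & r).values())   # clipped: counts capped at reference frequency
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

For example, "the cat sat" against "the cat slept" shares two of three unigrams on each side, giving ROUGE-1 F1 of 2/3.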
The T5 MTL model outperformed the T5 model across all metrics and also achieved the best METEOR score. We applied the Wilcoxon signed-rank test to the ROUGE-1 scores of the two T5 models and obtained a p-value of 7.1e-16 at a significance level of 5%, indicating a statistically significant difference. This observation suggests that the summarization task benefits from the translation task, owing to the multi-lingual and parallel nature of the dataset. We therefore expect that the other models could also benefit from a multi-task learning setup.

⁵ We also trained mT5. However, even after training for 30 epochs with early stopping, we only obtained a ROUGE-1 score of 31.06, a ROUGE-2 score of 9.41, and a ROUGE-L score of 25.33, which were lower than what we obtained using T5. On manually analyzing the results, we observed that the summaries produced by mT5 were not grammatically sound, contained many repetitions, mentioned the wrong entities, and omitted relevant points from the conversations. This was very surprising. Since we decided to keep only the top-performing models in the final results table and could not determine the reason behind mT5's poor performance, we did not pursue it further.
In the second experimental setup, where English summaries were generated from English conversations, we observed a significant improvement for all models across all metrics. This is expected, as most of these models were pre-trained on English data. The multi-view model obtained the best R1, BLEU, and METEOR scores, whereas the PEGASUS model obtained the best R2, RL, and BLEU scores. Once again, mBART obtained better scores than BART, but the difference is less pronounced than for Hi-En conversations. It is unclear whether the superior performance of mBART is due to multilinguality or because it was pre-trained on more English data. In the future, we would like to explore the use of the Devanagari script for the portions of the conversations that are in Hindi, as opposed to the Romanized script, since models like mBART have been exposed to the Devanagari script.
In both these experimental setups, we observed that the GPT-2 model obtained the lowest scores; further analysis revealed that the summaries generated by this model were not semantically meaningful and often contained repeated words. This could be attributed to the model's decoder-only architecture, which cannot create a meaningful representation of the source conversation (Rothe et al., 2020). Additionally, we observed minimal variance in BERTScore values across the models, suggesting it may not be the most effective metric for measuring the quality of summaries.
In the third experimental setup, we generated Hi-En summaries from Hi-En conversations. Here we observed a steep drop in the performance of all models compared to generating English summaries. However, this drop was less prominent for BART, which obtained the best scores across most metrics. Another interesting observation is that, despite being multilingual, mBART generated poor Hi-En summaries. Overall, from this experiment, we observe that most language models may not be capable of generating Hi-En summaries.

Human Evaluation - We conducted a human evaluation of the English summaries generated by all the models for a random selection of 100 Hi-En conversations. Following (Fabbri et al., 2020), we had three annotators rate the summaries on four metrics (consistency, coherence, fluency, and relevance) on a Likert scale of 1 to 5. The results, reported as average Likert scores normalized to the range 0 to 1, are summarized in Table 6.
The multi-view model obtained the best scores for consistency and relevance, whereas the PEGASUS model obtained the best scores for coherence and fluency. The mBART model also obtained comparable scores. These observations are reasonably consistent with the numbers in Table 5, except for PEGASUS, which did not obtain the highest scores on the automatic metrics but generated more coherent and fluent summaries.

Given the variety of automatic metrics for measuring the quality of summaries, we also wanted to understand how well they correlate with human judgments. Table 7 summarizes the Spearman correlation between all the automatic metrics and the human evaluations. All the ROUGE-based metrics and BLEURT were strongly correlated with human evaluations, but METEOR had a very low correlation. Perhaps, as a next step, we could consider fine-tuning BLEURT on this dataset to serve as a quality measure for summaries of code-switched conversations.
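For concreteness, the Spearman correlation used above is simply the Pearson correlation computed over ranks. A dependency-free sketch (with average ranks for ties, as in standard library implementations such as scipy's) is:

```python
def _ranks(xs):
    # 1-based ranks, averaging within tie groups.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A metric perfectly monotone with human scores yields a correlation of 1.0, and a perfectly inverted one yields -1.0.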

Qualitative Analysis
Error Analysis - To gain better insight into the generated summaries, we chose a random sample of 100 conversations, both English and Hi-En code-switched, and analyzed the English summaries generated by all the models. We classified each summary into one or more of the following four error categories: (a) missing information (MI): the generated summary is missing a salient aspect mentioned in the reference summary; (b) erroneous reference (ER): one of the speakers is incorrectly associated with an action or location; (c) incorrect inference (II): the summary contains a statement that cannot be inferred or reasoned from the conversation; and (d) wrong pronouns (WP): the summary misrepresents a speaker's gender. These classes were not predetermined but grew out of the error analysis. Table 8 shows an example of a summary generated from a Hi-En conversation that exhibits different types of errors; the example is reproduced below.

Randy: Natalie, we're dead, look what happened!?
Natalie: What's wrong?
Randy: <file_video>
Natalie: No way, how did Sally and Molly get to the parent's bedroom?!!
Randy: I've no idea, someone must have left the door open
Natalie: It looks really nasty! Why are all the sheets so dirty?
Randy: I think they both went to play in the mud and then somehow ended up here
Natalie: Jesus Christ, mum is going to be really mad! Start cleaning it
Randy: I know, but it's hopeless, parents might here any minute
Natalie: Just do it, I'll be back as soon as I can!
Reference summary: Sally and Molly possibly played in the mud and got to parent's bedroom, where they made a mess. Randy will start cleaning and Natalie will join him as soon as she can, because parents might be here any minute.
Summary from Hi-En conversation: Natalie's parents are dead. Randy thinks Sally and Molly are in their parents bedroom. Natalie thinks they're dead.
Summary from En conversation: Sally and Molly got into the parent's bedroom. The sheets are dirty. Randy thinks they went to play in the mud and then ended up here.

The results are summarized in Table 9. As with the automatic metrics (Table 5), the error analysis also shows that the models summarize English conversations better. We observe that all models miss important information when summarizing; however, this was more pronounced with Hi-En conversations, suggesting that code-switching adds more complexity to the summarization process. Interestingly, the T5 model has the fewest MI errors when summarizing English conversations. The multi-view model had the fewest ER errors, likely because it explicitly tracks speakers and is therefore less likely to associate them with incorrect actions. The II errors increased significantly for all the models when summarizing Hi-En conversations, further demonstrating the models' difficulty in understanding code-switched conversations. We noticed that even though the models select salient aspects of the conversations, the summaries contain incorrect inferences because some of the critical tokens in the conversation were code-switched.

Summary Compression - We measured the compression ratio of the summaries generated by all models for both En and Hi-En conversations, as shown in Figure 2. Per (Grusky et al., 2018b), the compression ratio is defined as the ratio of the number of tokens in the conversation to that in the summary. We observe that all the models generate shorter summaries for Hi-En conversations. This is understandable considering that most of these models are pre-trained on English data; when exposed to Hi-En conversations, the English content becomes less frequent and fragmented, causing the models to under-generate or even stop early. Compared to the reference summaries, BART, mBART, and PEGASUS generate more compressed summaries, whereas T5 and GPT-2 generate longer summaries. This could also explain why T5 has fewer MI errors.
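This definition of compression ratio is straightforward to compute; a minimal sketch over whitespace-tokenized text (real evaluations would use the same tokenizer as the model):

```python
def compression_ratio(conversation, summary):
    """Compression ratio per (Grusky et al., 2018b): number of conversation tokens
    divided by number of summary tokens. Higher means a more compressed summary."""
    return len(conversation.split()) / len(summary.split())
```

A ten-token conversation with a two-token summary thus has a compression ratio of 5.0, so a model that under-generates on Hi-En input shows up as a higher ratio.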
Quantifying Abstractiveness - During our error analysis, we also observed that the summaries generated from English conversations were often more extractive than abstractive, with the models selecting the most salient phrases from the source conversations. Following (Narayan et al., 2018b), we quantified the abstractiveness of the summaries by calculating the proportion of novel n-grams generated by the models for English. Namely, we report the ratio of n-grams that appear in the summaries but not in the source conversations to the total number of n-grams in the summaries. Compared to the reference summaries, all the models are more extractive in nature. Of the five models here, T5 generated the most novel n-grams despite not introducing many new uni-grams, which is consistent with its low compression ratio.
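The novelty ratio can be sketched as follows; we use a set-based reading of the n-gram overlap here, which is a simplification of the metric as applied in the evaluation:

```python
def novel_ngram_ratio(summary_tokens, source_tokens, n=1):
    """Fraction of distinct summary n-grams that never appear in the source.
    Higher values indicate a more abstractive summary."""
    summ = {tuple(summary_tokens[i:i + n]) for i in range(len(summary_tokens) - n + 1)}
    src = {tuple(source_tokens[i:i + n]) for i in range(len(source_tokens) - n + 1)}
    return len(summ - src) / len(summ) if summ else 0.0
```

A purely extractive summary scores 0.0 at every n, while a summary that paraphrases heavily scores close to 1.0 for larger n.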
Model Diversity - Since we have employed different language models in our work, we wanted to understand how similar the summaries generated by these models are. To this end, we measured the ROUGE-1 and ROUGE-4 scores between the summaries from each pair of models. The results are summarized in Figure 5. We observe that the ROUGE-1 scores are significantly higher than the ROUGE-4 scores for both En and Hi-En conversations, suggesting that the models choose a similar set of tokens for summarization. However, the word order is very different, suggesting that the eventual summaries differ. Table 10 shows an example of one such conversation and its summaries. Also, the ROUGE-1 and ROUGE-4 scores between the models are higher than those with respect to the reference summaries. These findings are similar to those reported in (Kryscinski et al., 2019). Lastly, as with the reference summaries, the pairwise ROUGE scores are higher for En than for Hi-En conversations.

Conclusions and Future Work
In this work, we presented the first code-switched conversation summarization dataset - GupShup, which has over 6,800 multi-party open-domain conversations and their corresponding summaries in En and Hi-En. We quantified the performance of various state-of-the-art neural models using both automatic metrics and human evaluation. mBART and the multi-view model obtained the best scores on automatic metrics, whereas PEGASUS and mBART obtained the best scores in the human evaluation. The T5 model generated relatively longer summaries and was therefore effective in capturing the salient aspects of the conversation. ROUGE-based metrics and BLEURT were highly correlated with human judgments, but METEOR, BERTScore, and BLEU proved relatively ineffective for this task. We also observed that the multi-task learning setup showed promise, and we would like to explore this further.
We conducted an extensive qualitative analysis of the summaries, which provided very interesting insights into the different language models. In the future, we would like to explore new techniques for overcoming the various challenges observed in this work and to dive deeper into understanding the failures of different language models when applied to code-switched text.

... will help the models in making sense of the conversations, as they are all trained only on English corpora, except for mBART. It is worth noting that mBART, our only multilingual model, has a negligible positive correlation of 0.07, suggesting that it is probably also using other signals to make sense of the code-switched conversations, benefiting from its multilingual pretraining. Figure 6 shows the correlation between several conversation-summary pair attributes and the performance of the models when generating English summaries from En conversations. We see that PEGASUS, BART, mBART, and the multi-view seq2seq model are all negatively correlated with the average utterance length of the conversation, the number of turns in the conversation, the number of turns per participant, the length of the conversation, and the compression ratio of the gold summary with respect to the conversation to be summarized. T5 remains largely uncorrelated with any of the conversation-summary attributes.
A.3 Sample for the Error Analysis performed in section 6

Table 17 shows the difference in the errors in summaries generated from the same conversation: one from the respective En conversation and one from the Hi-En code-switched conversation, as mentioned in section 6.
A.4 Hyperparameters for models trained for summarization of Hi-En code-switched conversations to En summaries

Table 18 reports the hyperparameters of the models trained for summarization of Hi-En conversations to English summaries.

Table 17: An example of conversation summarization where the same model summarizes the given example. On the left, the model generates an English summary from the respective English conversation; on the right, the model generates an English summary from the respective Hindi-English code-switched conversation. Both are erroneous summaries but differ in the error classes to which each belongs. Summary A, which is generated from the respective English conversation, contains sentences that are irrelevant and misses relevant information, exhibiting the MI (relevant information is missing) error class mentioned in section 6; the irrelevant lines in the summary are highlighted. Summary B, which is generated from the respective Hi-En code-switched conversation, also misses relevant information (MI) but additionally shows a case of intrinsic hallucination; the phrases exhibiting intrinsic hallucination are highlighted.

Table 18: Training details of the summarization models.

Figure 6: Correlation between the models' ROUGE-2 scores and the attributes of conversation-summary pairs that could affect performance; this heatmap represents the correlation for English summaries generated from English conversations.

Figure 7: Correlation between the models' ROUGE-2 scores and the attributes of conversation-summary pairs that could affect performance; this heatmap represents the correlation for English summaries generated from Hindi-English conversations.