ORCHID: A Chinese Debate Corpus for Target-Independent Stance Detection and Argumentative Dialogue Summarization

Dialogue agents have been receiving increasing attention for years, and this trend has been further boosted by the recent progress of large language models (LLMs). Stance detection and dialogue summarization are two core tasks of dialogue agents in application scenarios that involve argumentative dialogues. However, research on these tasks is limited by the insufficiency of public datasets, especially for non-English languages. To address this language resource gap in Chinese, we present ORCHID (Oral Chinese Debate), the first Chinese dataset for benchmarking target-independent stance detection and debate summarization. Our dataset consists of 1,218 real-world debates conducted in Chinese on 476 unique topics, containing 2,436 stance-specific summaries and 14,133 fully annotated utterances. Besides providing a versatile testbed for future research, we also conduct an empirical study on the dataset and propose an integrated task. The results show the challenging nature of the dataset and suggest the potential of incorporating stance detection into summarization for argumentative dialogue.1


Introduction
Recent developments in large language models (LLMs) have pushed the general interest in dialogue agents to a new level, and increasingly powerful LLMs such as the GPT series have demonstrated promising capabilities across multiple application scenarios. Among the various tasks assigned to dialogue agents, engaging in argumentative dialogues (Macagno, 2000; Walton, 2008) has long been a challenging one. Regardless of the specific aim, whether winning a debate (Zhang et al., 2016), convincing people (Prakken et al., 2020), or opening up minds (De Kock and Vlachos, 2021; Farag et al., 2022), dialogue agents rely on two foundational abilities: stance detection and summarization. Stance detection aims to reveal the attitudes of arguments, and the goal of summarization is to collect and condense information in order to build arguments. The two collaboratively support comprehending and developing arguments; consequently, both abilities are crucial for engaging in argumentative dialogues (Chen et al., 2017; Lawrence and Reed, 2019; Wang and Wan, 2021; Sanders et al., 2022).
In general natural language processing (NLP), stance detection classifies the stance ('favor', 'against' or 'none') of a piece of comment with respect to a target entity or claim (Hasan and Ng, 2013; Nguyen et al., 2016; Küçük and Can, 2020; Hardalov et al., 2022). Text summarization, the other task, aims to compress information from a large piece of text and produce a concise and comprehensible digest (Gillick et al., 2009; Shang et al., 2018). Dialogue summarization, fittingly, takes dialogues as the source text.
However, some unique features of argumentative dialogues pose atypical challenges for the two tasks. Regarding summarization, argumentative dialogues such as debates, meetings and online forum discussions often contain contradictory utterances with conflicting stances (Zou et al., 2021), making them more convoluted to summarize. Also, in comparison with written text, spoken dialogues naturally carry more noise, such as mispronunciations, rephrasing and repeated words, that obstructs summarization. Meanwhile, unlike typical target-specific stance detection whose targets are explicit entities (e.g., 'Metaverse'), stance detection on argumentative dialogues is target-independent, meaning that the targets are claims in the form of complete sentences (e.g., 'The commercial value of the metaverse is overestimated').
Despite the progress made on these tasks, the research community's efforts have been slowed by an insufficiency of proper language resources. Existing dialogue summarization datasets are dominantly in English (Gliwa et al., 2019; Durmus and Cardie, 2019; Roush and Balaji, 2020; Fabbri et al., 2021; Chen et al., 2021). Among non-English summarization datasets, prior resources in Chinese either focus on one specific domain (Song et al., 2020; Zou et al., 2021; Lin et al., 2021; Huang et al., 2020) or lack stance-specific summaries (Feng et al., 2022). There is still a lack of multi-domain, annotated Chinese dialogue summarization datasets. Moreover, most existing stance detection datasets are target-specific regardless of language (Alturayeif et al., 2023). Overall, in terms of Chinese language resources, the number of datasets suitable for argumentative dialogue summarization is highly limited, and there is currently no benchmark for target-independent stance detection.
To remedy this shortage and facilitate related research, we present ORCHID (Oral Chinese Debate), to the best of our knowledge the first Chinese debate summarization dataset annotated with multi-granularity, stance-specific summaries, as well as the first Chinese benchmark for target-independent stance detection. Our dataset consists of 14,133 fully annotated utterances and 2,436 stance-specific summaries from 1,218 real-world debates conducted in Mandarin. We employed an automatic speech recognition (ASR) system to transcribe the raw data, followed by manual post-correction and annotation. We provide debate summaries on two levels of granularity: short, concise statements and long, comprehensive summaries for both stances. Stances and debaters are labelled at the utterance level. Furthermore, we conduct a preliminary empirical study on the constructed data. By sharing this novel dataset, we hope to support the research community in tackling tasks including dialogue summarization, stance detection and other argument mining tasks.
In summary, our contributions are three-fold: (1) we introduce ORCHID, the first Chinese dataset for debate summarization and target-independent stance detection; (2) we propose a new integrated task, stance-specific summarization, which the experiment results suggest can improve summarization of argumentative dialogues; and (3) we conduct preliminary experiments, benchmarking classical and newly-suggested methods on our dataset, reporting the corresponding results, and setting initial baselines for future work.

Related Work: Existing Datasets
We review existing dialogue summarization and stance detection datasets, as presented in Tables 1 and 2. Among dialogue summarization datasets, we observe a major imbalance in quantity between argumentative and non-argumentative ones. Also, argumentative dialogue summarization corpora are primarily meeting transcripts (Kumar and Kabiri, 2022), and non-English ones are rare. Stance detection studies suffer from a lack of datasets in Chinese, particularly target-independent ones.

Stance Detection Datasets
Most previous stance detection studies can be grouped into two categories by target-comment dependency: (1) target-specific and (2) target-independent (Küçük and Can, 2020).

Target-Specific Datasets Prior studies are primarily on target-specific stance detection, and the sources are centered on social media posts. SemEval-2016 (Mohammad et al., 2016) introduced a stance detection shared task on a dataset of 4,163 tweets. A few more shared task datasets followed (Derczynski et al., 2017; Gorrell et al., 2019). Regarding Chinese datasets, NLPCC-2016 (Xu et al., 2016) presented a shared task similar to SemEval-2016 and released a dataset of 4,000 Chinese microblogs. CSD (Li et al., 2022) is newly released and contains 5,876 labelled website comments in Chinese on COVID-19 vaccination.
Target-Independent Datasets Turning now to target-independent stance detection datasets, Ferreira and Vlachos (2016) introduced the 'Emergent' dataset, which consists of 2,595 comments (news article headlines) on 300 claims. IBM Debater (Bar-Haim et al., 2017) leverages 2,394 Wikipedia articles and labels their stances on 55 claims. More recently, Chen et al. (2019) collected 11,164 website comments on 907 claims. IAM (Cheng et al., 2022) is also created by sourcing Wikipedia articles. In addition, Durmus and Cardie (2019) constructed a large English online debating corpus with stance labels. Beyond language, our dataset differs from theirs in having an additional 'mixed' stance, stance-specific summaries, and spoken-style text (Durmus and Cardie's (2019) online debates are conducted in written threads rather than orally).
To the best of our knowledge, there is currently no Chinese target-independent stance detection dataset available. Also, the source texts of existing stance detection datasets are dominantly written rather than spoken, so our spoken-style corpus could be a rare supplement to the language resource pool.

Dialogue Summarization Datasets
Dialogues range from daily chitchat to formal debate, and researchers have introduced diverse dialogue summarization datasets. The SAMSum dataset (Gliwa et al., 2019) greatly accelerated this field by proposing a large-scale dataset of fictitious chat dialogues created by linguists. English data resources that leverage other forms of dialogue have also emerged: online forum posts (Khalman et al., 2021), interviews (Zhu et al., 2021), debate evidence documents (Roush and Balaji, 2020), daily conversations (Chen et al., 2021), and screenplay transcripts (Chen et al., 2022).

Argumentative Dialogue Summarization Datasets
Narrowing the scope, previous English corpora for argumentative dialogue summarization have focused heavily on meetings. AMI (Carletta et al., 2005) and ICSI (Janin et al., 2003) are two early and widely-used meeting corpora. More recently, ConvoSumm (Fabbri et al., 2021) presents 500 conversations of multiple types: news article comments, discussion forums, community question answering and email threads. QMSum (Zhong et al., 2021) uniquely developed 1,808 query-summary pairs on 232 meetings. Wu et al. (2023) introduced a Chinese dataset of 123 meetings annotated with segment-wise summaries. ELITR (Nedoluzhko et al., 2022) is a minuting corpus that consists of 120 English and 59 Czech meetings.
Despite the work made so far, there remains a great imbalance between English and non-English dialogue summarization datasets. We also sense an urgency to expand the diversity of argumentative dialogue corpora beyond meeting transcripts.

Creating ORCHID
Having reviewed existing stance detection and dialogue summarization corpora, we set out to construct a new versatile dataset that serves both tasks. To this end, we introduce ORCHID (Oral Chinese Debate), as the name implies, a corpus that features oral debates in Chinese.
We select oral debates in competition scenarios as our source for the following reasons: (1) debates are highly argumentative and thematic, and the stances of both sides are clearly stated and not subject to change; (2) debate utterances are of high quality in terms of logic and rhetoric (Zhang et al., 2016), and such utterances are much less colloquial than daily conversations while retaining partial oral features; (3) since existing stance detection corpora are predominantly written texts, utterances with spoken styles and expressions, which debates offer, could be a valuable addition.
The construction of our dataset consists of four major stages: data collection, ASR-aided transcription, manual annotation, and quality control. We first collect videos from public sources. Next, we employ an ASR system to obtain raw transcripts, followed by automatic stance labelling and manual correction. Finally, we extract and construct stance-specific position statements and conclusive summaries. We also apply several quality control measures throughout the process.

Data Collection
A pilot screening of publicly accessible data reveals an absence of official transcripts for most debate competitions, so we instead utilize original debate videos as our primary sources. From major video sharing platforms where the competition organizers regularly release match videos (see Appendix E), we harvest a total of 1,704 debate videos across 59 competitions conducted in Mandarin Chinese. After an elaborate filtering process (see §3.4), we retain videos from 1,218 debates across 30 competitions.

ASR-aided Transcription
We employ a commercially established automatic speech recognition (ASR) system developed by iFLYTEK (see Privacy and Licensing) to obtain raw machine transcriptions, and our annotators then manually post-correct lexical errors and misplaced punctuation. We generally follow Mirkin et al.'s (2018) pipeline to conduct the ASR-aided transcription from source videos to texts.
More specifically, we first apply the ASR system to the audio tracks to obtain raw machine transcripts. Second, content other than the debaters' utterances, such as accidental interruptions and post-debate comments, is discarded. However, we preserve the debate adjudicators' announcements, such as pregame introductions and topic statements, for further utilization in the annotation stage. Finally, our annotators manually proofread and correct any lexical and syntactic errors, including wrongly recognized words and misplaced punctuation, to create a fully-cleaned transcript.

Automated and Manual Annotation
The annotation consists of four major steps: (1) extracting topics and rephrasing them into position statements, (2) segmenting debates by utterances, (3) labelling utterances with stance and debater, and (4) post-editing transcripts to produce reference summaries.
The annotation is mostly fact-grounded, since the truthfulness of topic extraction and utterance labelling can be verified against the original videos and hence unanimously agreed upon. We program scripts to obtain a preliminary segmentation and labelling, followed by a manual pass to correct wrongly split or labelled utterances. Since some topics involve multiple domains, the number of domain labels per topic is not restricted to one.
Topic Extraction Conventionally, debate competitions in the English-speaking world phrase their topics (or 'motions') as confirmatory assertions (e.g., 'Technology is ethically neutral'), which naturally are position statements. However, most competitions in the Chinese-speaking world phrase their topics as binary-choice questions (e.g., 'Whether technology is ethically neutral?'). Therefore, our annotators manually rephrase question-style topics into position statements for both sides.
Debate Segmentation Debate adjudicators' announcements often contain specific signal phrases (e.g., 'Let us now invite the Third Proposition Speaker to deliver the rebuttal') that can serve as round-change indicators. We automate the preliminary segmentation with scripts that detect these announcements. The annotators then correct missed or wrongly split utterances.
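As a rough illustration of this signal-phrase approach, the following sketch splits a flat transcript into rounds at adjudicator announcements. The signal phrases and function names are illustrative stand-ins, not the actual scripts or announcement wording used for the dataset.

```python
import re

# Hypothetical adjudicator signal phrases (illustrative, not the real wording).
SIGNAL_RE = re.compile(r"Let us now invite|We now move to")

def segment_transcript(lines):
    """Split a flat transcript into rounds at adjudicator announcements."""
    rounds, current = [], []
    for line in lines:
        if SIGNAL_RE.search(line):
            # An announcement closes the current round; the announcement
            # itself is dropped, since only debaters' utterances are kept.
            if current:
                rounds.append(current)
            current = []
        else:
            current.append(line)
    if current:
        rounds.append(current)
    return rounds
```

In practice such a pass only yields the preliminary segmentation; as described above, annotators still correct missed or wrongly split utterances manually.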
Stance and Debater A typical debate match consists of multiple monologue rounds and discussion (or 'rebuttal') rounds. An utterance in a debate match refers to a piece of time-constrained speech containing multiple sentences, uttered by either one side (monologue rounds, labelled 'pro' or 'con') or both sides (discussion rounds, labelled 'mixed').
Although we considered separating the sentences of discussion rounds by pro and con sides, we decided against it for two reasons. First, automatic speaker diarization methods (Park et al., 2022) yielded unacceptable results (accuracy below 60%), and a manual sentence-level approach would be highly labor-intensive. More importantly, we argue that a discussion containing utterances from both sides is a complete linguistic unit: singling out and concatenating sentences from one side is likely to yield incoherent and inconsistent content. Therefore, we keep discussion rounds intact and label them 'mixed'.
Again, the annotators utilize the adjudicators' announcements to label the segmented utterances with stance and debater. The announcements are removed after annotation, retaining only the debaters' utterances.
Reference Summaries We construct reference summaries on two levels of granularity: (1) short and concise position summaries; and (2) long comprehensive stance-specific summaries.
The short stance-specific summaries are created by directly adapting the position statements. For instance, if the pro side position statement is 'technology is ethically neutral', then the short pro-specific summary is 'The pro side argues that {pro side position statement}'. By concatenating the stance-specific summaries from both sides, we obtain the short overall summary: 'The pro side argues that {pro side position statement}, and the con side argues that {con side position statement}.'

The long stance-specific summaries are derived from the closing remarks of the debates. In the last round of a formal debate, a debater from each side is expected to deliver a summative and comprehensive remark that covers the key statements and arguments of their team. We asked our annotators to manually post-edit these remarks (e.g., remove greetings and change first person to third person) with minimal changes to the contents (see §3.4). The long overall summaries are created following a similar pattern to their short counterparts: 'The pro side argues that {pro-specific summary}. The con side argues that {con-specific summary}.'
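The template-based construction of the short summaries can be sketched as follows. The English wording is illustrative (the actual dataset uses Chinese phrasing), and the function names are ours.

```python
def short_stance_summary(side, statement):
    """Adapt a position statement into a short stance-specific summary."""
    return f"The {side} side argues that {statement}"

def short_overall_summary(pro_statement, con_statement):
    """Concatenate both sides' statements into a short overall summary."""
    return (f"The pro side argues that {pro_statement}, "
            f"and the con side argues that {con_statement}.")
```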

Quality Control
To ensure the quality of the dataset, we filter out videos by the following criteria: (1) we drop videos that lack complete debate contents; (2) any video without human-recognizable audio is also discarded, because manual post-correction on such videos is impracticable; (3) non-standard matches are likewise excluded (see Appendix A).
The stance annotation is done independently by two groups of four annotators (randomly selected from a pool of 12). Members within a group are required to reach unanimous decisions on all instances, so the two groups can be viewed as two collective annotators. Hence, we calculate Cohen's kappa (Cohen, 1960) to evaluate inter-annotator agreement. We obtain a Cohen's kappa close to 1, denoting very high agreement between the two collective annotators. In fact, there were only 5 instances on which the two groups initially disagreed, and both groups reached consensus after a round of review. The near-perfect agreement is expected since the labelling is fact-grounded.
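For reference, Cohen's kappa between two (collective) annotators can be computed as in this standard-definition sketch; this is not the script used during annotation.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same instances."""
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label
    return (p_o - p_e) / (1 - p_e)
```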
Regarding reference summary construction, to avoid personal style preferences and bias, we instructed annotators to minimize their editing by limiting changes to removing greetings and shifting address to the third person (e.g., from 'we/I/you' to 'the pro/con side'). Additionally, two annotators were randomly selected to double-check the correctness and overall consistency of the post-editing made by the other annotators.

Dataset Overview
The resulting dataset is summarized in Table 3. Each debate has a pair of stance-specific summaries and an overall summary that contains both. Hence, the 1,218 debates have a total of 2,436 stance-specific summaries and 1,218 overall summaries. Note that these statistics may be subject to change (see Appendix B).

Benchmark and Results

Having constructed ORCHID, we conduct an empirical study to benchmark the performance of some existing methods on three challenging tasks against our dataset: (1) stance detection; (2) abstractive summarization; and (3) stance-specific summarization, a new integrated task that we propose.
To address the above tasks, we split our data as summarized in Table 4. For Task 1, a total of 14,091 labelled utterances are split roughly 78%/11%/11%. To avoid over-fitting, we ensure that utterances from debates on the same topics do not appear in both the train and test sets.
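A topic-disjoint split of this kind can be sketched as below; the grouping logic is illustrative, and the actual split may have been constructed differently.

```python
import random

def topic_disjoint_split(debates, ratios=(0.78, 0.11, 0.11), seed=0):
    """Split (topic, debate) pairs so debates sharing a topic stay together.

    Ratios are approximate, since whole topic groups are assigned atomically.
    """
    by_topic = {}
    for topic, debate in debates:
        by_topic.setdefault(topic, []).append(debate)
    topics = sorted(by_topic)
    random.Random(seed).shuffle(topics)
    n = len(debates)
    train, val, test = [], [], []
    for topic in topics:
        if len(train) < ratios[0] * n:
            train.extend(by_topic[topic])
        elif len(val) < ratios[1] * n:
            val.extend(by_topic[topic])
        else:
            test.extend(by_topic[topic])
    return train, val, test
```

Because whole topic groups are moved at once, no topic (and hence no debate on it) can leak from the train set into the test set.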

Task 1: Stance Detection
Task Definition Let D = {U_i = (c_i, s_i)}_{i=1}^n be a debate of n utterances. Each utterance U_i consists of a piece of text c_i (the utterance) and a stance s_i. Let t be a designated target claim (a position statement). Given t and c_i, the task aims to predict a stance ŝ_i ∈ {pro, con, mixed} for each U_i.
Experiment Setup As shown in Table 4, the stance labels of utterances are imbalanced (31%/31%/38% for 'Pro'/'Con'/'Mixed'). Hence, the task is a 3-way classification over imbalanced data, with each utterance carrying a single stance label. Following Cheng et al. (2022), we report both overall accuracy and per-class (stance) F1 scores.
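The reported metrics follow their standard definitions, which can be computed as in this minimal sketch (not the exact evaluation script):

```python
def stance_metrics(gold, pred, labels=("pro", "con", "mixed")):
    """Overall accuracy and per-class F1 for 3-way stance classification."""
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    f1 = {}
    for lab in labels:
        tp = sum(g == p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
        f1[lab] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1
```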
We experiment with two well-established pre-trained models: (1) BERT (Devlin et al., 2019) and (2) RoBERTa (Liu et al., 2019). Specifically, we implement MacBERT-base (Cui et al., 2020), an improved BERT that uses MLM (masked language model) as correction as its pre-training task, which mitigates the discrepancy between pre-training and fine-tuning. Regarding RoBERTa, we use RoBERTa-wwm-ext (Cui et al., 2021), a Chinese pre-trained BERT with whole word masking. We fine-tune the models on the train set, adjust hyper-parameters on the validation set, run three random seeds on the test set, and report the average results.
Considering that autoregressive LLMs have recently demonstrated unprecedented performance on many NLP tasks, we also devise two direct prompting methods based on GPT-3.5 (OpenAI, 2022): (3) Zero-shot Prompting, a direct prompting method with minimal instructions, and (4) Few-shot Prompting (Brown et al., 2020), which adds a few utterances (three in our case) with correctly classified stances as examples in the prompt (full prompts can be found in Appendix F). Only the test set is used; we run three times and report the average results.
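A few-shot prompt of this shape could be assembled as follows; the wording below is a hypothetical stand-in, not the actual prompts from Appendix F.

```python
def build_fewshot_prompt(target, examples, utterance):
    """Assemble a few-shot stance-classification prompt.

    `examples` is a list of (utterance, stance) pairs used as demonstrations.
    """
    lines = [
        f"Target claim: {target}",
        "Classify each utterance's stance toward the claim as pro, con, or mixed.",
        "",
    ]
    for ex_utt, ex_stance in examples:
        lines.append(f"Utterance: {ex_utt}")
        lines.append(f"Stance: {ex_stance}")
        lines.append("")
    # The final utterance is left for the model to complete.
    lines.append(f"Utterance: {utterance}")
    lines.append("Stance:")
    return "\n".join(lines)
```

Zero-shot prompting corresponds to calling this with an empty `examples` list.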
Results As summarized in Table 5, the direct prompting methods on an autoregressive LLM (GPT-3.5) outperform the fine-tuned bidirectional models, BERT and RoBERTa, by a large margin. This may be partially due to the complexity of the task and the limited training data available. In addition, we observe that few-shot prompting boosts GPT-3.5 further on the task. Interestingly, the F1 scores on the 'Mixed' label are better than on the other labels, which may suggest that it is easier to detect conflicting features in an utterance than to identify a particular stance.

Task 2: Abstractive Summarization

Experiment Setup We follow previous works (See et al., 2017; Gliwa et al., 2019; Roush and Balaji, 2020) in carrying out our experiments. We choose the well-established ROUGE (Lin, 2004) scores as automatic evaluation metrics and report standard F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L. We take the overall summaries as gold references Y_gold_overall. Besides automatic evaluation, we also conduct a human evaluation on the generated summaries.
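As a simplified illustration of the metric, character-level ROUGE-1 F1 (a common granularity for Chinese text) can be computed as below; the reported scores use the standard ROUGE implementation rather than this sketch.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Character-level ROUGE-1 F1 between a reference and a candidate string."""
    ref, cand = Counter(reference), Counter(candidate)
    # Unigram overlap, counting each character at most as often as it
    # appears in both strings.
    overlap = sum((ref & cand).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```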

We benchmark two classic extractive methods, three fine-tuning abstractive methods and three direct prompting methods (specifically gpt-3.5-turbo-16k-0613) on our dataset. However, the length of a single debate (18,107 Chinese characters on average) exceeds the input size limits of most pre-trained models. We heuristically propose several approaches to address this issue (see full prompts in Appendix F): • Lead-K: a widely-used simple baseline taking the leading k sentences (See et al., 2017). We set k to 3 for benchmarking.
• TextRank-K (Mihalcea and Tarau, 2004): a graph-based ranking model selecting k key sentences based on keyword extraction. We set k to 3 for benchmarking.
• SUMM^N (Zhang et al., 2022): a multi-stage framework that adopts a coarse-to-fine approach and specializes in handling long inputs.
• DIALOGLM (Zhong et al., 2022): a pre-trained model that features a window-based de-noising approach. The model also combines sparse and conventional attention to process long inputs.
• Divide-and-Summarize: we prompt GPT-3.5 to obtain an individual summary of each utterance, and then integrate the summaries to form a complete summary (Koay et al., 2021).
• Accumulative Context Enhanced Divide-and-Summarize: similar to the former method, but we provide an accumulative summary of the preceding part of the debate as additional context.
• Iterative Revision: we first prompt the model to summarize the first utterance; we then provide both the previously generated summary and a new utterance and ask the LLM to revise the summary accordingly. The process repeats until all utterances have been viewed by the model (Zhang et al., 2023).
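The Iterative Revision control flow can be sketched as follows, with a stub standing in for the GPT-3.5 revision call; the function names are ours, not the paper's implementation.

```python
def iterative_revision(utterances, revise):
    """Fold utterances one by one into a running summary.

    `revise(summary, utterance)` stands in for an LLM call that updates
    the summary in light of one new utterance.
    """
    summary = revise("", utterances[0])
    for utt in utterances[1:]:
        summary = revise(summary, utt)
    return summary
```

Because each step only sees the current summary and one utterance, the prompt stays short regardless of total debate length.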

Results
We report both automatic metrics and overall human evaluation scores in Table 6. The evaluators were asked to rate the generated summaries on four aspects: conciseness, fluency, faithfulness and informativeness. An overall average is calculated.
We observe that the abstractive methods (HMNet, SUMM^N and DIALOGLM) fine-tuned on our dataset generally perform better than the extractive models Lead-3 and TextRank-3. The GPT-3.5 direct prompting approaches exhibit the most satisfactory performance. Although direct LLM prompting yields better results than fine-tuning, there is still large room for improvement. Among the three prompting methods, Iterative Revision achieves the best overall results.

Task 3: Stance-specific Summarization
An overall summary of an argumentative dialogue, such as a debate or meeting, is time-consuming to digest for readers who wish to directly capture stance-specific information. Motivated by this intuition, we propose an integrated task that combines Tasks 1 and 2.
Task Definition Given a debate D = {U_i = (c_i, s_i)}_{i=1}^n, where each utterance U_i consists of a piece of text c_i and a stance s_i, and a designated stance s_0 ∈ {pro, con}, the task is to (1) produce a stance-specific subset D_{s_0} = {U_i ∈ D : s_i = s_0}, and (2) generate a stance-specific summary Y_{s_0} based on D_{s_0}.
Experiment Setup For the abstractive methods, we apply a pipeline strategy: we first fine-tune the models (HMNet, SUMM^N and DIALOGLM) on stance-specific utterances and their corresponding stance-specific summaries. Next, we use GPT-3.5 few-shot prompting (its best-performing setup in Task 1) to distinguish stances and create a stance-specific subset of the test set, and then request the models to summarize those utterances. For the direct prompting methods, we simply add an instruction asking the model to distinguish the stances of the given utterances and use only those whose stance matches the designated one (see Table 12).
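Abstractly, the pipeline strategy can be sketched as below; `classify` and `summarize` are stand-ins for the stance detector (e.g., few-shot GPT-3.5) and the summarization model, and the function names are illustrative.

```python
def stance_specific_summary(utterances, target_stance, classify, summarize):
    """Task 3 pipeline: keep utterances whose predicted stance matches,
    then summarize the resulting subset."""
    subset = [c for c in utterances if classify(c) == target_stance]
    return summarize(subset)
```

Note that any classification error propagates into the summarization step, which is the failure mode discussed in the results below.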

Results
The weak evaluation results, as summarized in Table 7, establish that the dataset is very challenging and that the proposed stance-specific summarization task is worthy of future exploration on argumentative texts. End-to-end direct prompting methods outperform the pipeline abstractive methods.
While the comparability of results between the two summarization tasks requires further examination, we single out the R-1 scores of SUMM^N (for its higher results than the other two abstractive methods) and the direct prompting methods in both summarization tasks for a closer comparison, shown in Figure 3. We observe that, for the three abstractive fine-tuning methods (HMNet, SUMM^N and DIALOGLM), the stance-specific summarization metrics are lower than those in the overall summarization task. This may be due to imperfect accuracy in the preceding stance detection step. On the other hand, the performance of the direct prompting methods improves (especially Iterative Revision). This suggests that a highly accurate stance detection pass over the source text before summarizing is likely to benefit summarization of argumentative dialogues, while an inaccurate one may do the opposite.

Conclusions
We have presented ORCHID, a novel dataset for target-independent stance detection and argumentative dialogue summarization in Chinese. Our dataset features rhetorical real-world debates, multi-domain topics, stance-annotated utterances and stance-specific summaries, inviting future utilization to advance various NLP tasks. We benchmark several baseline stance detection and summarization methods on our dataset. Furthermore, we propose a new integrated task that shows potential for improving summarization of argumentative dialogues. Future work could include devising new models to obtain higher stance detection accuracy, designing better metrics to evaluate summarization quality, and developing methods to scale up and further augment this dataset.

Limitations
We are aware of some limitations of ORCHID.
Data Bias Admittedly, despite our efforts, potential bias may have been introduced by the demographics of the annotators (see Table 8), and we acknowledge this as a limitation of the annotation process. Since the selected debates are drawn from inter-university debate competitions from 2014 to 2023, all utterances were made by students at higher education institutions, and most of the debaters are of East Asian origin.

Summary Formation We argue that the final remarks of the last round of a debate match in this dataset can serve as comprehensive summaries of the whole debate, for two reasons. First, this is supported by a feature of the debate competitions we collected: a debater from each side is expected to deliver a summative and comprehensive remark that covers the key statements and arguments of their team, and the last round of a match is called the 'Concluding Phase' in those competitions. Second, compared to summaries written by annotators, original remarks from professional debaters are more consistent with the debate contents and strongly stance-based, in line with the key stance-specific feature that we wish to highlight in this study.
Nonetheless, taking final remarks as summaries has its limitations: in some cases, unseen information or improvised inferences were added by the debaters who made the closing remarks. Also, concluding statements in a competitive debate scenario are relatively long, resulting in low compression ratios. In this dataset, the compression ratios (summary length / debate length) are 0.43 and 0.41 for the proposition and opposition sides respectively (for comparison, SAMSum's is 0.30).
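The compression ratio quoted above is a simple length ratio, measured in Chinese characters:

```python
def compression_ratio(summary, debate):
    """Summary length divided by debate length, both in characters."""
    return len(summary) / len(debate)
```

For instance, a 7,800-character closing remark over an 18,107-character debate yields a ratio of roughly 0.43, matching the proposition-side figure reported above.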

D Human Evaluation
Metrics Following the Allen Institute for AI's GENIE leaderboard, we choose four aspects as human evaluation metrics. The evaluators are requested to rate generated summaries on four aspects: 1) conciseness, 2) fluency, 3) faithfulness (non-hallucination) and 4) informativeness, on a 1 to 5 rating scale. An overall score is calculated by averaging the scores across aspects with equal weights.
Instructions We randomly split the test set into 12 groups of debates, each consisting of around 8 debates. Every evaluator is randomly and exclusively assigned one group of debates. For every debate, we ask the evaluators to 1) read the transcript, 2) read the gold reference summaries, and 3) read and rate the generated summaries.

Figure 1: An excerpt of one debate in ORCHID. One debate entry of our dataset consists of: (1) the debate topic, (2) position statements of both sides, (3) utterances labelled with speaker and stance, and (4) stance-specific summaries. The original text is in Chinese (see Appendix G for a more complete example).

Figure 2 :
Figure 2: Distribution of topic domains in ORCHID.While there are 476 unique topics, one topic could be classified into multiple domains.
* Corresponding author.
1 ORCHID is publicly available at https://github.com/xiutian/OrChiD

Table 1:
Comparison of some existing dialogue summarization datasets. Lg. denotes language: En for English and Zh for Chinese. Abs. and Ext. denote abstractive and extractive summaries. Dialogue Topic indicates whether each dialogue has a headline or title. a MSAMSum contains parallel corpora of Chinese, English, French and Russian. b Part of the DialogSum dialogues are argumentative. c ELITR includes 59 Czech and 120 English meetings.

Table 3 :
Overview statistics for ORCHID. Debate, utterance and summary average lengths are measured in Chinese characters.
* Closing remarks were excluded from calculation.

Table 4 :
Data split statistics for benchmark testing.

Table 5 :
Results of target-independent stance detection on the ORCHID dataset. Acc. denotes overall accuracy.

Table 6 :
Abstractive summarization task results reported in ROUGE scores. A.C.E. denotes Accumulative Context Enhanced. HE denotes the overall human evaluation score (1 to 5 rating scale).

Table 8 :
The demographic information of the annotators.
a All evaluators have received at least undergraduate level education.

Table 9 :
Overview statistics for the 2023-06-15 snapshot of ORCHID that was tested against in §4. Debate, utterance and summary average lengths are measured in Chinese characters. * Closing remarks were excluded from calculation.

G A Detailed Example of the Dataset

Table 14:
An example of the ORCHID dataset. One entry consists of 1) the debate topic, 2) position statements from both sides, 3) utterances labelled with stance and debater, and 4) reference stance-specific summaries of both sides. The labels PRO and CON indicate the proposition and opposition stance respectively, while SUM denotes summaries. 1) Statements are marked PRO-P# or CON-C#, denoting debater No. # on the proposition or opposition side respectively. 2) Discussion rounds involving both sides are labelled MIXED-P#C$, indicating the debaters who participated in the discussion (see Appendix C for more details on labels). *The actual data does not include the English translations.

Excerpts (English translations marked with *):

* Exploring whether technology is ethically neutral requires looking at whether the technology has a bias and whether such bias leads to clear differential outcomes. [...]

MIXED P1C4: 没有立场和是中间立场有没有区别？我方觉得没有立场，不偏向于任何一方，就是一种中立的体现。[...]
* Is there any difference between having no position and taking a neutral position? We feel that having no position and not favoring either side is a manifestation of neutrality. [...]

* It is clear to everyone here that technology, as an unconscious non-living entity, cannot choose to be neutral or biased. [...]

MIXED P4C1: 换言之好人用就是好技术，坏人用就是坏技术。[...]
* In other words, good technology is what good people use and bad technology is what bad people use. [...]

PRO P2: 我问你今天我手上如果有一把刀，刀是善的还是恶的？刀既可以用来杀人，医生也可以用来用它来救人，您方的定性要怎么完成？[...]
* If I had a knife in my hand right now, would it be a good or evil one? The knife can be used to kill people, but doctors can also use it to save people; how do you evaluate it? [...]

MIXED P2C3: 所以您方说天地不仁，以万物为刍狗，天地根本就不知道这个世界上发生了什么悲喜。[...]
* So you say Heaven and Earth are not benevolent, treating all things as straw dogs, and Heaven and Earth do not even know what joys and sorrows are happening in this world. [...]

* Whenever a technology is created, human values and preferences are placed upon it, which are ultimately informative and suggestive to us. [...]

* But wouldn't a sharp knife kill faster, so is sharpness still a good quality? Killing is not good under your evaluation, but the sharpness of the knife is good. [...]

[...] 不是技术本身。所以我们尊重技术，我们不应该以自己的善恶喜好去评价技术。[...]
* Technology is a two-sided coin. When we discuss a technology, it is people's good and evil that give the technology its ethical values, not the technology itself. Therefore, we should respect technology and should not judge it by our personal likes and dislikes. [...]

CON SUM: 今天对方辩友问我们，是不是万事万物我们都要给他一个立场，都要给他一个定义？今天我们在这里勇敢地认下来，是这样的。[...]
* Today, the opposing debaters asked us whether we must give everything a stance and a definition. Here today, we bravely affirm that it is indeed so. [...]
* CON SUM 今天对方辩友问我们，是不是万事万物我们都要给他一个立场，都要给他一个定义？今天我们 在这里勇敢地认下来，是这样的。[...]Today, our opponents' debater asked us if we had to give everything a stance and definition.Here today, we bravely affirm that it is indeed so.[...] * An example of ORCHID dataset.One entry consists of 1) debate topic, 2) position statements from both sides, 3) utterances labelled with stance and debater, 4) reference specific-summaries of both side.The labels PRO and CON indicate proposition and opposition stance respectively, while SUM denotes summaries.1) Statements were marked with PRO-P# or CON-C# denoting debater No.# in proposition or opposition respectively.2) Discussion rounds involving both side were labelled MIXED-P#C$ indicating the debaters who participated the discussion (see Appendix C for more details on labels).*Actual data dose not include English translated text.