DialogSum: A Real-Life Scenario Dialogue Summarization Dataset

The proposal of large-scale datasets has facilitated research on deep neural models for news summarization. Deep learning can also be potentially useful for spoken dialogue summarization, which can benefit a range of real-life scenarios including customer service management and medication tracking. To this end, we propose DialogSum, a large-scale labeled dialogue summarization dataset. We conduct empirical analysis on DialogSum using state-of-the-art neural summarizers. Experimental results show unique challenges in dialogue summarization, such as spoken terms, special discourse structures, coreference and ellipsis, and pragmatics and social common sense, which require specific representation learning techniques to handle.


Introduction
Text summarization is the task of automatically generating a concise, salient, coherent and fluent summary of a given set of documents (Radev et al., 2002). Thanks to advances in neural network models and the availability of large-scale labeled datasets, recent research has achieved promising progress on summarizing monologic texts such as news articles (Paulus et al., 2018; Gehrmann et al., 2018; Liu and Lapata, 2019), patents (Pilault et al., 2020) and academic papers (Koncel-Kedziorski et al., 2019).
However, dialogue, as an important channel for achieving communicative intents (Bender and Koller, 2020), has received significantly less attention from the summarization research community. One main reason is the paucity of a suitable summarization dataset built on dialogue texts. Most existing research uses the AMI meeting corpus (Carletta et al., 2005), which consists of 137 dialogues obtained from virtual multi-party meeting recordings. However, research on the corpus is limited by its small scale. SAMSum (Gliwa et al., 2019) is a recently released written online dialogue summarization dataset, which contains 16k online chats with corresponding summaries. However, it focuses on conversations via messenger apps, which are rather short (around 94 tokens per conversation), and their language style and topics also differ from spoken daily dialogues.

[Figure 1(a): a dialogue from DIALOGSUM and its summary. Dialogue: #Person_1#: Good morning. I wonder whether you have got an answer from your superior. #Person_2#: Yes, we had a meeting about it yesterday afternoon. #Person_1#: What's the answer? #Person_2#: We decided that we could agree to your price, but we are a bit worried about the slow delivery. #Person_1#: Let me see. I quoted your delivery in three months, didn't I? #Person_2#: Yes, but we hope that the wool could reach us as soon as possible. #Person_1#: I thought you would. So I rang Auckland last night. As you are our biggest customer, they agreed to ship the order on the first vessel available that will leave Auckland next month. #Person_2#: Good, if you agree we'll draft the agreement right away and sign it then. #Person_1#: By all means. Summary: #Person_1# and #Person_2# agree to sign an agreement since #Person_1# could speed up the delivery as #Person_2# hopes.]
A comparison between a real-life scenario dialogue and an online chat is shown in Figure 1. Online-chat messages contain unique tokens (e.g., "BTW"), emoticons (e.g., ":)") and emojis. In contrast, daily conversations have a different and more formal style. In addition, real-life dialogues cover more diverse task-oriented scenarios and topics than online chit-chat. For example, online-chat messages in SAMSum are about leisure and social chat, while real-life dialogues include business negotiation (Figure 1(a)). Intuitively, automatically summarizing such dialogues can help a business identify common needs or complaints from customers. With the rise of personal assistant chatbots, summarizing dialogues from different aspects of daily life can also be useful for personal record management and other applications.
We introduce Real-Life Scenario Dialogue Summarization (DIALOGSUM), a large-scale summarization dataset for dialogues. Dialogue data for DIALOGSUM are collected from three public dialogue corpora, namely DailyDialog (Li et al., 2017), DREAM (Sun et al., 2019) and MuTual (Cui et al., 2020), as well as an English speaking practice website. These datasets contain face-to-face spoken dialogues that cover a wide range of daily-life topics, including schooling, work, medication, shopping, leisure and travel. Most conversations take place between friends, colleagues, and between service providers and customers. We clean and preprocess the dialogue data into a unified format, and ask annotators to summarize them from an observer perspective. Topics are also manually labeled for each dialogue. An example from DIALOGSUM is shown in Figure 1(a), where the summary expresses the main content of a business conversation.
The contribution of DIALOGSUM can be stated from two perspectives. First, from the perspective of downstream applications, summarizing daily spoken dialogues is useful for both business and personal purposes; dialogue summaries can also help personal assistants keep track of important events such as business negotiations. Second, from the method perspective, DIALOGSUM provides a large amount of long dialogue data, which facilitates the study of dialogue summarization using neural network models. The number of dialogues in DIALOGSUM is orders of magnitude larger than in AMI, which is useful for training large neural network models for dialogue summarization, and the average dialogue length in DIALOGSUM is 39.8% longer than in SAMSum. To our knowledge, we are the first to release a large-scale real-life scenario dialogue summarization dataset.
We empirically investigate the performance of state-of-the-art neural summarization models on DIALOGSUM, comparing the characteristics of this spoken daily dialogue summarization dataset with standard news summarization benchmarks and the online chat summarization benchmark SAMSum. Experimental results show that DIALOGSUM is more amenable to abstractive summarizers, while being relatively more challenging than existing summarization datasets. We find that the main difficulties arise from the discourse structures of multi-turn dialogues, as well as the need to track both entities and events across turns of utterances. We release our dataset at https://github.com/cylnlp/DialogSum.

Dialogue Data Preparation
Data Collection DailyDialog is a dataset consisting of 13k multi-turn dialogues, obtained from websites that help English learners practice English speaking. DREAM and MuTual are dialogue understanding datasets, consisting of 6k and 9k speech transcripts, respectively, both collected from online English listening exam materials. To further increase data diversity, we crawl additional dialogues from another English speaking practice website 1 , which aims to provide English learners with conversation examples for practical real-life circumstances, such as business negotiation and banking services.
Although the dialogues in DIALOGSUM come from different sources, they share important characteristics that are in line with our goals. First, as mentioned earlier, these dialogues cover rich real-life scenarios. Unlike chit-chat, these conversations have clear communication patterns and intents, making them more suitable and valuable summarization sources (Carletta et al., 2005). Moreover, their multi-turn lengths are on a reasonable scale and are longer than chit-chat 2 , which suits the purpose of automatic summarization. Greater length also indicates that these dialogues contain more events and more discourse relations between them; properly selecting vital events and identifying their relations makes summarizing these dialogues more challenging.
Data Cleaning and Pre-Processing We delete non-English characters, correct typos and grammatical errors, and filter out duplicated data based on text similarity. Proportions of the data sources after deduplication are summarized in Table 2. Because of different data processing and annotation procedures, the original dialogues in DailyDialog, DREAM and MuTual are in different formats. We follow previous work (Li et al., 2017; Zhang et al., 2018; Budzianowski et al., 2018; Dinan et al., 2019) and preprocess them into a bi-turn dialogue flow, merging continuous turns of the same speaker into one utterance. We also add tags (e.g., #Person_1# and #Person_2# in Figure 1(a)) before each dialogue turn to distinguish speakers. The final DIALOGSUM dataset contains 13,460 dialogues, divided into training (12,460), validation (500) and test (500) sets.
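The merging-and-tagging step can be sketched as follows. This is a minimal illustration; the function name and speaker-mapping details are our own, not part of a released pipeline:

```python
from typing import List, Tuple

def preprocess_dialogue(turns: List[Tuple[str, str]]) -> List[str]:
    """Merge continuous turns of the same speaker into one utterance,
    then prefix each utterance with an anonymized #Person_N# tag."""
    # Merge consecutive turns by the same speaker.
    merged: List[Tuple[str, str]] = []
    for speaker, text in turns:
        if merged and merged[-1][0] == speaker:
            merged[-1] = (speaker, merged[-1][1] + " " + text)
        else:
            merged.append((speaker, text))
    # Map speakers to #Person_1#, #Person_2#, ... in order of appearance.
    tags: dict = {}
    utterances = []
    for speaker, text in merged:
        if speaker not in tags:
            tags[speaker] = f"#Person_{len(tags) + 1}#"
        utterances.append(f"{tags[speaker]}: {text}")
    return utterances
```

For example, two consecutive turns by the same speaker collapse into a single tagged utterance, giving the bi-turn flow described above.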

Annotation
We ask annotators to write dialogue summaries based on the following criteria: the summary should (1) convey the most salient information of the dialogue; (2) be brief (no longer than 20% of the conversation length); (3) preserve important named entities within the conversation; (4) be written from an observer perspective; and (5) be written in formal language. We require our annotators to pay extra attention to the following aspects.
Tense Consistency: Annotators should take the moment that the conversation occurs as the present time, and choose a proper tense to describe events before and after the ongoing conversation.
Discourse Relation: If summarized events hold important discourse relations, particularly causal relations, annotators should preserve these relations in the summary.
Emotion: Unlike newspaper and academic articles, social conversations in DIALOGSUM often carry implicit emotions. Therefore, we ask annotators to explicitly describe important emotions related to events in the summary.
Intent Identification: Rather than merely summarizing the consequences of dialogues, annotators should also describe speakers' intents in summaries, if they can be clearly identified.
In addition to the above, annotators should use person tags to refer to different speakers if real names cannot be detected from the conversation. Annotators are also asked to write a short (around 3 tokens) topic for each dialogue. Appendix A shows the list of topics.
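Of the criteria above, the 20% length limit in criterion (2) is the one that can be checked mechanically; a simple token-based check might look like this (whitespace tokenization is our simplifying assumption, as the paper does not specify how length is measured):

```python
def within_length_limit(dialogue: str, summary: str, ratio: float = 0.2) -> bool:
    """Check criterion (2): the summary should be no longer than
    20% of the conversation length, measured here in whitespace tokens."""
    return len(summary.split()) <= ratio * len(dialogue.split())
```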

Quality Control
To ensure quality, before formal annotation, we ask annotators to annotate training samples until they pass our examination and meet our requirements. After annotation, we check summaries by cross-validation between different annotators, twice. During the checking process, a bonus is paid to checkers who find unqualified summaries, and a penalty is given to annotators whose annotations are found to contain mistakes. In case of appeal, we make the final decision. After the second check, we sample 10% of the summaries and manually check them ourselves. If errors are found in an annotation batch, we ask the corresponding annotators to self-check and re-annotate the whole batch, and we repeat this checking and sampling process.
To further control quality and to analyze inter-annotator agreement, for each dialogue in the test set, we provide three summaries written and checked by different annotators. For each test dialogue, we compare its three summaries and compute their pair-wise ROUGE (Lin, 2004) scores. Table 3 reports the averaged F1 scores of ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL). R2 is relatively low while RL is high, which suggests that annotators' language use varies, but the main content and logical order are mostly the same.
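The pair-wise agreement computation can be sketched as follows. For brevity this uses a simplified unigram-overlap ROUGE-1 F1 implemented from scratch, not the official ROUGE toolkit (Lin, 2004) that the paper's numbers are based on:

```python
from collections import Counter
from itertools import combinations

def rouge1_f1(ref: str, hyp: str) -> float:
    """Simplified ROUGE-1: unigram overlap F1 between two texts."""
    ref_counts, hyp_counts = Counter(ref.lower().split()), Counter(hyp.lower().split())
    overlap = sum((ref_counts & hyp_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

def pairwise_agreement(summaries) -> float:
    """Average pair-wise ROUGE-1 F1 over a dialogue's reference summaries."""
    pairs = list(combinations(summaries, 2))
    return sum(rouge1_f1(a, b) for a, b in pairs) / len(pairs)
```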

Characteristics of DIALOGSUM
We empirically compare DIALOGSUM with existing news summarization datasets and SAMSum. CNN/DailyMail (Hermann et al., 2015), NY Times (Sandhaus, 2008) and XSum (Narayan et al., 2018) are large-scale summarization datasets from the news domain, written in a monologic structure; XSum is designed specifically for abstractive summarization. First, we compare the percentages of novel n-grams in the reference summary against the source document/dialogue, which intuitively reflects the level of abstraction of the annotated summaries. As shown in Table 4, except for XSum, which is designed to be highly abstractive, the dialogue-based summarization datasets contain more novel n-grams in their summaries. We also find that the percentage of novel unigrams in DIALOGSUM is 26%, 6% lower than in SAMSum, while novel bigrams, trigrams and 4-grams are about the same as in SAMSum. We believe the relatively lower novel unigram proportion in DIALOGSUM is due to our pre-processing and annotation criteria: SAMSum's summaries include real names and third-person singular pronouns, which can be diverse across dialogues, whereas DIALOGSUM uses tags such as #Person_1# to refer to persons regardless of whether the reference is subjective, objective or possessive, which constrains the proportion of novel unigrams.
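The novel n-gram statistic can be computed as below. Whitespace tokenization and lowercasing are our simplifying assumptions; the paper does not specify its tokenizer:

```python
def novel_ngram_rate(source: str, summary: str, n: int) -> float:
    """Fraction of summary n-grams that never appear in the source text."""
    src, summ = source.lower().split(), summary.lower().split()
    summ_ngrams = [tuple(summ[i:i + n]) for i in range(len(summ) - n + 1)]
    if not summ_ngrams:
        return 0.0
    src_set = {tuple(src[i:i + n]) for i in range(len(src) - n + 1)}
    novel = sum(1 for g in summ_ngrams if g not in src_set)
    return novel / len(summ_ngrams)
```

A more abstractive dataset yields higher rates, especially for higher-order n-grams.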
Second, we compare the datasets using several extractive summarization methods. Following previous summarization work (Liu, 2019; Pilault et al., 2020), we report R1, R2 and RL. LEAD creates summaries by selecting the first n sentences from the source text. LONGEST is designed for dialogue summarization (Gliwa et al., 2019); it selects the n longest utterances as a summary, which gives better ROUGE scores than LEAD on SAMSum. EXT-ORACLE creates summaries by choosing the n sentences that have the highest ROUGE against the reference summaries, and can be viewed as an upper bound for extractive summarization. We report results of LEAD-3, LONGEST-3 and EXT-ORACLE-2 on SAMSum, and LEAD-2, LONGEST-2 and EXT-ORACLE-2 on DIALOGSUM, where n is searched for each dataset in the range of 1 to 6.
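The LEAD and LONGEST baselines are simple enough to sketch directly (utterance-level granularity and whitespace token counts are our assumptions; keeping LONGEST's selections in original order is also an assumption):

```python
from typing import List

def lead_n(utterances: List[str], n: int) -> str:
    """LEAD: take the first n utterances as the summary."""
    return " ".join(utterances[:n])

def longest_n(utterances: List[str], n: int) -> str:
    """LONGEST (Gliwa et al., 2019): take the n longest utterances,
    kept here in their original dialogue order."""
    ranked = sorted(range(len(utterances)),
                    key=lambda i: len(utterances[i].split()),
                    reverse=True)[:n]
    return " ".join(utterances[i] for i in sorted(ranked))
```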
The results are shown in Table 4. In terms of LEAD, DIALOGSUM has the lowest R1 and R2 except for XSum, showing that it is by nature a highly abstractive summarization dataset. SAMSum is less abstractive than DIALOGSUM by all ROUGE scores, which is likely because the compression rate of SAMSum (0.30) is higher than that of DIALOGSUM (0.18) (Table 1): a higher compression rate means a larger fraction of the source content is carried into the summary. The same conclusion can be drawn using the LONGEST method. Using the EXT-ORACLE method, we find that DIALOGSUM is the most challenging dataset for extractive summarizers except for XSum, which is carefully designed for evaluating abstractive summarizers.

Experiments
We experiment with several abstractive summarization baselines to further understand the characteristics and challenges of DIALOGSUM. Following Gliwa et al. (2019), we concatenate the utterances of a dialogue as the input. For pretrained models, we only finetune them on the corresponding datasets. BART BART (Lewis et al., 2020) is an encoder-decoder Transformer model pretrained on a large corpus using a denoising autoencoder task. We use the large version of BART and finetune it for dialogue summarization with 5,000 training steps and 200 warmup steps; the learning rate is set to 3e-5.
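A sketch of the learning-rate schedule implied by this setting: linear warmup over 200 steps up to 3e-5; the linear decay afterwards is our assumption, as the paper does not state the decay schedule:

```python
def learning_rate(step: int, base_lr: float = 3e-5,
                  warmup_steps: int = 200, total_steps: int = 5000) -> float:
    """Linear warmup to base_lr over warmup_steps, then (assumed)
    linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)
```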

Results
Table 5 presents the experimental results. In general, we find that non-pretrained abstractive models outperform LEAD. Pretrained models bring substantial RL improvements on SAMSum and gains of 11.33∼11.54% on DIALOGSUM, but only 3.05∼3.81% on CNNDM. We believe this is because CNNDM is a highly extractive dataset (Section 2.4): the key to summarizing CNNDM is to correctly understand inter-sentence relations within long documents and extract important sentences. In contrast, XSum, SAMSum and DIALOGSUM are more abstractive, which requires a model to paraphrase, and the strong generation capability of pretrained models brings large improvements on them. We also see that, for abstractive datasets, model performance decreases as document length grows (avg. length: SAMSum 93.8, DIALOGSUM 131.1, XSum 431.1) and as compression rate decreases (comp. rate: SAMSum 0.30, DIALOGSUM 0.18, XSum 0.05). This explains why SAMSum is the easiest dataset.
Spoken vs Written All three models perform better on the dialogue summarization datasets than on XSum. This is potentially because XSum is naturally highly abstractive and thus more challenging. We also compare the improvements brought by pretrained models, which are trained on large written corpora.
Still in Table 5, the improvement on DIALOGSUM is the smallest: BART LARGE outperforms Transformer by 15.73% R1 on XSum and 15.92% R1 on SAMSum, but only 11.37% R1 on DIALOGSUM. This demonstrates that SAMSum is overall closer to written style than DIALOGSUM, and also suggests that dialogue and monologue differ; it can be explained by the written app-chat annotation design of SAMSum (Gliwa et al., 2019).

DIALOGSUM vs SAMSum As shown in Table 5, model performance is steadily lower on DIALOGSUM than on SAMSum. As stated, DIALOGSUM is more abstractive, more open-domain, and closer to spoken language. One more possible reason for the lower performance on DIALOGSUM is its longer input. To better quantify the difference between these two dialogue summarization datasets, we further evaluate a Transformer trained on DIALOGSUM and tested on SAMSum, and vice versa. As shown in Table 6, the performance of the Transformer drops greatly in both transfer directions, showing that the two datasets have substantial differences.

In addition, the Transformer trained on DIALOGSUM performs better than the one trained on SAMSum and shows a smaller performance drop, suggesting that DIALOGSUM provides better generalization for training dialogue summarization models.

Human Evaluation
To better understand DIALOGSUM, we investigate the outputs of Transformer and UNILMV2 on DIALOGSUM more deeply by conducting human evaluation along multiple aspects.
Fluency, Consistency, Relevance and Coherence First, following Kryscinski et al. (2019, 2020), we conduct human evaluation along four dimensions. Fluency evaluates the quality of individual generated sentences, Consistency evaluates the factual alignment between the source text and the generated summary, Relevance evaluates the importance of summary content, and Coherence evaluates the collective quality of all sentences. We randomly select 50 dialogues and their summaries from the DIALOGSUM test set, and ask a judge to give scores on a scale from 1 to 5 along the four dimensions; the higher, the better. The judge also scores the human-annotated summaries to evaluate their quality. As shown in Table 7, human-annotated summaries receive the best scores on all dimensions. UNILMV2 BASE scores consistently better than Transformer, but lower than human. Model-generated summaries have the highest scores on Fluency and the lowest on Consistency, suggesting that although model-generated summaries are grammatical and fluent, they still contain factual errors.

Discourse Relation Reasonable summaries should convey important relations between main events, and identifying discourse relations and using proper phrases to express them in summaries can be challenging for summarization systems (Xu et al., 2020). Take Figure 1(a) for example: the human-annotated summary connects the two main events using "since" to express their causal relation explicitly. However, this causal relation is not explicitly expressed in the dialogue, and the distance between the two events is long. More turns usually correspond to more complicated discourse structures and relations. Also, similar to Chen and Yang (2020), we find that model performance decreases as the number of dialogue turns grows (see Appendix B).
To better evaluate model ability to disambiguate discourse relations in DIALOGSUM, we first collect discourse connectives from the Penn Discourse Treebank (Miltsakaki et al., 2004), and check whether these connectives appear in the summaries of the test set. If all three reference summaries of a dialogue contain connectives, we assume that the dialogue has strong discourse signals. We choose 70 dialogues from DIALOGSUM in this way.
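This filtering step can be sketched as follows. The connective list here is a small illustrative subset; the paper draws the full inventory from the Penn Discourse Treebank (Miltsakaki et al., 2004):

```python
# Illustrative subset of PDTB-style explicit discourse connectives.
CONNECTIVES = {"because", "since", "but", "however", "so", "although",
               "after", "before", "then", "therefore", "while"}

def has_strong_discourse_signal(reference_summaries) -> bool:
    """A dialogue is kept only if every one of its reference
    summaries contains at least one discourse connective."""
    def contains_connective(summary: str) -> bool:
        tokens = summary.lower().replace(",", " ").split()
        return any(tok in CONNECTIVES for tok in tokens)
    return all(contains_connective(s) for s in reference_summaries)
```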
We then ask linguists who specialize in discourse to evaluate model outputs and give scores from {−1, 0, 1}, where 1 means that the generated descriptions of main events are reasonable and contain correct discourse connectives, 0 means that the descriptions are good but contain no discourse connectives and −1 means that the description is either incorrect or contains incorrect connectives. We ask the linguists to focus only on clauses or phrases that are essential to discourse relations, and ignore syntactic errors. We report the distribution of annotated scores in Table 8.
We can see that most summaries generated by Transformer are scored −1, and their average score is −0.77, close to −1. This means that Transformer is incapable not only of identifying discourse relations but also of generating the main events correctly. UNILMV2 has a relatively smooth distribution over the three categories and a better average score of −0.23, closer to 0, suggesting that UNILMV2 can mostly choose the important events in a conversation. However, −1 still accounts for the largest proportion, and the average is still far from 1, indicating that the model cannot understand relations between events.
Compared to the full test set, model performance on this subset generally decreases (R1 1.56∼3.26% lower, R2 1.73∼3.22% lower, RL 2.37∼4.07% lower), which also suggests that complicated discourse relations between events make summarization more difficult. These results indicate that further research on better representing dialogue discourse structure is necessary to obtain more reliable summarization systems.
Coreference Information To evaluate a model's ability to distinguish different interlocutors, we ask a judge to evaluate whether interlocutors' names and their conversational actions/contents are correctly associated in the 50 randomly selected dialogues, giving scores from {−1, 0, 1}, where 1 means that all names and actions/contents in the summary are associated correctly, 0 means that some are associated incorrectly, and −1 means that all are incorrect. Here, we focus only on coreference information in the generated summaries, and ignore other errors, such as incorrect syntax or failure to summarize salient information.
We report the distribution of annotated scores in Table 9. Most Transformer-generated summaries are annotated as −1 and the average is close to −1, suggesting that Transformer cannot generate clauses that preserve the relations between arguments and predicates in the original dialogues. UNILMV2 BASE has more 0-scored summaries and a much higher average, though one still closer to 0, indicating that although UNILMV2 BASE can generate summaries containing correct clauses, they still show much inconsistency. The performance of the two models indicates that Transformer is only capable of extracting important word-level information from dialogues in DIALOGSUM, while UNILMV2 BASE performs better at the clause level: it can identify the speakers and partially preserve coreference information, consistent with the finding of Levesque et al. (2012) that pretraining is useful for coreference resolution. However, it remains far from human annotation.
Intent Identification As stated in Section 2.2, we ask annotators to include important intents of interlocutors in their summaries, in addition to the consequences of dialogues. Intent here refers to the motivation of a speaker to initiate a conversation, e.g., "want to do an annual physical" (c.f. Figure 2, DIALOGUE-A). This can make summaries more comprehensive and readable. We therefore conduct a corresponding human evaluation on whether interlocutors' intents are described in the summaries of the 50 randomly selected dialogues. We first ask a judge to evaluate whether the intent is important to a dialogue, selecting 39 dialogues that contain important intents. Then, we ask the judge to give scores from {−1, 1}, where 1 means that the intents are identified correctly and −1 means incorrectly. Note that we focus only on intent identification in the summary; other errors are ignored. We also ask the judge to evaluate the human-annotated summaries.
The distribution of annotated scores is shown in Table 10. We see that most summaries generated by Transformer are scored as −1, which means that Transformer is incapable of generating summaries that correctly convey speakers' intents. UNILMV2 BASE shows much better performance, however, it is still below human performance.

Challenges in DIALOGSUM
Compared to written texts, spoken dialogues can be more difficult for models to understand and to summarize (Goo and Chen, 2018). Therefore, we conduct error analysis and case studies on DIALOGSUM to quantitatively and qualitatively discuss such challenges.

[Figure 2 (excerpt): SUMMARY-A1: #Person_2# wants to do an annual physical examination to apply for new health insurance and says #Person_2#'s breathing is not good. #Person_1# explains the items and will do tests on #Person_2#'s breathing. SUMMARY-A2: #Person_1# explains the checking items in #Person_2#'s annual physical examination and will do tests to look into #Person_2#'s breathing. SUMMARY-A3: #Person_2# is going through an annual physical examination to apply for new health insurance, and #Person_2# asks #Person_1# to look into the breathing. UNILMV2: #Person_2# comes to #Person_1#'s annual physical to apply for new health insurance. #Person_1# will do an allergy test, an asthma test, and a blood test. Transformer: #Person_2# checks out with #Person_2#'s assistance and thinks they'll be very sorry for the laundry service.]

Error Analysis
We conduct error analysis on the 50 selected model-generated summaries (Section 4). Table 11 summarizes the five most frequent error types and their error rates. In general, UNILMV2 BASE performs better than Transformer, but its error rates are still high. In particular, incorrect coreference (c.f. Section 4) has the highest error rate for both models, indicating that models can be confused by the interactive information flow. Compared with Transformer, UNILMV2 BASE largely avoids errors regarding non-factual information (−52%) and syntax (−50%). However, it still suffers from coreference issues and tends to generate redundant summaries.

Case Study
We show two dialogues and their human-annotated/system-generated summaries in Figure 2. First, a big challenge posed by spoken dialogues is that their information flow differs from that of monologic text, which is intuitively reflected in dialogue discourse structures (Wolf and Gibson, 2005). For example, two utterances can be closely related even when there is a large distance between them. Such phenomena are common in spoken dialogues such as negotiations and procedural conversations (e.g., medical consultations and police reports). Due to the unique structure of spoken dialogue, important information is more dispersed than in well-structured monologues and written dialogues.
Regular greetings can be useless for written dialogue summaries (e.g., SAMSum), which is reflected by the fact that LEAD is worse than LONGEST on SAMSum (Table 4). In contrast, LEAD outperforms LONGEST by over 3% on DIALOGSUM. This is because, for spoken dialogues, such utterances sometimes express and indicate essential intents of the speakers (c.f. Section 4). Farewells also express the dialogue's consequence and the speakers' future plans (e.g., the dialogues in Figure 2). Besides, interruptions appear frequently in the middle of conversations (Figure 2, DIALOGUE-B). These interruptions make other speakers' utterances incomplete, add redundant information, and can destroy coherent discourse structures, making dialogues more difficult to encode. These characteristics also make information in DIALOGSUM dialogues more dispersed than in existing datasets.
Second, coreference and ellipsis are frequent in spoken dialogues (Grosz et al., 1995; Quan et al., 2019). They are natural communicative behaviors that humans follow as a rhetorical principle for saving words and avoiding repetition. Although resolving them is trivial for humans, it can be challenging for a neural model. For example, to correctly generate "mischarged/wrong" in SUMMARY-B1 to SUMMARY-B3, models need to understand "I think you have added someone else's (laundry service on my bill)", where "my bill" refers to "#Person_2#'s bill".
Third, pragmatics and social common sense pose a unique challenge for spoken language understanding and have a significant impact on summarization. From the last two sentences of DIALOGUE-B, humans can understand that "Here you are" actually means "make a payment", and that "Goodbye" indicates that the event "check out" is finished. Fully understanding such dialogues requires commonsense knowledge. Besides, dialogues are summarized from a perspective different from the speakers', which suggests that summarizing dialogues needs to go beyond summarizing dialogue contents to cover dialogue actions at the pragmatic level. For example, "explains" in SUMMARY-A1 and SUMMARY-A2 summarizes multiple dialogue actions of #Person_1#, and "agree" in Figure 1(a) summarizes actions of both speakers. This requires a model to summarize not only what speakers are saying, but also what they are doing.

Conclusion
We presented DIALOGSUM, a large-scale dialogue summarization dataset, and empirically investigated its characteristics and challenges. Experiments with typical models show that DIALOGSUM is highly abstractive and poses unique challenges in discourse and complex coreference. Based on these observations, we discussed the uniqueness of spoken dialogue summarization, listing several key problems to consider in future modeling. To our knowledge, we are the first to release a large-scale dataset for real-life scenario dialogue summarization.

Ethics Consideration
As mentioned, we collect our data from DailyDialog, DREAM and MuTual, all of which are public for academic use. The additional data come from www.tingroom.com, which is also available to the public. The sources of our dialogue data are freely accessible online without copyright constraints for academic use.
We hired annotators who hold degrees in English Linguistics or Applied Linguistics. Before formal annotation, we annotated 50 samples randomly extracted from the dataset and calculated our average annotation time, so that we could set a fair salary for the annotators' training annotation; they were paid during the training annotation process as well. We also calculated the average annotation time per dialogue during training, based on which we set the final salary at around 9.5 dollars per hour. The same hourly rate applied to manual checking. All of our annotators took this annotation as a part-time job.