CroAno: A Crowd Annotation Platform for Improving Label Consistency of Chinese NER Dataset

In this paper, we introduce CroAno, a web-based crowd annotation platform for Chinese named entity recognition (NER). Besides basic features for crowd annotation such as fast tagging and data management, CroAno provides a systematic solution for improving the label consistency of Chinese NER datasets. 1) Disagreement Adjudicator: CroAno uses a multi-dimensional highlight mode to visualize instance-level inconsistent entities and makes the revision process user-friendly. 2) Inconsistency Detector: CroAno employs a detector to locate corpus-level label inconsistency and provides users an interface to correct inconsistent entities in batches. 3) Prediction Error Analyzer: we deconstruct the entity prediction errors of the model into six fine-grained error types. Users can employ this error system to detect corpus-level inconsistency from a model perspective. To validate the effectiveness of our platform, we use CroAno to revise two public datasets. On the two revised datasets, model performance improves by +1.96% and +2.57% F1 respectively.


Introduction
Named entity recognition (NER), the task of detecting and classifying named entities in text, has made significant progress with data-driven methods (Lample et al., 2016). Existing supervised approaches to NER require massive amounts of high-quality annotated data. Since hiring annotation experts is costly and time-consuming, crowd annotation by non-expert annotators is commonly used instead; the drawback is a higher proportion of inconsistency in the annotation. However, existing NER annotation tools for crowd annotation (Ogren Philip, 2006; Chen and Styler, 2013; Manning et al., 2014; Samih et al., 2016) mainly aim to improve annotation efficiency and rarely consider the dataset's consistency. Label inconsistency is ubiquitous in NER datasets. For example, in OntoNotes 4.0 (Weischedel et al., 2011), a classical Chinese NER benchmark, the proportion of label inconsistency is up to 10% according to our estimation. In this dataset, the mention "中国人民" (Chinese People) appears 36 times: the whole mention is marked as an entity 23 times, while "中国" (China) alone is labeled as an entity 13 times. Such inconsistency may confuse the NER model and cause disastrous results.
In this paper, we propose a web-based crowd annotation platform named CroAno. As shown in Figure 1, CroAno contains three modules to improve the label consistency of Chinese NER datasets.
Disagreement Adjudicator: Crowd annotation tools usually distribute the same instance to different annotators, which can cause disagreement. We call this phenomenon instance-level label inconsistency, because the inconsistency occurs within the same instance. Instance-level label inconsistency is easy to locate but difficult to display and correct. YEDDA (Yang et al., 2018) employs a comparison report to show these inconsistencies to annotation experts, but it fails to display detailed information about inconsistent entities, and annotation experts cannot directly correct these entities through the comparison report. To solve these display and correction issues, CroAno uses a multi-dimensional display mode to show inconsistent instances and employs a "click to correct" method to facilitate correction.
Inconsistency Detector: This module is designed to solve corpus-level label inconsistency, especially corpus-level pre&suffix inconsistency. Corpus-level pre&suffix inconsistency refers to inconsistency in whether a descriptive string is included in the entity string. In OntoNotes 4.0, the string "超过" (over) is included as a prefix of a Money entity in some instances, while in other instances it is excluded as an external prefix. Locating these inconsistent entities is hard because a global perspective is required. CroAno uses a detector to locate potential inconsistency and provides an interface that supports users in correcting these inconsistent entities in batches.
Prediction Error Analyzer: While Disagreement Adjudicator and Inconsistency Detector solve label inconsistency from a data perspective, Prediction Error Analyzer detects label inconsistency from a model perspective. To analyze inconsistency, we deconstruct the entity prediction errors of the model into a novel error system of six fine-grained error types, which attaches richer information to each prediction error. CroAno provides a search API that lets users employ the error system to locate specific entities and their context. Moreover, CroAno employs an elaborately designed interface to differentiate the model prediction from the annotation.
In summary, the contributions of this paper are as follows: • We propose a crowd annotation platform that promotes the label consistency of Chinese NER datasets. The site can be accessed at http://116.62.20.198:3000, and an instruction video is provided at https://www.youtube.com/watch?v=wt2ma9FU540. To the best of our knowledge, CroAno is the first crowd annotation platform that aims at promoting the label consistency of NER datasets.
• We introduce three novel modules to promote the label consistency of NER datasets. Disagreement Adjudicator provides a multi-dimensional highlight mode to visualize instance-level inconsistency and an interface to revise it. Inconsistency Detector employs locating and revising strategies to correct corpus-level label inconsistency. Prediction Error Analyzer helps algorithm experts detect corpus-level inconsistency from a model perspective.
• To validate the effectiveness of our platform, we use CroAno to revise two public datasets. On the two revised datasets, we obtain improvements of +1.96% and +2.57% F1 respectively. It is worth mentioning that this improvement in data quality can be inherited by any NER model.

User Roles in CroAno
CroAno defines three roles: the crowd annotator, the annotation expert, and the algorithm expert. The crowd annotator completes basic annotation tasks. The annotation expert is responsible for guideline formulation, data management, task distribution, and label consistency optimization. The algorithm expert is responsible for providing analyses and suggestions on the annotation results from the perspective of the model. The crowd annotator uses the Annotator Interface for annotation. After preliminary annotation, the results are sent to the annotation expert.
The annotation expert uses Disagreement Adjudicator and Inconsistency Detector for revision. Specifically, the annotation expert uses Disagreement Adjudicator to resolve instance-level inconsistency between multiple annotators, and then uses Inconsistency Detector to detect and correct corpus-level inconsistency. After that, the improved dataset is sent to the algorithm expert.
The algorithm expert uses Prediction Error Analyzer for evaluation. The algorithm expert can use Prediction Error Analyzer to inspect the differences between model predictions and annotators' annotations, either to identify directions for improving the annotation or to revise the annotation directly.
The following sections will introduce the main modules of CroAno.

Annotator Interface
This section describes the interface, which is designed for easy annotation. As shown in Figure 2, the interface uses different colors to distinguish the entity categories. Annotators can select an entity and its span with the left mouse button, and they can choose an entity category with a shortcut key or by clicking the corresponding button in the entity label bar.
Beyond the basic annotation functions, annotators can mark the current instance as "annotated" by clicking the first top-left button, filter "annotated" instances by clicking the second top-left button, and open the guideline by clicking the last top-left button.

Disagreement Adjudicator
This section describes Disagreement Adjudicator, which is designed for solving instance-level label inconsistency. Instance-level inconsistency refers to disagreement between different annotators on the same instance. Apart from annotation errors, different understandings of the guideline can also lead to disagreement. This type of inconsistency needs to be revised by the annotation expert. The difficulty lies in visualizing disagreement entities and making them easy to revise.
CroAno uses a novel multi-dimensional highlight mode to visualize disagreement entities. As shown in Figure 3, underlines below the text represent individual annotators' annotations, while the background represents agreed or approved entities. Figure 3 shows an instance annotated by two annotators. The "美国" (America) entity has two black underlines and a black background, which means the two annotators consistently annotated this span as GPE.
The revision stage is easy to execute. The annotation expert can click any disagreement entity to open a revision dialog. The dialog displays all corresponding disagreement entities, and the annotation expert can select the proper one.
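As a concrete illustration of how instance-level disagreement can be identified, the following minimal Python sketch compares the span sets produced by different annotators for one instance; the Span tuple and function name are our own illustrative choices, not CroAno's actual implementation.

```python
from typing import Dict, NamedTuple, Set, Tuple

class Span(NamedTuple):
    start: int  # character offset where the entity begins
    end: int    # character offset where the entity ends (exclusive)
    label: str  # entity type, e.g. "GPE"

def find_disagreements(annotations: Dict[str, Set[Span]]) -> Tuple[Set[Span], Set[Span]]:
    """Split the spans of one instance into agreed and disputed sets.

    `annotations` maps each annotator id to the set of spans they produced.
    A span counts as agreed only if every annotator produced it identically.
    """
    all_spans = set.union(*annotations.values())
    agreed = set.intersection(*annotations.values())
    return agreed, all_spans - agreed

# Two annotators disagree on whether "中国人民" or just "中国" is the entity.
agreed, disputed = find_disagreements({
    "annotator_a": {Span(0, 4, "GPE")},  # "中国人民"
    "annotator_b": {Span(0, 2, "GPE")},  # "中国"
})
print(disputed)  # both conflicting spans surface for adjudication
```

Every span in the disputed set is exactly what the multi-dimensional highlight mode renders for the annotation expert to adjudicate.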

Inconsistency Detector
This section describes Inconsistency Detector, which is designed for reducing corpus-level pre&suffix inconsistency.

Overview
In NER datasets, some entity types have descriptive words, and it is difficult to reach agreement on whether such a descriptive string should be included in the entity string. For example, the MONEY entity type has the descriptive word "more than". In some instances the word "more than" is contained inside the entity string as a prefix, while in other instances it is excluded as an external prefix. Appendix A displays inconsistent pre&suffixes of two NER datasets and some example instances.
Our proposed framework contains a detector algorithm to detect potentially inconsistent pre&suffixes and an interface to help users revise these inconsistencies.

Pre&Suffix Inconsistency Detector
We denote the dataset as D and the entity set extracted from the dataset as S. We use entity to denote a specific entity object from the entity set. An entity object has at least four attributes: string, sentence, start, and end. string denotes the entity string, and sentence is the instance text that contains the entity. Given a specific entity, the expression sentence[start:end] == string always holds.
We use p to denote an entity prefix and p_i to denote a prefix of length i. Similarly, we use e to denote an external descriptive word appearing immediately before the entity string, and e_i to denote an external prefix of length i. The basic idea of the inconsistency detector is that if a descriptive string appears both as an entity prefix and as an external prefix, this string is considered a potentially inconsistent prefix.
We first construct two empty string sets called prefix_set and external_prefix_set. We then traverse the entity set, adding entity prefixes to prefix_set and external prefixes to external_prefix_set. After the traversal, the intersection of the two sets contains exactly the strings that appear both as entity prefixes and as external prefixes. These strings are potentially inconsistent prefixes.
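The procedure can be expressed in a few lines of Python. The sketch below follows the description literally; the Entity class and the maximum prefix length max_len are assumptions introduced for illustration, and CroAno's actual implementation may differ.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass(frozen=True)
class Entity:
    string: str    # the entity string
    sentence: str  # instance text, so sentence[start:end] == string
    start: int
    end: int

def detect_inconsistent_prefixes(entities: List[Entity], max_len: int = 3) -> Set[str]:
    """Return strings occurring both as an entity prefix (p_i) and as an
    external prefix (e_i), i.e. potentially inconsistent prefixes."""
    prefix_set: Set[str] = set()
    external_prefix_set: Set[str] = set()
    for ent in entities:
        for i in range(1, max_len + 1):
            if i <= len(ent.string):
                prefix_set.add(ent.string[:i])             # p_i
            if ent.start - i >= 0:
                external_prefix_set.add(
                    ent.sentence[ent.start - i:ent.start]  # e_i
                )
    return prefix_set & external_prefix_set
```

Detecting inconsistent suffixes works symmetrically, using the characters at the end of the entity string and those immediately after the entity's end position.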

Prediction Error Analyzer
This section describes Prediction Error Analyzer, which is designed for detecting label inconsistency from the model perspective.

Overview
The basic idea of this module is that an NER model trained on the training set learns the regularities of the annotation. When an entity receives a prediction that disagrees with the annotation in the test set, there is a high probability that the model has learned an inconsistent regularity from the training set.
To support detecting inconsistent entities with the NER model, CroAno provides a well-designed error system, a search API based on the error system, and a visualization interface for users to compare the differences between model predictions and annotators' annotations.

Error System
Existing entity error metrics only measure whether an entity is correctly predicted or not, which discards much useful information. CroAno deconstructs entity prediction errors into a well-designed error system.
The error system consists of six error types. Extra Error and Missing Error mean that the NER model predicts an extra entity or misses an annotated entity, respectively. Long Error and Short Error mean that the NER model predicts an entity that contains, or is contained within, an annotated entity. Tag Error means the predicted span matches the annotation but the tags differ, and Intersect Error means the predicted and annotated boundaries overlap without one containing the other.
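For clarity, a minimal sketch of how a prediction/annotation pair might be mapped to these six types is given below. The function assumes the two spans have already been aligned (they overlap, or one side is absent), and its name and signature are illustrative rather than CroAno's actual code.

```python
from typing import Optional, Tuple

Span = Tuple[int, int, str]  # (start, end, tag)

def classify_error(pred: Optional[Span], gold: Optional[Span]) -> Optional[str]:
    """Map one aligned prediction/annotation pair to an error type."""
    if pred is None:
        return "Missing"                    # annotated entity the model missed
    if gold is None:
        return "Extra"                      # predicted entity with no annotation
    ps, pe, pt = pred
    gs, ge, gt = gold
    if (ps, pe) == (gs, ge):
        return "Tag" if pt != gt else None  # same span; None means correct
    if ps <= gs and ge <= pe:
        return "Long"                       # prediction contains the annotation
    if gs <= ps and pe <= ge:
        return "Short"                      # prediction inside the annotation
    return "Intersect"                      # boundaries cross without containment
```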

Search and Visualize Instances
The search API is used to filter entities and their context by error type and other features. For example, if the algorithm expert finds that the PERSON entity type is frequently missed in model predictions, they can use the search API to extract the corresponding instances.
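A hypothetical example of such a query is sketched below; the endpoint path, parameter names, and response fields are invented for illustration and are not CroAno's documented API.

```python
import requests

BASE_URL = "http://116.62.20.198:3000"  # a CroAno deployment

# Hypothetical query: all instances where a PERSON entity was missed
# by the model (Missing Error). Endpoint and parameters are illustrative.
resp = requests.get(
    f"{BASE_URL}/api/errors",
    params={"error_type": "Missing", "entity_type": "PERSON"},
)
for item in resp.json():
    print(item["sentence"], item["gold_span"])
```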
The visualization interface is designed to help users directly perceive the difference between model predictions and annotators' annotations. As shown in Figure 5, annotations from the annotator are marked by the background, while the highlighted underline exhibits entities predicted by the model. Beyond detecting corpus-level inconsistency of the NER dataset, Prediction Error Analyzer can also be used to detect annotation errors or to expose the model's weaknesses and thus guide improvement of the model itself.

Technical Details
This section introduces some necessary technical details. CroAno has a web-based front-end server built on Vue in JavaScript and a back-end server built on Django in Python. CroAno's environment is decoupled from the operating system, which makes it deployment-free. CroAno implements its interface design based on doccano, an open-source Django framework.

Experiments
In this section, we conduct experiments on a standard benchmark and a medical benchmark to verify the effectiveness of Inconsistency Detector.

Experimental Setting
Dataset. Two Chinese NER datasets are used in this paper: OntoNotes 4.0 (Weischedel et al., 2011) and CCKS 2019 (Han et al., 2020). OntoNotes 4.0 is collected from the news domain, while CCKS 2019 is collected from the medical domain. For OntoNotes 4.0, we use the same data split as (Zhang and Yang, 2018). Since the CCKS 2019 dataset does not have a development set, we randomly select 20% of the samples from the training set as the development set. Statistics of the datasets are shown in Table 1.
Dataset Correction. We invite two volunteers to use Inconsistency Detector to correct the two datasets. The correction covers the entire dataset, including the training, development, and test sets. It is worth mentioning that, even counting the time needed to learn and master the tool, correcting both datasets took less than half an hour.
Model Settings. We apply two of the most widely recognized NER baseline models, denoted as BiLSTM-CRF (Ma and Hovy, 2016) and BERT-Tagger (Devlin et al., 2018). We uniformly use AdamW (Loshchilov and Hutter, 2018) as the optimizer. For BiLSTM-CRF (Ma and Hovy, 2016), we use the same character embeddings as (Zhang and Yang, 2018) and set the initial learning rate to 0.01. For BERT-Tagger (Devlin et al., 2018), we set the initial learning rate to 0.00005.

Results
The model performance before and after correction is shown in Table 2, and the statistics of inconsistent entity corrections are shown in Table 3.
OntoNotes 4.0. We correct 730 entities in total, accounting for 1.4% of all entities. After correction, BiLSTM-CRF reaches a 69.38% F1-score, an increase of 1.48%. The improvement for BERT-Tagger is even more significant: an increase of 1.96%, reaching a 76.52% F1-score.
CCKS 2019. We correct 700 entities in total, accounting for 4.3% of all entities. After correction, BiLSTM-CRF reaches an 82.88% F1-score, an increase of 2.09%. Consistent with OntoNotes, the improvement for BERT-Tagger is larger: an increase of 2.57%, reaching an 85.88% F1-score.
These experiments show that the annotation expert can use CroAno to improve a dataset, and that this improvement is inherited by the widely recognized BiLSTM-CRF and BERT-Tagger models.

Related Works
Most existing crowd annotation tools for NER are dedicated to basic functions such as interface friendliness, convenience of operation, and annotation prompts, all of which are designed for annotators. This section compares the features of CroAno with the following related works. BRAT (Stenetorp et al., 2012) is a web-based general annotation tool that can handle various annotation tasks, including span annotations and relationships between spans. As an early annotation tool, BRAT has been highly influential. However, compared with crowd annotation platforms such as CroAno, it can neither manage crowdsourcing nor reduce label inconsistency.
GATE (Bontcheva et al., 2013) is a web-based collaborative text annotation framework. It enables users to carry out complex corpus annotation projects involving a distributed team of annotators. Like BRAT, GATE lacks the ability to reduce label inconsistency.
SLATE (Kummerfeld, 2019) is a lightweight annotation tool with a terminal-based workflow. It is designed for annotation experts, focusing on fast labeling. SLATE has a certain visual disagreement adjudication capability. However, due to the limited visualization ability of the terminal-based approach, it can only prompt disagreement samples in a simple way. Besides, its disagreement adjudication cannot be applied at the entity level.
AlpacaTag (Lin et al., 2019) applies a model ensemble mechanism to merge the results of different annotators, thereby realizing disagreement adjudication without relying on annotation experts. It is worth noting that the results of such a black-box model are not interpretable for humans. Especially in fields such as clinical diagnosis and drug discovery, black-box models often mean potential risks and poor persuasiveness.
YEDDA (Yang et al., 2018) provides a systematic solution for text span annotation, including collaborative user annotation and administrator evaluation. YEDDA can locate disagreement positions and display them by generating a LaTeX file, but it cannot directly adjudicate disagreement as CroAno does.
To the best of our knowledge, CroAno is the first crowd annotation platform providing tools that can locate and fix inconsistent labels. CroAno also makes full use of the abilities of both the algorithm expert and the annotation expert.

Conclusion and Future Directions
In this paper, we propose a web-based crowd annotation platform that provides a systematic solution for improving the label consistency of Chinese NER datasets. To solve instance-level inconsistency, we propose Disagreement Adjudicator. To solve corpus-level inconsistency, we propose Inconsistency Detector and Prediction Error Analyzer, which work from statistical and model perspectives respectively.
Future directions are to extend CroAno into a cross-task, multi-language version and to implement richer model analysis functions.

A Inconsistent Pre&Suffix Example
The cause of inconsistent prefixes and suffixes is annotators' divergent interpretations of the annotation guideline. For example, some entity categories have specific descriptive strings; because deleting these additive strings does not change the entity's semantics, annotators adopt opposite annotation strategies. Some of the inconsistent prefixes and suffixes detected by our algorithm in the OntoNotes dataset and the CCKS medical dataset are shown in Table 4 and Table 5. Some instances that contain inconsistent pre&suffixes in the OntoNotes dataset are shown in Table 6.