A Large-Scale Chinese Multimodal NER Dataset with Speech Clues

In this paper, we aim to explore an uncharted territory, namely Chinese multimodal named entity recognition (NER) with both textual and acoustic contents. To achieve this, we construct a large-scale human-annotated Chinese multimodal NER dataset, named CNERTA. Our corpus contains 42,987 annotated sentences in total, accompanied by 71 hours of speech data. Based on this dataset, we propose a family of strong and representative baseline models, which can leverage textual features or multimodal features. Upon these baselines, to capture the natural monotonic alignment between the textual modality and the acoustic modality, we further propose a simple multimodal multitask model by introducing a speech-to-text alignment auxiliary task. Through extensive experiments, we observe that: (1) Performance improves progressively as we move from unimodal to multimodal models, verifying the necessity of integrating speech clues into Chinese NER. (2) Our proposed model yields state-of-the-art (SoTA) results on CNERTA, demonstrating its effectiveness. For further research, the annotated dataset is publicly available at http://github.com/DianboWork/CNERTA.


Introduction
"Speech is a part of thought." -Oliver Sacks, Seeing Voices As a fundamental subtask of information extraction, named entity recognition (NER) aims to locate and classify named entities mentioned in unstructured texts into predefined semantic categories, such as person names, locations and organizations. NER plays a crucial role in many natural language processing (NLP) tasks, including relation extraction (Zelenko et al., 2003), question answering (Mollá et al., 2006) and summarization (Aramaki et al., 2009). Most of the research on NER, such as Lample et al. (2016); Ma and Hovy (2016); Chiu and Nichols (2016), only relies on the textual modality to infer tags. However, when texts are noisy or short, and it is not sufficient to locate and classify named entities accurately only based on textual information (Baldwin et al., 2015;Lu et al., 2018). One promising solution is to introduce other modalities as the supplement of the textual modality. So far, some studies on multimodal NER, such as Moon et al. (2018); ; Lu et al. (2018); Arshad et al. (2019); Asgari-Chenaghlu et al. (2020); ; ; Sun et al. (2020), have attempted to couple the textual modality with the visual modality and witnessed a stable improvement.
In this work, we also focus on multimodal NER. But unlike previous studies, we pay special attention to Chinese multimodal NER with both textual and acoustic contents. The motivation comes from two aspects. First, despite much recent success in multimodal NER, current studies on this topic are limited to English and largely skirt other languages. Meanwhile, previous work on Chinese NER, such as Xu et al. (2013); Peng and Dredze (2016a); Zhang and Yang (2018); Cao et al. (2018); Sui et al. (2019); Gui et al. (2019); Ma et al. (2020), totally ignores valuable multimodal information. With around 1.3 billion native speakers and the widespread adoption of short-form video apps in China, it is necessary and urgent to carry out research on Chinese multimodal NER.
Second, unlike the static visual modality, the time-varying acoustic modality plays a unique role in Chinese NER, especially in providing precise word segmentation information. In detail, different from English, Chinese is an ideographic language with no delimiters between words in written text. This characteristic is one of the major roadblocks in Chinese NER, since named entity boundaries are usually word boundaries (Zhang and Yang, 2018). Fortunately, cues contained in the fluent acoustic modality, especially pauses between adjacent words, can aid an NER model in discovering word boundaries. A classic example shown in Figure 1 illustrates this point: a sentence with ambiguous word segmentation can be disambiguated with the aid of the acoustic modality, which in turn helps the model infer correct NER tags.
In this work, we make the following efforts to advance multimodal NER. First, we construct a large-scale human-annotated Chinese NER dataset with Textual and Acoustic contents, named CNERTA. Specifically, we annotate all occurrences of 3 entity types (person name, location and organization) in 42,987 sentences originating from the transcripts of Aishell-1 (Bu et al., 2017), a corpus that has been widely employed in Mandarin speech recognition research in recent years (Shan et al., 2019; Tian et al., 2020). In particular, unlike previous multimodal NER datasets (Moon et al., 2018; Lu et al., 2018), which are all flatly annotated, CNERTA annotates not only the topmost entities but also nested entities.
Second, based on CNERTA, we establish a family of strong and representative baselines. In detail, we first investigate the performance of several classic text-only models on our dataset, including BiLSTM-CRF (Lample et al., 2016) and BERT-CRF (Devlin et al., 2019). Then, since introducing a lexicon has been proven as an effective way to incorporate word information in Chinese NER (Zhang and Yang, 2018), we implement several lexicon-enhanced models, such as Lattice-LSTM (Zhang and Yang, 2018) and ZEN (Diao et al., 2020), to explore whether the acoustic modality can provide word information beyond the lexicon. Finally, to verify the effectiveness of introducing the acoustic modality, we test some widely used multimodal models, such as CMA (Tsai et al., 2019) and MMI , on our dataset.
Third, upon these strong baselines, we further propose a simple Multi-Modal Multi-Task model (short for M3T) to make better use of the pause information in the acoustic modality. Specifically, different from coupling the visual modality with the textual modality, there is a monotonic alignment between the acoustic modality and the textual modality. Armed with such an alignment, the position of each Chinese character in the continuous speech would be determined, which would make it easy to discover pauses between adjacent words. Therefore, to automatically estimate this desired alignment, we introduce a speech-to-text alignment auxiliary task and propose a hybrid CTC/Tagging loss. In the hybrid loss, a masked CTC loss (Graves et al., 2006) is designed for enforcing a monotonic alignment between speech and text sequences.
The primary contributions of this work can be summarized as follows: • We construct CNERTA, the first human-annotated Chinese multimodal NER dataset, where each annotated sentence is paired with its corresponding speech data. To the best of our knowledge, this dataset is not only the largest multimodal NER dataset, but also the largest Chinese nested NER dataset.
• We establish a family of baselines to leverage textual features or multimodal features.
Through various experiments, we observe consistent performance boosts originating from acoustic features, which verifies the significant merits of integrating acoustic features for Chinese NER.
• We further propose a multimodal multitask method by introducing a speech-to-text alignment auxiliary task. By jointly solving the tagging task and the alignment task, the proposed method can yield SoTA results on CNERTA.

Related Work
Multimodal NER: As multimedia technology evolves, processing multimodal data is becoming a pressing issue. As a basic NLP tool, multimodal NER has attracted increasing attention in recent years. Most studies on multimodal NER focus on leveraging associated images to better identify the named entities contained in the text. Specifically, Moon et al. (2018) propose a multimodal NER network with modality attention to fuse textual and visual information. To model inter-modal interactions and filter out the noise in the visual context, subsequent work proposes an adaptive co-attention network and a gated visual attention mechanism for multimodal NER. As transformer-based models (Vaswani et al., 2017; Devlin et al., 2019) become the mainstream method in NLP, researchers have turned to studying how to fuse visual clues into transformer architectures. Some use captions to represent images as text and adopt transformer-based sequence labeling models to connect multimodal information. Others propose a Multimodal Transformer model, which empowers the transformer with a multimodal interaction module to capture the inter-modality dynamics between words and images. Different from these studies, we aim to explore an unexplored territory in this work, which is Chinese multimodal NER with both speech and textual contents.
Chinese NER: Compared with English NER, Chinese NER is more complicated since written Chinese text is not naturally segmented. Therefore, how to incorporate word information is the key challenge in Chinese NER. There are three main ways to fuse word information in Chinese NER. The first is the pipeline method, in which Chinese word segmentation (CWS) is first applied and then a word-based NER model is used. The second is to learn the CWS and NER tasks jointly (Xu et al., 2013; Peng and Dredze, 2016b; Cao et al., 2018; Wu et al., 2019). In this way, the word boundary information in the CWS task can be transferred to the NER model. The third is to resort to an automatically constructed lexicon (Zhang and Yang, 2018; Ding et al., 2019; Liu et al., 2019a; Sui et al., 2019; Gui et al., 2019; Ma et al., 2020; Xue et al., 2020). Different from all previous studies, we focus on using speech clues to incorporate word information in Chinese NER.

(Table 1 note: "Avg" denotes average, "Sent" denotes sentence, "Len" denotes length, "Prop" denotes proportion, "Ent" denotes entity and "#" denotes number.)

Dataset Acquisition and Comparison
In this work, we aim to explore Chinese NER with both speech and textual clues. But we are not aware of any such existing corpus, hence we are motivated to collect one. In this section, we will discuss the data acquisition process, subsequently present statistics of the dataset and compare the annotated dataset with other widely-used NER datasets.

Dataset Acquisition
The main challenge in data acquisition is to find a large-scale dataset that includes texts and the corresponding speech data. One possible way is to attach speech data to existing Chinese NER datasets. However, it is costly to gather hundreds of participants for recording. Therefore, we take a different approach, manually annotating NER tags on a speech recognition dataset from scratch. In detail, our annotated dataset is based on the Aishell-1 (Bu et al., 2017) dataset, a large-scale Mandarin automatic speech recognition dataset. In this dataset, text transcriptions are chosen from five domains: "Finance", "Science and Technology", "Sport", "Entertainments" and "News". There are 400 participants in the recording, and the gender of participants is balanced, with 47% male and 53% female. Speech utterances are recorded in parallel via three categories of devices: a high fidelity microphone working at 44.1 kHz, 16-bit; Android phones working at 16 kHz, 16-bit; and Apple iPhones working at 16 kHz, 16-bit.
To ensure the quality of annotation, we design a two-round annotation procedure. In the first round, we use Brat (Stenetorp et al., 2012) as the annotation tool and ask 3 internal annotators (including the first author of this paper), who are very familiar with this task, to perform annotation. They independently identify and classify named entities in the transcriptions with more than 17 characters. Cohen's kappa coefficient (Cohen, 1960) is used to measure inter-annotator agreement.
After the first round, κ = 0.965, which shows that the quality of CNERTA is satisfactory. However, there are still some sentences for which the annotators give different annotations. For those sentences, the annotators check the disagreed annotations carefully and discuss until reaching agreement on all cases. After finishing the annotation process, we split the dataset into three parts: training, development, and test sets. Table 1 shows high-level statistics of the data splits for CNERTA.
From Table 2, we observe that our corpus has unique value compared with the existing datasets. This value is reflected in the following aspects: (1) CNERTA is a large-scale dataset; (2) CNERTA is the first Chinese multimodal dataset; (3) Not only the topmost entities but also nested entities are annotated; (4) Among these datasets, the acoustic modality is introduced only in CNERTA.

Task Description
Given a text X = (x_1, x_2, ..., x_n) and its corresponding speech S = (s_1, s_2, ..., s_t), where x_i denotes the i-th Chinese character and s_j denotes the j-th waveform frame, the goal of the task is to leverage textual and speech clues to identify and classify all named entities contained in the text.

Nested Structure Linearization
Unlike flat NER, named entities may overlap and also be labeled with more than one label in nested NER. To solve nested NER, we follow Straková et al. (2019) to encode the nested entity structure into a CoNLL-like, per-character BIO encoding (Ramshaw and Marcus, 1995). There are two rules to guide the linearization: (1) entity mentions starting earlier have priority over entities starting later, and (2) for mentions with the same beginning, longer entity mentions have priority over shorter ones. A multilabel for a given Chinese character is a concatenation of all intersecting entity mentions, from the highest priority to the lowest. For more details, we refer readers to Straková et al. (2019).
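The two priority rules above can be sketched as follows. This is a minimal illustration (not the authors' code), assuming entity spans are given as (start, end-exclusive, type) tuples and intersecting labels are joined with "|":

```python
def linearize(n_chars, entities):
    """Encode possibly nested entity spans (start, end_exclusive, type)
    into per-character BIO multilabels, following the two priority rules:
    earlier start first; for equal starts, longer span first."""
    ordered = sorted(entities, key=lambda e: (e[0], -(e[1] - e[0])))
    labels = [[] for _ in range(n_chars)]
    for start, end, etype in ordered:
        for i in range(start, end):
            labels[i].append(("B-" if i == start else "I-") + etype)
    # Concatenate intersecting labels, highest priority first.
    return ["|".join(tags) if tags else "O" for tags in labels]
```

For example, a 5-character organization name containing a 2-character nested location yields multilabels such as "B-ORG|B-LOC" on the first character.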

Acoustic Encoder
The acoustic encoder is used to map raw speech signals into a continuous space. There are three parts in the proposed acoustic encoder: a speech processing layer, a convolution front end and a transformer-based encoder.
Specifically, in the speech processing layer, a speech signal first goes through a pre-emphasis filter; then gets sliced into frames and a window function is applied to each frame; afterwards, a Short-Time Fourier transform (Kwok and Jones, 2000) is employed on each frame and the power spectrum is calculated; and subsequently, the filter banks (Ravindran et al., 2003) are computed. Then, we use a convolution front end to down-sample the long acoustic features. In the convolution front end, following Dong et al. (2018); Tian et al. (2020), two 3×3 CNN layers with stride 2 are stacked for both time and frequency dimensions. Afterwards, in order to enable the acoustic encoder to attend by relative positions, the positional encoding is added to the output of the convolution front end. Finally, to effectively capture long-term dependencies, down-sampled acoustic features flow through the transformer-based encoder (Vaswani et al., 2017). The transformer-based encoder is a stack of 6 identical layers, each of which is composed of a self-attention sub-layer and a feedforward network.
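The first steps of the speech processing layer (pre-emphasis, framing and windowing) can be sketched in plain Python. This is an illustrative simplification: the STFT, power spectrum and filter banks are omitted, and the pre-emphasis coefficient, frame length and hop size are typical values rather than those used in the paper:

```python
import math

def preemphasize(signal, alpha=0.97):
    """Pre-emphasis filter: y[t] = x[t] - alpha * x[t-1]."""
    return [signal[0]] + [signal[t] - alpha * signal[t - 1]
                          for t in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Slice the signal into overlapping frames."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window applied to each frame before the Fourier transform."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
            for i in range(n)]

def windowed_frames(signal, frame_len, hop):
    """Pre-emphasize, frame, and window a raw signal."""
    emphasized = preemphasize(signal)
    window = hamming(frame_len)
    return [[s * w for s, w in zip(frame, window)]
            for frame in frame_signal(emphasized, frame_len, hop)]
```

In practice, each windowed frame would then go through a Short-Time Fourier transform, the power spectrum, and the filter banks, typically via a library such as librosa or torchaudio.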

Baselines
Based on the annotated dataset, a family of strong and representative baselines is established, including (1) text-only models presented in Section 5.1, (2) lexicon-enhanced models shown in Section 5.2 and (3) multimodal models introduced in Section 5.3.

Text-Only Model
Open-Source NLP Toolkit: Many open-source NLP toolkits, such as spaCy (Honnibal et al., 2020) and Stanza, support Chinese NER. In spaCy, a multitask CNN is employed. In Stanza, a contextualized string representation based tagger from Akbik et al. (2018) is adopted. In both spaCy and Stanza, the tagger is trained on OntoNotes (Weischedel et al., 2011). To map the output of the taggers to CNERTA's label space, expert-designed rules are used, such as PERSON → PER. Since these toolkits are designed only for flat structures, we do not evaluate them in nested settings.

PLM-CRF:
Instead of training a model from scratch, we also adopt the framework of fine-tuning a pretrained language model (PLM) on a downstream task (Radford et al., 2018). In this framework, we adopt BERT (Devlin et al., 2019) as the textual encoder and use a CRF as the decoder. In addition to initializing the textual encoder with the original pretrained BERT model, a SoTA Chinese pretrained language model, called MacBERT (Cui et al., 2020), is used. Compared with BERT, MacBERT is built upon RoBERTa (Liu et al., 2019b), and the original MLM task in BERT is replaced with the MLM-as-correction task. For more details, we refer readers to Cui et al. (2020).

Lexicon-Enhanced Model:
A drawback of the text-only methods mentioned above is that explicit word and word sequence information, which can be potentially useful, is not fully exploited. With this consideration, we also adopt lexicon-enhanced models to incorporate word lexicons. (1) Lattice-LSTM (Zhang and Yang, 2018) is a classic method that can encode a sequence of input characters as well as all potential words that match a lexicon. (2) ZEN (Diao et al., 2020) is a pretrained Chinese text encoder enhanced by an n-gram lexicon. In ZEN, n-gram contexts are extracted, encoded and integrated with the character encoder. For more details about Lattice-LSTM and ZEN, we refer readers to Zhang and Yang (2018) and Diao et al. (2020).

Multimodal Model
To leverage the acoustic modality, several multimodal models are introduced. In these models, fusion modules are built on top of the acoustic encoder and the textual encoder, and are designed to capture the interaction between the textual hidden representations X = [x_1, x_2, ..., x_n], x_i ∈ R^d, and the acoustic representations S = [s_1, s_2, ..., s_t], s_j ∈ R^d. We present two representative fusion modules: the Cross-Modal Attention (CMA) module (Tsai et al., 2019) and the Multimodal Interaction (MMI) module.
Cross-Modal Attention Module (CMA): Given the textual hidden representations X ∈ R^{d×n} and the acoustic representations S ∈ R^{d×t}, we first employ an m-head cross-modal attention mechanism (Tsai et al., 2019), treating X as the queries and S as the keys and values:

CA_i(X, S) = softmax([W_{q_i} X]^T [W_{k_i} S] / √(d/m)) [W_{v_i} S]^T
CA(X, S) = W' [CA_1(X, S); ...; CA_m(X, S)]

where CA_i refers to the i-th head of cross-modal attention, and W_{q_i}, W_{k_i}, W_{v_i} ∈ R^{(d/m)×d} and W' ∈ R^{d×d} denote the weight matrices for the query, key, value and multi-head attention, respectively. Then, we stack the following sub-layers on top:

F' = LN(X + CA(X, S))
F = LN(F' + FFN(F'))

where LN denotes layer normalization (Ba et al., 2016) and FFN denotes a fully connected feed-forward network, which consists of two linear transformations with a ReLU activation (Nair and Hinton, 2010). Finally, the new textual representations F ∈ R^{d×n}, enhanced by acoustic features, are fed into the CRF decoder to infer NER tags.
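A minimal single-head sketch of this cross-modal attention, with the learned projection matrices and the multi-head split omitted for clarity (so keys and values are simply the raw speech vectors), could look like:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(text_reps, speech_reps):
    """Single-head cross-modal attention: each textual vector (query)
    attends over all acoustic vectors (keys = values = speech_reps),
    returning one acoustically informed vector per character."""
    d = len(text_reps[0])
    fused = []
    for q in text_reps:
        # Scaled dot-product score against every speech frame.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in speech_reps]
        weights = softmax(scores)
        # Weighted sum of the speech vectors.
        fused.append([sum(w * v[j] for w, v in zip(weights, speech_reps))
                      for j in range(d)])
    return fused
```

In the full module, the learned projections, multiple heads, residual connections and layer normalization from the equations above would wrap this core computation.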
Multimodal Interaction Module (MMI): A stack of the cross-modal attention layers mentioned above makes up the multimodal interaction module. Since the architecture of MMI is complex and not the core of this paper, we do not introduce it in the main text; for more details about MMI, we refer readers to the original paper.

Proposed Method
Previous multimodal methods ignore the natural monotonic alignment between the acoustic modality and the textual modality. To capture this alignment, we propose a multimodal multitask model, called M3T. The framework of the proposed method is shown in Figure 2. In the M3T model, we adopt the CMA module to fuse acoustic information into the textual representations. Besides, a CTC projection layer is built upon the acoustic encoder, and the loss function is a combination of a masked CTC loss and the CRF loss. Specifically, through the CTC projection layer, each acoustic representation s_j ∈ R^d is first mapped to the total size of the model units (in this paper, the model unit is the Chinese character) and then passed through a log-softmax function:

G = log_softmax(W_v^T S)

where W_v ∈ R^{d×|V|} and |V| is the total number of Chinese characters. Unlike automatic speech recognition, only the characters in the given text need to be aligned, rather than the entire set of model units. Therefore, we keep unchanged only those rows of G ∈ R^{|V|×t} whose corresponding characters are contained in the given text, and fill the other rows with the value −∞. The masked tensor G is then fed into the CTC loss (Graves et al., 2006). Finally, to jointly solve the tagging task and the alignment task, a hybrid loss combining the masked CTC loss with the CRF loss is used:

L = L_CRF + λ · L_MaskedCTC

where λ is a hyperparameter.
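The row-masking step before the CTC loss can be sketched as follows. This is a toy illustration rather than the authors' implementation, assuming the logits arrive as |V| rows of t frame scores and the vocabulary includes a "<blank>" symbol, which CTC always needs:

```python
NEG_INF = float("-inf")

def mask_ctc_logits(logits, vocab, text):
    """Keep unchanged only the rows whose characters occur in the given
    transcript (plus the CTC blank); fill every other row with -inf so
    the CTC alignment can only use characters from the text."""
    allowed = set(text) | {"<blank>"}
    return [list(row) if char in allowed else [NEG_INF] * len(row)
            for char, row in zip(vocab, logits)]
```

Because -inf rows contribute zero probability after exponentiation, the CTC forward-backward computation is restricted to monotonic alignments over the transcript's own characters.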

Experiments
In this section, we carry out various experiments to investigate the effectiveness of introducing the acoustic modality. In addition, we empirically compare the proposed model and the baselines under different settings. Following previous studies on NER (Zhang and Yang, 2018), standard precision (P), recall (R) and F1-score (F1) are used as evaluation metrics.

Table 3: Precision (%), Recall (%) and F1 score (%) of baselines and our proposed method on CNERTA. ∆ denotes the number of points higher than the corresponding baseline without using the acoustic modality.
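For concreteness, entity-level P, R and F1 can be computed as below; a standard sketch, assuming entities are represented as (start, end, type) tuples:

```python
def entity_prf1(gold, pred):
    """Entity-level precision, recall and F1; an entity counts as correct
    only when its boundary and type both match exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```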

LSTM-Based Baselines:
We use 50-dimensional character embeddings, which are pretrained on the Chinese Giga-Word corpus* using word2vec (Mikolov et al., 2013). The dimensionality of the LSTM hidden states is set to 300 and the initial learning rate is set to 0.001. We train the models for 100 epochs with a batch size of 16.
Lexicon: The lexicon used in Lattice-LSTM is the same as Zhang and Yang (2018) and the lexicon used in ZEN is the same as Diao et al. (2020). Due to low speed in training and inference, we only employ Lattice-LSTM in unimodal settings.
Pretrained Language Model Fine-Tuning: We use the base models of BERT (Devlin et al., 2019), MacBERT (Cui et al., 2020) and ZEN (Diao et al., 2020). The initial learning rate of the pretrained language models is set to 1 × 10^-5. We fine-tune the models for 10 epochs with a batch size of 16.
Computing Infrastructure: All experiments are conducted on an NVIDIA GeForce RTX 2080 Ti (11 GB of memory).

* https://catalog.ldc.upenn.edu/LDC2011T13

Table 3 shows the results of the baselines and our proposed model on CNERTA. From the table, we find:

Main Results
(1) Introducing the acoustic modality significantly boosts the performance of the character-based models, such as BiLSTM-CRF, BERT-CRF and MacBERT-CRF. With the simple CMA module introducing the acoustic modality, there is a more than 1.6% improvement in both flat NER and nested NER. Furthermore, using the M3T model to leverage the acoustic modality brings a more than 3% improvement in all cases. These experimental results demonstrate the effectiveness of introducing the acoustic modality into character-based NER models.
(2) Introducing the acoustic modality can also improve the performance of lexicon-based models, such as ZEN-CRF. By introducing the acoustic modality into ZEN-CRF with the CMA module, the performance in flat NER and nested NER can be improved by 1.38% and 1.73%, respectively. Armed with the M3T model, the performance in flat NER and nested NER can be further improved by 2.93% and 3.19%. Although not as significant as the improvements for character-based models, these results still show that the acoustic modality can provide lexicon-based models with information that is not contained in the large-scale lexicon.
(3) Our proposed method (M3T) achieves SoTA results on CNERTA. Compared with CMA (Tsai et al., 2019) and MMI, there is a significant improvement. We conjecture that this is because the masked CTC loss captures the monotonic alignment between the acoustic modality and the textual modality; armed with this alignment, the model can leverage the precise word boundary information contained in speech.

Error Analysis
As the NER models established here are not yet as accurate as one would hope, we analyze the errors that occur in their output. We divide errors into type errors and boundary errors. A type error occurs when the boundary of a predicted entity is correct but its type is wrong; all other errors are classified as boundary errors. The statistics of boundary errors and type errors are shown in Table 5. From the table, we find that: (1) Errors are mainly caused by mistakenly locating the boundaries of entities. Therefore, discovering entity boundaries is the main challenge in Chinese NER. (2) Leveraging the acoustic modality can effectively reduce boundary errors. In nested NER, the number of errors decreases from 906 to 848, owing entirely to the reduction of boundary errors; the number of type errors increases slightly, which may be due to overfitting or random factors.
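The type/boundary error split defined above can be sketched as follows, again assuming entities are (start, end, type) tuples:

```python
def classify_errors(gold, pred):
    """Count type errors (correct span, wrong type) and boundary errors
    (wrong span) among the incorrect predictions."""
    gold_set = set(gold)
    gold_spans = {(s, e) for s, e, _ in gold}
    type_err = boundary_err = 0
    for s, e, t in pred:
        if (s, e, t) in gold_set:
            continue  # fully correct prediction, not an error
        if (s, e) in gold_spans:
            type_err += 1
        else:
            boundary_err += 1
    return type_err, boundary_err
```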

Case Studies
To visually show the effectiveness of introducing the acoustic modality, case studies comparing the output of BERT-CRF and BERT-M3T are presented in Table 4. From the table, we can observe that without the acoustic modality, BERT-CRF is prone to mislocating ambiguous entities, such as "沙特阿拉伯" (Saudi Arabia), "首都机场" (Capital Airport) and "国际米兰" (Inter Milan). Armed with the acoustic modality, these entities are located with complete accuracy. In the last case, BERT-M3T makes some mistakes. We listened to the corresponding audio clip and found that there is a long pause between "毕尔" and "巴鄂", which likely misleads the model's pause-based boundary cues.

Conclusion and Future Work
In this paper, we explore Chinese multimodal NER with both textual and acoustic contents. To achieve this, we construct a large-scale manually annotated multimodal NER dataset, named CNERTA. Based on this dataset, we establish a family of baseline models. Furthermore, we propose a simple multimodal multitask method by introducing a speech-to-text alignment auxiliary task. Through extensive experiments, we prove that Chinese NER models can benefit from introducing the acoustic modality and that our proposed model is effective.
In the future, we are interested in mining other information contained in speech, such as rhythm, emotion, pitch, accent and stress, to boost NER. Meanwhile, we will also work on designing speech-text pretraining tasks for building a large-scale pretrained model with multimodal capabilities.