More than Text: Multi-modal Chinese Word Segmentation

Chinese word segmentation (CWS) is a fundamental task in natural language processing. Previous works focus only on the textual modality, yet utterances often come with audio and video (e.g., news broadcasts and face-to-face dialogues), where textual, acoustic and visual modalities coexist. To this end, we attempt to combine multiple modalities (mainly the transcribed text and the corresponding audio) to perform CWS. In this paper, we annotate a new dataset for CWS containing text and audio. Moreover, we propose a time-dependent multi-modal interactive model based on the Transformer framework to integrate multi-modal information for word sequence labeling. Experimental results on three training sets of different sizes show the effectiveness of our approach in fusing text and audio.


Introduction
Word segmentation is a fundamental task in Natural Language Processing (NLP) for languages without word delimiters, e.g., Chinese and many other East Asian languages (Duan and Zhao, 2020). In this paper, we mainly take the Chinese language as our object of investigation, namely CWS. As is well known, CWS serves as an essential preprocessing step for many other NLP tasks (Qiu et al., 2020), such as named entity recognition, sentiment analysis and machine translation.
In the literature, popular CWS systems report high performance at the level of 96%-98%, but these systems typically require a large-scale pre-segmented textual dataset for training. However, collecting such a large-scale dataset for a specific scenario, such as video monologues and audio broadcasts, is very time-consuming and resource-intensive. In these scenarios, multiple modalities are available: text, audio and vision, so using the text alone is not necessarily the best choice. For example, as shown in Figure 1, if we only read the unpunctuated text "必须不忘初心牢记使命", it is not easy to segment the words immediately. However, given the acoustic information, we can observe an obvious pause in the spectrum and sound wave between "心" and "牢", which facilitates CWS. Therefore, in this paper, we propose to perform CWS with multi-modality, namely MCWS, using a time-dependent multi-modal interactive network. Specifically, we first collect a new dataset from an audio and video news broadcast platform and annotate the word boundaries of the audio transcription text. Second, we align the text and the audio via the time stamp of each character, then encode both modalities with a Transformer-based framework to capture the intra-modal dynamics. Third, we design a time-dependent multi-modal interaction module for each character step to generate the multi-modal hybrid character representation.
Finally, we leverage the CRF to perform sequence labeling on the basis of the above character representation.
We evaluate our approach on the newly annotated small-scale dataset with training sets of different sizes. The experimental results demonstrate that our approach performs significantly better than both the single-modal state-of-the-art and multi-modal approaches with early-fused features for CWS.
Related Work
Xu (2003) first formalizes CWS as a sequence labeling task, treating CWS as supervised learning from a corpus with human-annotated segmentation. Peng et al. (2004) further adopt the standard sequence labeling tool, CRFs, for CWS modeling, achieving the best performance of their period. Since then, a large number of approaches based on the above settings have been proposed for CWS (Li and Sun, 2009; Sun and Xu, 2011; Zhang et al., 2013).
Recently, deep neural approaches have been widely proposed to minimize the feature engineering effort for CWS (Zheng et al., 2013; Pei et al., 2014; Chen et al., 2015; Cai and Zhao, 2016; Zhou et al., 2017; Yang et al., 2017; Ma et al., 2018; Wang et al., 2019a; Fu et al., 2020; Ding et al., 2020; Tian et al., 2020a). Most of these studies follow the character-based paradigm, predicting a segmentation label for each character in an input sentence. To further enhance neural CWS models, some studies leverage external information, such as vocabularies from auto-segmented external corpora and weakly labeled data (Wang and Xu, 2017; Higashiyama et al., 2019; Gong et al., 2020).
To the best of our knowledge, we are the first to perform CWS with multi-modality, which can handle multi-modal scenarios and offers an alternative solution for robustly enhancing neural CWS models.

Data Collection and Annotation
We collect the multi-modal data for CWS from a Chinese news reporting platform, "Xuexi" (https://www.xuexi.cn/). We mainly focus on the audio accompanied by machine-transcribed text. In total, we crawl 120 short videos and segment them into about 2,000 sentences. To avoid contextual influence and improve the robustness of the designed model, we randomly select 250 sentences to annotate the word boundaries; the remaining data are reserved for semi-supervised or unsupervised learning in the future.
We annotate these Chinese audio transcriptions following the CTB word segmentation guidelines of Xia (2000). Two annotators are asked to annotate the data. Thanks to the clear annotation guidelines, the inter-annotator agreement is very high, reaching 98.3%. Disagreement instances are adjudicated by an expert. The statistics of our annotated data are summarized in Table 1.

Time-dependent Multi-modal Interactive Network for CWS
In this section, we introduce our proposed multi-modal approach for CWS, namely the Time-dependent Multi-modal interactive Network (TMIN), which can capture the interactive semantics between text and audio for better word segmentation. The approach mainly consists of three modules: time-dependent uni-modal interaction, time-dependent multi-modal interaction, and CRF labeling. Figure 2 shows the overall architecture of TMIN.

Time-dependent Uni-modal Interaction
To better capture the temporal correspondences between different modalities (Zhang et al., 2019), we first align the two modalities by extracting the exact time stamp of each phoneme and character using the Montreal Forced Aligner (McAuliffe et al., 2017). For machines to understand human utterances, they must first be able to understand the intra-modal dynamics (Zadeh et al., 2018; Wang et al., 2019b; Tsai et al., 2019) within each modality, such as word order and grammar in text, and breathing and tone in audio.

Textual Modality. We use BERT (Devlin et al., 2019) as the encoder to perform intra-modal interactions and obtain contextual character representations. Each character of the text transcript can then be represented as $X = (x_1, x_2, \cdots, x_n) \in \mathbb{R}^{n \times d_1}$.
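As a concrete illustration of the time-stamp alignment and the character-level encoding above, the sketch below builds per-character text features with BERT and per-character audio features by mean-pooling frame-level acoustic features inside each character's aligned span. The frame shift, the choice of filterbank-style features, and the helper names are our own assumptions rather than details from the paper.

import torch
from transformers import BertTokenizerFast, BertModel

FRAME_SHIFT = 0.01  # assumed hop size (s) of the frame-level acoustic features

def char_text_features(chars, bert_name="bert-base-chinese"):
    """Contextual character embeddings X = (x_1, ..., x_n) from BERT."""
    tokenizer = BertTokenizerFast.from_pretrained(bert_name)
    model = BertModel.from_pretrained(bert_name)
    # Chinese BERT tokenizes character by character, so one token per character.
    enc = tokenizer(list(chars), is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (n + 2, 768) incl. [CLS]/[SEP]
    return hidden[1:-1]                              # (n, 768)

def char_audio_features(frames, char_spans):
    """Mean-pool frame-level acoustic features inside each character's span.

    frames:     (T, d_a) tensor of frame-level features (e.g., filterbanks).
    char_spans: list of (start_sec, end_sec) per character, e.g., read from
                the Montreal Forced Aligner's TextGrid output.
    """
    pooled = []
    for start, end in char_spans:
        lo = int(start / FRAME_SHIFT)
        hi = max(int(end / FRAME_SHIFT), lo + 1)
        pooled.append(frames[lo:hi].mean(dim=0))
    return torch.stack(pooled)                       # (n, d_a)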

Time-dependent Multi-modal Interaction
To better capture the cross-modal semantic correspondences (Wu et al., 2020), we design a long- and short-term hybrid memory gating (LSTHMG) block, which is an extension of the standard LSTM.
We first obtain the current memory of each character-level representation for both modalities, applying a standard LSTM (Graves et al., 2013) to the current character representation of each modality together with the previous step's states, yielding candidate hidden states $\hat{h}^x_i$ and $\hat{h}^a_i$. After this update, we employ a multi-attention gating mechanism (MA) to control the different contributions of each hidden state.
MA is designed to mine multiple potential dimension-aware importances for each modality (Zadeh et al., 2018). Its input $\hat{h}_i \in \mathbb{R}^{(d_1+d_2) \times 1}$ is the unsqueezed concatenation of $\hat{h}^x_i$ and $\hat{h}^a_i$, and $L$ denotes the maximum number of attention rounds. The query $Q_l$, key $K_l$ and value $V_l$ at the $l$-th round are defined similarly to self-attention (Vaswani et al., 2017). The output $h_i$ is the sum of the $L$ attentional state concatenations for the multi-modal representation at character step $i$, which is then used to perform word segmentation via the CRF. Besides, we split $h_i$ into a part for each modality along its own dimensions, $h^x_i$ and $h^a_i$, and feed them into the next LSTHMG step.
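Because the LSTHMG equations are only described in prose here, the sketch below is a rough, assumption-laden rendering of a single LSTHMG step: per-modality LSTM cells, L rounds of dimension-wise attention gating over the concatenated state, and the split of the fused state back into modality-specific parts for the next step. The specific form of the attention and all dimension choices are our guesses, not the authors' definition.

import torch
import torch.nn as nn

class LSTHMGStep(nn.Module):
    """One time-dependent multi-modal interaction step (rough sketch only)."""

    def __init__(self, d_text, d_audio, L=4):
        super().__init__()
        d = d_text + d_audio
        self.d_text, self.L = d_text, L
        self.text_cell = nn.LSTMCell(d_text, d_text)
        self.audio_cell = nn.LSTMCell(d_audio, d_audio)
        # One (query, key, value) projection per attention round, analogous to
        # self-attention (Vaswani et al., 2017).
        self.q = nn.ModuleList(nn.Linear(d, d) for _ in range(L))
        self.k = nn.ModuleList(nn.Linear(d, d) for _ in range(L))
        self.v = nn.ModuleList(nn.Linear(d, d) for _ in range(L))

    def forward(self, x_i, a_i, state_x, state_a):
        # Current-memory update of each modality with a standard LSTM cell.
        h_x, c_x = self.text_cell(x_i, state_x)      # (B, d_text)
        h_a, c_a = self.audio_cell(a_i, state_a)     # (B, d_audio)
        h_cat = torch.cat([h_x, h_a], dim=-1)        # concatenated hidden state

        # L rounds of dimension-aware attention gating, summed into h_i.
        h_i = torch.zeros_like(h_cat)
        for l in range(self.L):
            q, k, v = self.q[l](h_cat), self.k[l](h_cat), self.v[l](h_cat)
            gate = torch.softmax(q * k / k.size(-1) ** 0.5, dim=-1)
            h_i = h_i + gate * v

        # Split the fused state into modality-specific parts for the next step.
        next_x, next_a = h_i[:, :self.d_text], h_i[:, self.d_text:]
        return h_i, (next_x, c_x), (next_a, c_a)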

CRF Labeling
Since the textual and acoustic semantics of each character have been integrated by the time-dependent uni-modal and multi-modal interactions, we use $h_i$ to perform conditional sequence labeling. Instead of decoding each label independently, we model the labels jointly with a CRF to account for the correlations between neighboring labels. Formally,

$p(y \mid \tilde{X}) = \dfrac{\prod_{i} \exp\big(S_i(y_{i-1}, y_i, \tilde{X})\big)}{\sum_{y' \in Y} \prod_{i} \exp\big(S_i(y'_{i-1}, y'_i, \tilde{X})\big)}$

where $S_i(y_{i-1}, y_i, \tilde{X})$ and $S_i(y'_{i-1}, y'_i, \tilde{X})$ are potential functions, $\tilde{X}$ denotes the input of the CRF, and $Y$ denotes the output label space. We use maximum conditional likelihood estimation for CRF training; the log-likelihood is given by $\sum_i \log p(y \mid \tilde{X})$. In the inference phase, we predict the output sequence with the maximum score: $\arg\max_{y' \in Y} p(y' \mid \tilde{X})$.
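As a minimal sketch of this labeling layer, the code below projects the fused character representations $h_i$ to per-tag emission scores and wraps them with a linear-chain CRF; we assume the off-the-shelf pytorch-crf package and a BMES tag scheme, neither of which is confirmed by the paper.

import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (an assumed choice, not the authors' code)

TAGS = ["B", "M", "E", "S"]  # assumed BMES word-boundary labels

class CRFLabeler(nn.Module):
    def __init__(self, d_hidden, num_tags=len(TAGS)):
        super().__init__()
        self.emit = nn.Linear(d_hidden, num_tags)   # emission scores from h_i
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, h, tags, mask):
        """Negative conditional log-likelihood, -sum_i log p(y | X~)."""
        return -self.crf(self.emit(h), tags, mask=mask, reduction="mean")

    def decode(self, h, mask):
        """Viterbi decoding: argmax_{y' in Y} p(y' | X~)."""
        return self.crf.decode(self.emit(h), mask=mask)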

Experimentation
In this section, we present exploratory experimental results and a case analysis.

Experimental Setting
Data Split. We evaluate our approach on training sets of different sizes with the same validation and test sets, i.e., 50, 100 and 150 sentences for training, and 50 sentences each for validation and test. For the different training sets, the out-of-vocabulary (OOV) rate on the test set is 92.89%, 46.73% and 30.93%, respectively.

Implementation Details. The character embeddings of the text X are initialized with the cased BERT-base model (dimension 768) and fine-tuned during training. The character-level embeddings of the audio A are encoded by a Transformer with dimension 124. The learning rate, the dropout rate, and the tradeoff parameter are set to 1e-4, 0.5, and 0.5, respectively, which achieve the best performance on the development set via a small grid search over the combinations of [1e-5, 1e-4], [0.1, 0.5], and [0.1, 0.9] on two NVIDIA GTX 2080Ti GPUs with PyTorch 1.7. Based on the best development results, the number of Transformer layers for audio encoding and the multi-attention times L in the gating are set to 2 and 4, respectively. To motivate future research, the dataset, aligned features and code will be released.
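For concreteness, the reported hyper-parameters and the small grid search can be organized as below; the dictionary keys and the selection loop are our own illustration, not released configuration files.

from itertools import product

# Best hyper-parameters on the development set, as reported above.
CONFIG = {
    "lr": 1e-4, "dropout": 0.5, "tradeoff": 0.5,
    "audio_transformer_layers": 2, "attention_times_L": 4,
    "text_dim": 768, "audio_dim": 124,
}

# Ranges explored in the small grid search.
SEARCH_SPACE = {"lr": [1e-5, 1e-4], "dropout": [0.1, 0.5], "tradeoff": [0.1, 0.9]}

def grid_search(train_and_eval):
    """train_and_eval(cfg) -> dev F1; returns the best-scoring configuration."""
    best_f1, best_cfg = -1.0, None
    for lr, dropout, tradeoff in product(*SEARCH_SPACE.values()):
        cfg = dict(CONFIG, lr=lr, dropout=dropout, tradeoff=tradeoff)
        f1 = train_and_eval(cfg)
        if f1 > best_f1:
            best_f1, best_cfg = f1, cfg
    return best_cfg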
Baselines. For a thorough comparison, we implement the following approaches, with F1 as the metric: 1) the BERT and CRF framework, BC: BC(Text), BC(Audio), and BC(Text+Audio); 2) a representative state-of-the-art model, WMSEG (Tian et al., 2020b): WMSEG(Text), WMSEG(Audio), and WMSEG(Text+Audio). Note that the approaches with (Text) take character-level text as input, the approaches with (Audio) take character-level audio as input, and the approaches with (Text+Audio) take the character-level concatenation of text and audio as input.
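The evaluation quantities used below, word-level F1 and the recall of OOV words ($R_{oov}$), can be computed from gold and predicted segmentations as in this generic sketch; it is not the authors' evaluation script, and the toy example is ours.

def to_spans(words):
    """Turn a segmented sentence (list of words) into {(start, end): word}."""
    spans, pos = {}, 0
    for w in words:
        spans[(pos, pos + len(w))] = w
        pos += len(w)
    return spans

def segmentation_metrics(gold_sents, pred_sents, train_vocab):
    """Word-level P/R/F1 plus recall of OOV words (words unseen in training)."""
    tp = n_gold = n_pred = oov_tp = n_oov = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = to_spans(gold), to_spans(pred)
        correct = g.keys() & p.keys()
        tp += len(correct)
        n_gold += len(g)
        n_pred += len(p)
        oov_spans = {s for s, w in g.items() if w not in train_vocab}
        n_oov += len(oov_spans)
        oov_tp += len(oov_spans & correct)
    prec, rec = tp / n_pred, tp / n_gold
    f1 = 2 * prec * rec / (prec + rec)
    r_oov = oov_tp / n_oov if n_oov else 0.0
    return {"P": prec, "R": rec, "F1": f1, "R_oov": r_oov}

# Toy example: gold "必须 | 不忘 | 初心" vs. predicted "必须 | 不忘初心".
print(segmentation_metrics([["必须", "不忘", "初心"]],
                           [["必须", "不忘初心"]],
                           train_vocab={"必须"}))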

Main Results
Table 2 shows the performance of the different baselines compared with our approach, where the overall F-score and the recall of OOV words are reported. From this table we can see that:

1) WMSEG performs much better than the general framework BC. This indicates that incorporating wordhood information with several popular encoder-decoder combinations is effective and that WMSEG is suitable as a competitive baseline.
2) The approaches with only audio perform significantly worse than the approaches with only text, suggesting that the various acoustic features are confusing on their own and that the audio modality should be utilized properly.
3) In most cases, the baselines with both text and audio perform worse than the uni-modal approaches, which suggests that simply concatenating time-dependent character-level features is a poor choice for CWS.

4) Among all approaches, our TMIN performs best and significantly outperforms the competitive baselines (p-value < 0.05). Moreover, with regard to $R_{oov}$, we observe that TMIN is able to recognize new words more accurately. This is mainly because our approach obtains effective multi-modal information through time-dependent fusion, as opposed to purely textual, purely acoustic, or early-fused approaches.

Figure 3 illustrates a real instance of the boundaries predicted by different approaches. From this figure, we can see that both WMSEG and BC predict a wrong boundary between "史" and "性", although they determine the correct segmentation for "历史" and "成就". In contrast, our TMIN segments this instance exactly. This is mainly because the audio is very informative here: there is continuous breathing across the character "性", indicating that "历史性" is a complete word.

Conclusion and Future Work
This paper presents a new dataset for multi-modal Chinese word segmentation (MCWS), which is the first attempt to explore multi-modality for traditional CWS. In addition, we propose a time-dependent multi-modal interactive network (TMIN) to effectively integrate textual and acoustic features. The preliminary experimental results and case analysis demonstrate the validity of our motivation and the effectiveness of the proposed approach.
In the future, we will annotate more samples under the current setting and collect new samples with more modalities, such as visual information from social media, monologues and dialogues with continuous frontal face views. Moreover, we will employ neural active learning approaches for MCWS to reduce the annotation effort while preserving performance.