Input Representations for Parsing Discourse Representation Structures: Comparing English with Chinese

Neural semantic parsers have obtained acceptable results in the context of parsing DRSs (Discourse Representation Structures). In particular, models with character sequences as input have shown remarkable performance for English. But how does this approach perform on languages with a different writing system, such as Chinese, a language with a large vocabulary of characters? Does rule-based tokenisation of the input help, and which granularity is preferred: characters or words? The results are promising. Even with DRSs based on English, good results for Chinese are obtained. Tokenisation offers a small advantage for English, but not for Chinese. Overall, characters are the preferred input, both for English and Chinese.


Introduction
Recently, sequence-to-sequence models have achieved remarkable performance in various natural language processing tasks, including semantic parsing (Dong and Lapata, 2016; Jia and Liang, 2016; Konstas et al., 2017; Dong and Lapata, 2018), the task of mapping natural language to formal meaning representations (Figure 1). In this short paper we focus on parsing Discourse Representation Structures (DRSs): the meaning representations proposed in Discourse Representation Theory (DRT, Kamp and Reyle, 1993), covering a large variety of linguistic phenomena including coreference, thematic roles, presuppositions, scope, quantification, tense, and discourse relations.
Several data-driven methods based on neural networks have been proposed for DRS parsing (van Noord et al., 2018b, 2019; Liu et al., 2019a; Evang, 2019; Fancellu et al., 2019; Fu et al., 2020; van Noord et al., 2020). These approaches frame semantic parsing as a sequence transformation problem and map the target meaning representation to string format. The models learn the meaning of a range of semantic phenomena by taking sentences as input and directly outputting the corresponding DRSs, without the aid of any extra linguistic information (such as part-of-speech tags or syntactic structure). These previous studies have achieved good results, but have mostly focused on English or other languages that use the Latin alphabet.
Our objective is to investigate whether the same method is applicable to Mandarin Chinese, an extremely analytic language which makes deep parsing challenging (Levy and Manning, 2003; Yu et al., 2011; Tse and Curran, 2012; Min et al., 2019). But Chinese is not only different at the level of syntax; its writing system also differs greatly from English: there are no explicit word separators in written Chinese, and there is no distinction between lower- and uppercase characters. Unlike English words, Chinese words consist of only a few characters, but the number of distinct characters is about two orders of magnitude larger than in English.
These orthographic differences are interesting in the context of previous work, as van Noord et al. (2018b) used character-level and word-level input to compare the impact of different input representations on DRS parsing for English, finding that the character-level representation obtained better performance. In this paper we investigate how Chinese fits into this picture. To the best of our knowledge, we are the first to explore methods for Chinese DRS parsing. We aim to answer the following questions:
1. Can existing DRS parsing models achieve good results for Chinese? (RQ1)
2. Given the different writing systems used for English and Chinese, which input granularity is best for either language? (RQ2)
3. Is rule-based word segmentation (tokenisation) beneficial for Chinese DRS parsing? (RQ3)
This paper is organised as follows. First we provide a short background on the formal meaning representations that we use, the difference between the writing systems of English and Chinese, and the issues that arise around characters and words. Then we will introduce our approach, the data set that we use, and how we conduct our experiments. In the final section we show that we can achieve good results for Chinese DRS parsing, with characters as the preferred representation.

Background
Representing Meaning DRT proposes DRSs to represent the meaning of sentences and short texts. An impressive repertoire of semantic phenomena is covered by DRT, including quantification, negation, reference resolution, comparatives, discourse relations, and presupposition. We use the DRS version employed in the Parallel Meaning Bank (Abzianidze et al., 2017), where concepts (triggered by nouns, verbs, adjectives and adverbs) are represented by WordNet synsets (Fellbaum, 1998), and semantic relations by VerbNet roles (Kipper et al., 2008). DRSs can be represented in box format or clause format (see Figure 1), where x, e, s, and t are discourse referents denoting individuals, events, states, and times, respectively, and b is used for variables denoting DRSs. Named entities are preserved from the original language of the input, so names in Chinese are literally transferred into the DRS interpretation (see Figure 1). This means that the only difference between English and Chinese DRSs is the way names are represented: English orthography is used for proper names in English DRSs; Chinese characters are used for names in the corresponding Chinese DRSs. The box format is the most common representation of DRSs because it is easy to read and intuitive to understand. The clause format is a flat version of the standard box notation, representing a DRS as a set of clauses. Due to its simple and flat structure it is more suitable for machine learning purposes. At the same time, however, the structure of DRSs poses a challenge to sequence-to-sequence models, because they need to be able to generate well-formed recursive semantic structures.
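To make the clause format concrete, the sketch below reads a small set of clauses into (box, operator, arguments) tuples. The clauses shown are a simplified illustration and are not the exact PMB annotation of the example sentence in Figure 1.

```python
# A minimal sketch: reading a DRS in the flat clause format into tuples.
# The clauses below are simplified for illustration and are not the exact
# PMB annotation of the example sentence.
EXAMPLE = """\
b1 REF x1
b1 Name x1 "布拉德.皮特"
b2 REF x2
b2 actor "n.01" x2
b2 EQU x1 x2
"""

def read_clauses(text):
    """Split every non-empty line into (box variable, operator, argument tuple)."""
    clauses = []
    for line in text.splitlines():
        if line.strip():
            box, op, *args = line.split()
            clauses.append((box, op, tuple(args)))
    return clauses

for box, op, args in read_clauses(EXAMPLE):
    print(box, op, args)
```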
Chinese Word Segmentation Differently from English, Chinese words are not separated by white spaces, as shown in Table 1. The first step of a typical Chinese NLP pipeline is therefore to mark word boundaries at the appropriate positions in a sentence, identifying the words that form the basic semantic units of Chinese. This process, Chinese word segmentation (Lafferty et al., 2001; Xue, 2003; Zheng et al., 2013; Cai and Zhao, 2016; Min et al., 2019), is a fundamental step for many Chinese NLP applications and directly affects downstream performance (Foo and Li, 2004; Xu et al., 2004). Despite the large body of existing research, the quality of Chinese word segmentation remains far from perfect, because many characters are highly ambiguous.
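As an illustration, the sketch below segments the example sentence with the off-the-shelf Jieba segmenter (the same tool we use later for tokenised input); the exact segmentation may vary with the Jieba version and dictionary.

```python
# A minimal sketch of Chinese word segmentation with the Jieba segmenter.
# The resulting segmentation may differ depending on Jieba's version and dictionary.
import jieba

sentence = "布拉德.皮特是个演员"    # "Brad Pitt is an actor"
words = list(jieba.cut(sentence))   # default ("accurate") mode
print(" | ".join(words))            # e.g. 布拉德 | . | 皮特 | 是 | 个 | 演员
```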
Input Formats for Neural Methods Character-level representations have proved useful for neural network models in many NLP tasks such as POS tagging (Santos and Zadrozny, 2014; Plank et al., 2016), dependency parsing (Ballesteros et al., 2015) and neural machine translation (Chung et al., 2016). However, only a few studies have used character-level representations as input for Chinese NLP tasks (Yu et al., 2017; Li et al., 2019; Min et al., 2019). For Chinese semantic parsing, previous studies mostly used word-based representations as well (Che et al., 2016; Wang et al., 2018). For English DRS parsing, however, van Noord et al. (2018b) showed that a bi-LSTM sequence-to-sequence model with character-level representations outperformed word-based representations, as well as a combination of words and characters. This will be the starting point of our exploration of Chinese DRS parsing.

Table 1: Input representations (type, English input representation, Chinese input representation) for the English sentence Brad Pitt is an actor and its Chinese translation (布拉德.皮特是个演员). Note that raw and continuous character representations are identical in Chinese. Char (tokenized) adds explicit word boundaries after tokenising the text. The symbol | represents a word boundary, while the symbol ^ represents a shift to uppercase.

Methodology
Annotated Data We use data from the Parallel Meaning Bank (PMB 3.0.0, Abzianidze et al., 2017). The documents in this PMB release are sourced from seven corpora covering a wide range of genres. For one of these corpora, Tatoeba, Chinese translations already exist, and we added them to the PMB data. For the remaining texts that had no Chinese translation, we translated the English documents into Chinese using the Baidu API, manually verified the results and, where needed, corrected the translations. Only a few translations needed major corrections: about a hundred translated sentences lacked past or future tense or used uncommon Chinese expressions. Special care was given to the translation of named entities, ambiguous words, and proverbs, which required about a thousand changes. For reasons of cost, the silver part of the data was only checked for grammatical fluency.

Chinese Meaning Representations We start from the English-Chinese sentence pairs with the DRSs originally annotated for English. Interestingly, the DRSs in the PMB can be conceived of as language-neutral. Even though the English WordNet synsets present in the DRSs are reminiscent of English, they really represent concepts, not words. Similarly, the VerbNet roles have English names, but are universal thematic roles. An exception is formed by named entities, which are grounded in the orthography of the source language. In sum, we assume that the translations are, by and large, meaning preserving, and project English to Chinese DRSs by changing all English named entities to Chinese ones as they appear in the Chinese input (see Figure 1). This semantic annotation projection method bears strong similarities to, and is inspired by, Damonte et al. (2017) and Min et al. (2019).
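A simplified sketch of this projection step is given below, assuming that names appear as quoted arguments of Name clauses and that a name mapping can be read off the translation; the example clauses and the tilde-joined name token are illustrative, and the actual pipeline may derive the mapping differently.

```python
# A simplified sketch of projecting an English DRS to a Chinese one by swapping
# the quoted arguments of Name clauses. The name mapping is assumed to be given
# (e.g. read off the translation); the real pipeline may obtain it differently.
def project_names(clauses, name_map):
    projected = []
    for clause in clauses:
        parts = clause.split()
        if len(parts) >= 3 and parts[1] == "Name":
            english = parts[-1].strip('"')
            parts[-1] = '"%s"' % name_map.get(english, english)
        projected.append(" ".join(parts))
    return projected

english_clauses = ['b1 REF x1', 'b1 Name x1 "brad~pitt"', 'b1 male "n.02" x1']
print(project_names(english_clauses, {"brad~pitt": "布拉德.皮特"}))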

Input Representation Types
We consider five types of input representations, outlined in Table 1: (i) raw characters, (ii) continuous characters (i.e., without spaces), (iii) tokenised characters, (iv) tokenised words, and (v) byte-pair encoded text (BPE, Sennrich et al., 2016). Note that for Chinese, the first two options amount to the same kind of input. For BPE, we experimented with the number of merges (1k, 5k and 10k) and found in preliminary experiments that it was preferable not to add the indicator "@" for Chinese. For English character input we use an explicit "shift" symbol (^) to indicate uppercase characters, to keep the vocabulary size low. Moreover, the | symbol represents an explicit word boundary. For tokenisation we use the Moses tokenizer (Koehn et al., 2007) for English, while we use the default mode of the Jieba tokenizer to segment the Chinese sentences. To fairly compare these different input representations, we do not employ pretrained embeddings.
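The sketch below illustrates how the character-level variants can be derived, with ^ marking an uppercase character and | a word boundary; it assumes that spaces in the raw text are rendered as word boundaries, and omits punctuation handling and BPE, so it may differ in detail from the exact Table 1 format.

```python
# A simplified sketch of the character-level input representations of Table 1.
# ^ marks an uppercase character, | marks an explicit word boundary.
# Punctuation handling and BPE merges are left out for brevity.
import jieba

def to_chars(text, keep_boundaries=True):
    out = []
    for ch in text:
        if ch == " ":
            if keep_boundaries:
                out.append("|")            # explicit word boundary
        elif ch.isupper():
            out.append("^" + ch.lower())   # shift symbol for uppercase
        else:
            out.append(ch)
    return " ".join(out)

english = "Brad Pitt is an actor"
chinese = "布拉德.皮特是个演员"

print(to_chars(english))                         # raw characters
print(to_chars(english, keep_boundaries=False))  # continuous characters
print(to_chars(" ".join(jieba.cut(chinese))))    # tokenised characters (Chinese)
```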
Output Representation Appendix B shows how DRSs are represented for the purpose of training neural models, following van Noord et al. (2018b). Variables are replaced by indices, and the DRSs are coded in either a linearised character-level or word-level clause format. For Chinese, we experimented with both representations and found that the output representation had little effect on parsing performance. To follow previous work (van Noord et al., 2018b) and to allow a fair comparison between the languages, we therefore use the character-level DRS representation for both languages.
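A simplified sketch of this output preparation is shown below: variables are replaced by indices and each clause is split into characters, keeping the indexed variables as single tokens and separating clauses with a dedicated symbol. The index scheme and separator shown are illustrative and may differ from the exact format of van Noord et al. (2018b).

```python
# A simplified sketch of preparing the target DRS: variable names are replaced
# by indices and the clauses are linearised at the character level.
# The index scheme ($0, $1, ...) and the clause separator (***) are illustrative
# and not necessarily identical to the format of van Noord et al. (2018b).
import re

def index_variables(clauses):
    mapping = {}
    indexed = []
    for clause in clauses:
        toks = []
        for tok in clause.split():
            if re.fullmatch(r"[bxest]\d+", tok):        # b1, x1, e1, s1, t1, ...
                mapping.setdefault(tok, "$%d" % len(mapping))
                tok = mapping[tok]
            toks.append(tok)
        indexed.append(" ".join(toks))
    return indexed

def to_char_level(clause):
    """Split a clause into characters, keeping indexed variables as single tokens."""
    return " ".join(tok if tok.startswith("$") else " ".join(tok)
                    for tok in clause.split())

clauses = ['b1 REF x1', 'b2 actor "n.01" x2']
target = " *** ".join(to_char_level(c) for c in index_variables(clauses))
print(target)
```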
Data Splits We distinguish between gold (manually corrected meaning representations) and silver (automatically generated and partially corrected meaning representations) data. There are a total of 8,403 English-Chinese documents with gold data, of which 885 are used as development set and 898 as test set. The silver data (97,597 documents) is only used to augment the training data, following van Noord et al. (2018b). To make effective use of the high-quality data, we apply a fine-tuning approach: we first train the system on silver and gold data, and then restart the training to fine-tune on only the gold data.
Neural Architecture We use a recurrent sequence-to-sequence neural network with two bi-directional LSTM layers (Hochreiter and Schmidhuber, 1997) as implemented by Marian. Hyperparameter settings are given in Appendix A.


Results and Discussion

Table 3 shows the average of five runs for each input representation type. Generally, performance on English is significantly better than on Chinese, which is not surprising as the DRSs are based on English input, using English WordNet synsets as concepts (see Figure 1). Given this, it is remarkable that Chinese reaches such high scores, considering the differences between the two languages in how they convey meaning (Levy and Manning, 2003). In general, F-scores start to decrease as sentences get longer (Figure 2), though there is no clear difference between the character- and word-level models. This is in line with the findings of van Noord et al. (2018b).

For English, the input types based on characters outperform those based on words. BPE approaches character-level performance for a small number of merges (1k), but never surpasses it. This too is in line with van Noord et al. (2018b), but also with previous work on NMT for Chinese (Li et al., 2019). There is a small benefit (0.5) to tokenizing the input text before converting it to character-level format, though the continuous character representation also works surprisingly well. For Chinese, character-based input shows the best performance too, though for a very small number of merges BPE obtains a similar score. As opposed to English, tokenizing the Chinese input is not beneficial when using a character-level representation, though it also does not hurt performance. In general, character-level models seem the most promising for Chinese DRS parsing. Similar results were obtained by Min et al. (2019) for Chinese SQL parsing.

Figure 3: F-scores per clause type (DRS operators, VerbNet roles and WordNet concepts) and concept type (nouns, verbs, adjectives, adverbs and events) as introduced by van Noord et al. (2018b). Reported scores are on the Chinese and English dev set for the raw character-level models, averaged over 5 runs.

Figure 3 shows detailed scores for the character-based (raw) model on the Chinese and English dev sets, categorizing operators (e.g., negation, presupposition or modalities), VerbNet roles (e.g., Agent, Theme), predicates, and senses. Modifiers, especially adverbs, get a systematically lower score in Chinese compared to English. This is interesting: examination of the data reveals that English adverbs are regularly translated as Chinese noun phrases (e.g., slightly → a little). This lowers the F-score even though the meaning is preserved, only expressed in a semantically different way.

Conclusion and Future Work
DRS parsing for Chinese based on projecting meaning representations from English translations gives remarkable performance (RQ1), though Chinese adverbs remain challenging. English results outperform those for Chinese, but this is likely due to the general bias of the meaning representations towards English. As for English, we find that characters are the preferred input representation for Chinese (RQ2). Surprisingly, for English, good results are even obtained by using characters without spaces as input. Tokenisation (segmenting the text into words) of the input offers a small advantage for English, but not for Chinese (RQ3), though it will be interesting to experiment with higher-quality word segmentation systems (Higashiyama et al., 2019; Tian et al., 2020).
There are many research directions one could take next. One is to include pre-trained models. For instance, we could use recently proposed pre-trained models such as BART or mBART to improve parsing performance. Another interesting idea is, rather than assuming the English WordNet as a background ontology for concepts in the DRS, to use concepts based on the Chinese WordNet or multilingual wordnets (Wang and Bond, 2013; Bond and Foster, 2013). Both possibilities are likely to further improve the performance of semantic parsing for Chinese and to inspire the development of semantic parsing models for languages other than English.

A Hyperparameters

Table 4 gives an overview of the hyperparameters we experimented with in the tuning stage. The hyperparameters of the bi-LSTM model are mostly taken from van Noord et al. (2018b), but tuned on the Chinese development set. The hyperparameters of the Transformer model were randomly selected and then also tuned on the Chinese development set. We also experimented with the hyperparameter selection of Liu et al. (2019b) for the Transformer model, but did not obtain the desired results.

Fine-tuning We first train the models on gold + silver data for 15 epochs, then we restart the training process from that checkpoint and fine-tune on only the gold data for 30 epochs.