What If Sentence-hood is Hard to Define: A Case Study in Chinese Reading Comprehension

Machine reading comprehension (MRC) is a challenging NLP task because it requires careful handling of all linguistic granularities, from word and sentence to passage. For extractive MRC, the answer span has been shown to be largely determined by key evidence units, most often a sentence. However, we recently discovered that sentences are not clearly defined in many languages, to different extents, which causes a location unit ambiguity problem: it is difficult for a model to determine which sentence contains the answer span when the sentence itself has never been clearly defined at all. Taking Chinese as a case study, we explain and analyze this linguistic phenomenon and correspondingly propose a reader with Explicit Span-Sentence Predication to alleviate the problem. Our proposed reader helps achieve a new state-of-the-art on a Chinese MRC benchmark and shows great potential for other languages.


Introduction
Machine reading comprehension (MRC) is a task that requires models to answer a question according to a given passage. It is challenging because it demands careful handling of all linguistic granularities, from word and sentence to passage. For extractive MRC, the focus of this paper, the answer span has been shown to be largely determined by key evidence units, most often a sentence (Zhang et al., 2020a). However, we recently found that sentences are not clearly defined in many languages, to different extents, which causes a location unit ambiguity problem: it becomes harder for a model to determine which sentence contains the answer span when the sentence itself has never been clearly defined. In detail, a sentence may include multiple clauses, as in English, or consist of a series of sub-sentences, as in Chinese, where all sub-sentences may share the same subject, predicate, or object. When a language has relatively strict grammatical means to determine the boundaries of sentence constituents such as clauses or sub-sentences, MRC models can more conveniently focus on a certain range of text when searching for the answer span. Otherwise, an obvious location unit ambiguity problem arises and hinders the performance of extractive MRC.
In the following, we take Chinese as a case study to explain and analyze this linguistic phenomenon and correspondingly find a solution. Regarding the characteristics of Chinese, "In terms of sentence structure, English is determined by rule, while Chinese is determined by man" (Wang, 1984); that is, English focuses more on syntax while Chinese focuses more on semantics. A long English sentence must obey strict grammatical rules so that its clauses can be clearly identified, while in Chinese such a long sentence may be written in a loose way, typically with the subject conveniently omitted from all later sub-sentences, so that the boundaries between sentences and sub-sentences are blurred. As a result, Chinese has more independent short sentences that might equally be written as a single grammar-rigorous long sentence in English (Li and Nenkova, 2015; Zhao et al., 2017; Duan and Zhao, 2020).
As shown in the Chinese MRC example in Figure 1, the sentence that completely paraphrases the question is given as a series of short sub-sentences in Chinese, which are connected by discourse relations but relatively independent in syntax.

[Figure 1: An example of location unit ambiguity of Chinese MRC models compared with English. The main alignment between the two languages is marked in orange. P and G refer to the predicted span and the ground truth answer span, respectively.]

Actually, we provide two groups of English translations in Figure 1, in which the same Chinese 'long sentence' may be accurately translated into either a series of short sentences (in small font) or a strictly well-formed long sentence (in big font). In addition to flexible word order, Chinese expressions tend to adopt ellipsis for every possible constituent, including the shared subject, leading word, or conjunctions, which makes it much more difficult to identify a strictly defined long sentence in Chinese than in English. Thus, assuming that there is an implicit process of locating the answer-related sentence before extracting the answer span, English MRC models may easily locate the complete answer-related sentence (right part of Figure 1), while Chinese MRC models may face location unit ambiguity (left part of Figure 1), ignoring some needed sub-sentences (omitting) or focusing on unrelated ones (surplus). This specific difficulty in Chinese MRC essentially calls for a mechanism that teaches the model to locate the exact answer-related sentences in an explicit way.
In this paper, we investigate whether the location unit ambiguity caused by this sentence definition difficulty can be solved well, taking Chinese extractive MRC as a case study. The basic form of extractive MRC requires models to extract a text span from the passage to answer the question, given a (passage, question) pair, as in SQuAD1.1 (Rajpurkar et al., 2016), NQ (Kwiatkowski et al., 2019), and CMRC 2018. There are also other variants: SQuAD2.0 (Rajpurkar et al., 2018), CoQA (Reddy et al., 2019), HotpotQA (Yang et al., 2018), etc. The mainstream scheme of existing models treats extractive MRC as a token-level task, that is, predicting the probability of each token being the start/end of the span, so as to extract the most suitable answer span.
Specifically, we propose ESPReader (Reader with Explicit Span-sentence Predication), which applies an extra explicit span-sentence predication (ESP) subtask to help the model locate the answer-containing sentences more precisely. ESP is automatically constructed from the original span extraction dataset and forces the model to locate the sentence containing the answer span in an explicit way. ESP is jointly trained with the original token-level task. Our model uses self-attention to acquire answer-aware sentence-level representations from ESP and then fuses them with the original token-level representations from the encoder by cross-attention for better span extraction.
Our contributions are summarized as follows:
• To the best of our knowledge, we are the first to report the sentence definition ambiguity in human language together with its negative impact on the MRC task.
• Our proposed ESP can be automatically constructed from the original corpus without extra human tagging.
• Experiments verify the performance and generality of our proposed model, and a new state-of-the-art among base-level models is achieved.
Related Work

Machine Reading Comprehension
Machine reading comprehension (MRC) is one of the main research directions in natural language processing (NLP). MRC tasks test a machine's comprehension of natural language by requiring it to answer questions about a given passage (Hermann et al., 2015; Zhang et al., 2020d). The main task types include cloze (Hill et al., 2015; Cui et al., 2016), multi-choice (Lai et al., 2017; Sun et al., 2019), and span extraction (Rajpurkar et al., 2016; Reddy et al., 2019). In this paper, we focus on Chinese MRC of the last style. MRC has made great progress, and many strong models have appeared: Read+Verify, RankQA (Kratzwald et al., 2019), SG-Net (Zhang et al., 2020c), SAE (Tu et al., 2020), Retro-Reader, etc. Among them, Reddy et al. (2020) aimed at resolving the partial-match problem in English span extraction tasks, which is close to our model design and task purpose for Chinese. Their solution is a two-stage model that first locates an initial answer, then marks it in the raw passage and redoes the reading process. In contrast, our method is a fully end-to-end model with a dedicated design that enables the model to learn accurate locations of span-sentences.

Multi-grained and Hierarchical Models
To handle the location unit ambiguity, our model adopts a middle-level (sentence-level) design, so our research correlates with multi-grained and hierarchical models (Choi et al., 2017; Wang et al., 2018; Luo et al., 2020). Shen et al. (2018) proposed a multi-grained approach combining character-level, word-level, and relation-level information for text embeddings. Another line of work proposed a claim verification framework based on hierarchical attention neural networks that learns sentence-level evidence embeddings to obtain claim-specific representations. All the above works used low-level semantic information to obtain high-level semantic representations, which differs from our intent of using sentence-level information to assist a token-level task. Zhang et al. (2020a) proposed a hierarchical network that chooses the top K answer-related sentences from the given passage, scored by cosine and bilinear scores, to build a new passage for further multi-choice tasks. Their work is somewhat similar to our method. However, we let the model directly locate the answer-containing sentence and use this sentence-level information for further token-level span extraction via cross-attention, instead of simply discarding the lower-scoring sentences.

Our Proposed Model
As shown in Figure 2, our proposed Reader with Explicit Span-sentence Predication (ESPReader) consists of three modules: the PrLM encoder, a sentence-level self-attention layer, and a fusion cross-attention layer. The details are given below.
Explicit Span-sentence Predication To enhance the model's capacity for locating answer-related sentences more precisely, an explicit span-sentence predication (ESP) is proposed as a sentence-level subtask. For the sake of the integrity of sentence structure and content, passages are divided into natural sentences by ending punctuation (",", ".", "?", and "!") rather than by a fixed length. After such segmentation, the sentence containing the answer span is labeled as a span-sentence. During training, our model is required to explicitly locate the span-sentence while extracting the answer span, which may alleviate the location unit ambiguity issue, since span-sentence boundaries are annotated at the finest sub-sentence segmentation given by punctuation. Since an answer span may stride over multiple sentences, we model the ESP subtask as predicting the locations of the start/end sentence (or sub-sentence), which is consistent with the form of the original span extraction task. A minimal sketch of how these labels can be derived automatically is shown below.
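To make the construction concrete, the following Python sketch derives ESP start/end sentence labels from the character offsets of an answer span. The punctuation set and the helper names are our own illustrative choices, not the authors' released code.

```python
# Sentence-ending punctuation used for segmentation (Chinese full-width
# marks plus their ASCII counterparts); the exact set is an assumption.
SENT_END = "，。？！,.?!"

def segment_sentences(passage):
    """Split a passage into sub-sentences at ending punctuation.

    Returns a list of (start_char, end_char) offsets; each punctuation
    mark is kept with the sub-sentence it closes.
    """
    spans, start = [], 0
    for i, ch in enumerate(passage):
        if ch in SENT_END:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(passage):  # trailing text without ending punctuation
        spans.append((start, len(passage)))
    return spans

def esp_labels(passage, answer_start, answer_end):
    """Derive start/end sentence indices for the ESP subtask from the
    character offsets of the answer span (no extra human tagging)."""
    spans = segment_sentences(passage)
    start_sent = end_sent = None
    for idx, (s, e) in enumerate(spans):
        if s <= answer_start < e:
            start_sent = idx          # sub-sentence where the span begins
        if s < answer_end <= e:
            end_sent = idx            # sub-sentence where the span ends
    return start_sent, end_sent
```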

Sentence Position Embedding
We sum four embeddings, including the sentence position embedding $E_s$ (see Appendix A for details about $E_s$), as input and let the PrLM encoder yield token representations:

$$H_t = \mathrm{PrLM}(E_w + E_p + E_t + E_s),$$

where $E_w$, $E_p$, and $E_t$ are the word embedding, position embedding (the token's offset in the whole input sequence), and token type embedding (whether the token belongs to the question or the passage), respectively. Reimers and Gurevych (2019) found that using the mean of the last-layer output vectors of a PrLM as the sentence representation marginally outperforms the overall representation given by the [CLS] token. Li et al. (2020a) claimed that using the average of the last two layers as the sentence embedding is better, and that mapping it to a standard Gaussian latent space can further eliminate the unevenness of the embedding space caused by word frequency differences.
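As a sketch of the input side, the PyTorch module below adds a sentence position embedding to the standard BERT-style word, position, and token type embeddings before encoding; the vocabulary and size constants are placeholder assumptions.

```python
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """BERT-style input embeddings extended with a sentence position
    embedding E_s; all sizes below are placeholder assumptions."""
    def __init__(self, vocab_size=21128, hidden=768, max_pos=512,
                 max_sents=128, type_vocab=2):
        super().__init__()
        self.word = nn.Embedding(vocab_size, hidden)        # E_w
        self.pos = nn.Embedding(max_pos, hidden)            # E_p
        self.token_type = nn.Embedding(type_vocab, hidden)  # E_t
        self.sent_pos = nn.Embedding(max_sents, hidden)     # E_s (added)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, input_ids, token_type_ids, sent_pos_ids):
        n = input_ids.size(1)
        pos_ids = torch.arange(n, device=input_ids.device).unsqueeze(0)
        e = (self.word(input_ids) + self.pos(pos_ids)
             + self.token_type(token_type_ids) + self.sent_pos(sent_pos_ids))
        return self.norm(e)  # fed to the PrLM encoder, which yields H_t
```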

Sentence-level Representation
Taking both experimental effectiveness and model simplicity into consideration, we average the last layer's PrLM outputs $H_t = \{h_t^1, h_t^2, \ldots, h_t^n\}$ over all tokens in the corresponding sentence to obtain the sentence-level representation $S$:

$$s_i = \frac{1}{n_i} \sum_{j=sp_i}^{sp_i + n_i - 1} h_t^j,$$

where $sp_i$ and $n_i$ are the start position offset and the length of sentence $i$, respectively.
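A vectorized way to realize this mean pooling is sketched below, assuming each token already carries a sentence index (the input `sent_ids` is a hypothetical helper produced by tokenizer alignment; per Appendix A, padding positions receive their own index).

```python
import torch
import torch.nn.functional as F

def sentence_representations(h_t, sent_ids, num_sents):
    """Mean-pool token representations within each sub-sentence.

    h_t:      [batch, seq_len, hidden] last-layer PrLM outputs
    sent_ids: [batch, seq_len] sentence index of every token (long)
    Returns S: [batch, num_sents, hidden]
    """
    # one-hot sentence membership: [batch, num_sents, seq_len]
    member = F.one_hot(sent_ids, num_sents).transpose(1, 2).float()
    counts = member.sum(dim=-1, keepdim=True).clamp(min=1)  # tokens per sentence
    return member @ h_t / counts  # per-sentence average of token vectors
```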
Sentence-level Self-attention Layer On top of the PrLM-encoded sentence representations, we apply the multi-head attention mechanism (Vaswani et al., 2017) to compute self-attention between sentences:

$$Q_i = S W^{Q,i}, \quad K_i = S W^{K,i}, \quad V_i = S W^{V,i},$$

$$A_s^i = \mathrm{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right), \qquad \tilde{H}_s = \mathrm{Concat}(A_s^1 V_1, \ldots, A_s^D V_D)\, W^O,$$

where $A_s^i$ is the sentence-level attention score of head $i$ and $D$ is the total number of heads.
Next, $\tilde{H}_s$ is passed through a feed-forward layer with GeLU activation (Hendrycks and Gimpel, 2016), followed by a residual connection and layer normalization, to obtain the final sentence-level output $H_s = \{h_s^1, h_s^2, \ldots, h_s^m\}$. To predict the start/end sentence, we use a linear layer with softmax to obtain the probability of each sentence being the start/end sentence separately:

$$s_s, s_e = \mathrm{softmax}(\mathrm{Linear}(H_s)),$$

where $\mathrm{Linear}$ is a linear transformation $d_h \rightarrow 2$.
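A minimal PyTorch sketch of this layer follows, using the library's built-in multi-head attention; the hidden size, head count, and exact residual ordering are assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class SentenceSelfAttention(nn.Module):
    """Sentence-level self-attention block: multi-head attention over
    sentence representations, a GeLU feed-forward layer, residual +
    LayerNorm, and a d_h -> 2 head for start/end sentence prediction."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden * 4), nn.GELU(),
                                 nn.Linear(hidden * 4, hidden))
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.head = nn.Linear(hidden, 2)  # start/end sentence logits

    def forward(self, s):                 # s: [batch, num_sents, hidden]
        a, _ = self.attn(s, s, s)         # self-attention between sentences
        h = self.norm1(s + a)
        h_s = self.norm2(h + self.ffn(h)) # answer-aware sentence output H_s
        start_logits, end_logits = self.head(h_s).split(1, dim=-1)
        return h_s, start_logits.squeeze(-1), end_logits.squeeze(-1)
```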
Cross-entropy loss is used as the training objective:

$$\mathcal{L}_s = -\left(y_{ss} \log s_s + y_{se} \log s_e\right),$$

where $y_{ss}$ and $y_{se}$ are the ground truth label vectors of the start/end sentence. In this way, $H_s$ is guided to become an answer-aware sentence-level representation.

Fusion Cross-attention Layer
To integrate sentence-level information into span extraction, we conduct cross-attention between the encoder output $H_t$ and the output of the sentence-level self-attention layer $H_s$. The calculation is almost the same as the self-attention above, except that the sources of the vectors $Q$, $K$, and $V$ differ: $Q$ comes from $H_t$, while $K$ and $V$ come from $H_s$:

$$Q_i = H_t W_F^{Q,i}, \quad K_i = H_s W_F^{K,i}, \quad V_i = H_s W_F^{V,i},$$

where $W_F^{Q,i}$, $W_F^{K,i}$, and $W_F^{V,i}$ are learnable parameter matrices as in the self-attention layer. The remaining calculation is exactly the same as in the sentence-level self-attention layer. Through the fusion cross-attention layer, we obtain the token-level fusion output $F_t$, which is injected with the answer-aware sentence-level representations.
Finally, a manual weight $\alpha$ is used to aggregate $F_t$ and the original encoder output $H_t$ into the final token-level output:

$$\tilde{H}_t = \alpha F_t + (1 - \alpha) H_t.$$

$\tilde{H}_t$ is then used to make the start/end span predictions $t_s$ and $t_e$:

$$t_s, t_e = \mathrm{softmax}(\mathrm{Linear}(\tilde{H}_t)).$$

Equally, cross-entropy is used as the token-level loss:

$$\mathcal{L}_t = -\left(y_{ts} \log t_s + y_{te} \log t_e\right),$$

where $y_{ts}$ and $y_{te}$ are the ground truth label vectors of the start/end span.
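The fusion layer can be sketched analogously to the self-attention block; note that the α-weighted aggregation below reflects our reading of the text, since the paper states only that α aggregates F_t and H_t.

```python
import torch.nn as nn

class FusionCrossAttention(nn.Module):
    """Cross-attention where queries come from token representations H_t
    and keys/values from sentence representations H_s, followed by an
    alpha-weighted aggregation (a sketch under stated assumptions)."""
    def __init__(self, hidden=768, heads=12, alpha=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(hidden, hidden * 4), nn.GELU(),
                                 nn.Linear(hidden * 4, hidden))
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.span_head = nn.Linear(hidden, 2)  # start/end span logits
        self.alpha = alpha

    def forward(self, h_t, h_s):
        a, _ = self.attn(h_t, h_s, h_s)        # Q from H_t; K, V from H_s
        f = self.norm1(h_t + a)
        f_t = self.norm2(f + self.ffn(f))      # fused token-level output F_t
        h = self.alpha * f_t + (1 - self.alpha) * h_t  # assumed aggregation
        start_logits, end_logits = self.span_head(h).split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)
```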

Training and Prediction
During the training phase, we jointly learn span extraction and ESP, with the final loss

$$\mathcal{L} = \mathcal{L}_t + \beta \mathcal{L}_s,$$

where $\beta$ is a manual weight. During the prediction phase, we only make start/end span predictions. The straightforward scoring function is

$$\mathrm{Score}_{raw}(i, j) = t_s^i + t_e^j,$$

where $i$ and $j$ are the start and end token positions, respectively ($0 \le i \le j \le n$). Considering that ESPReader is forced to pay more attention to whole sentences by the proposed ESP subtask, which might lengthen the predicted span, we design a scoring function with an inverse length factor (ILF). Note that a shorter span is not always better; shortness should be preferred only between candidates of similar length, for the sake of reducing redundancy. Taking all this into account, our adopted scoring function augments $\mathrm{Score}_{raw}$ with an ILF penalty on the span length $j - i$, computed relative to $l$, the average length of all candidate answer spans, and scaled by a manual weight $\mu$. When the span length deviates from the average, the ILF exerts an inhibitory effect, more strongly on long spans. See Appendix B for a more concrete impression of the ILF.
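The joint objective and the ILF-based scoring can be sketched as follows. The quadratic, asymmetric penalty inside `score_with_ilf` is a hypothetical stand-in that merely matches the qualitative behavior described here and in Appendix B (minimal at the average candidate length, steeper for longer spans); it is not the paper's actual formula.

```python
import torch.nn.functional as F

def joint_loss(span_logits, sent_logits, span_labels, sent_labels, beta=0.8):
    """Joint objective L = L_t + beta * L_s (our reading of the text).
    Each *_logits / *_labels argument is a (start, end) pair."""
    l_t = (F.cross_entropy(span_logits[0], span_labels[0])
           + F.cross_entropy(span_logits[1], span_labels[1]))
    l_s = (F.cross_entropy(sent_logits[0], sent_labels[0])
           + F.cross_entropy(sent_logits[1], sent_labels[1]))
    return l_t + beta * l_s

def score_with_ilf(t_s, t_e, i, j, avg_len, mu=0.1):
    """Span score with an inverse length factor. HYPOTHETICAL penalty:
    zero at the average candidate length, growing faster for spans that
    are longer than average than for shorter ones."""
    length = j - i + 1
    ilf = ((length - avg_len) / avg_len) ** 2
    if length > avg_len:
        ilf *= 2.0  # stronger inhibition of long spans (assumption)
    return t_s[i] + t_e[j] - mu * ilf
```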

Setup
In our ESPReader implementation, we adopt well-trained Chinese PrLMs as the encoder. Meanwhile, for each adopted PrLM, we add a one-layer MLP on top that directly predicts the start/end positions of the answer span as the default reader, forming baseline models for comparison. We consider three Chinese PrLMs: MacBERT (Cui et al., 2020), which achieves the current state-of-the-art on CMRC 2018, and the Chinese versions of BERT (base[2]) and RoBERTa (base[3] and large[4]).
Our hyperparameters are in Appendix C.
[1] There is an extra small Challenge set in CMRC 2018. It especially checks the model's reasoning capability, which is beyond the topic of this work focusing on the location unit ambiguity. We test our model on this set and observe an Avg score drop of more than 1%. This is unsurprising and explainable, since our ESP task, which forces the model to locate a certain sentence, may do little to help the model's reasoning across multiple sentences.
[2] bert-base-chinese
[3] chinese-roberta-wwm-ext
[4] chinese-roberta-wwm-ext-large

Main Results

Table 2 shows the experimental results on CMRC 2018. Compared with the baselines, our proposed model achieves significant[5] EM and F1 improvements on both base-level and large-level models, especially on EM, with an overall average increase of more than 2%. Moreover, ESPReader on MacBERT base achieves a new state-of-the-art on the CMRC 2018 leaderboard[6] among base-level models, gaining an F1 score comparable to MacBERT large and even outperforming it on EM on both the Dev and Test sets. Besides, ESPReader on MacBERT large also gains the highest EM and Avg scores among all published work.

For Different Types of Chinese MRC Tasks
To validate the generality of our method, we further test ESPReader on two other types of Chinese MRC tasks, DRCD (Shao et al., 2018) and CJRC (Duan et al., 2019) (see Appendix D for dataset details). As shown in Table 3, ESPReader obtains a visible increase on both datasets compared with our baselines.[7]

It is noticeable that the improvement on DRCD is not as significant as on CJRC (0.3% vs. 0.8% on MacBERT large). One possible explanation is that DRCD is a relatively simple task whose average answer length is 4.9, which means most answers lie within a single sub-sentence, making our ESP task unnecessary. To validate this, we compute statistics on the three Chinese MRC datasets to find the proportion of examples where a single sub-sentence is sufficient for extracting the answer span, as shown in Table 4. Notably, 99.2% of DRCD examples can find the answer span within a single sub-sentence, which is consistent with our assumption.

[5] We use McNemar's test (McNemar, 1947) to check the statistical significance of our results. For the results in both Tables 2 and 6, we obtain a p-value < 0.01.
[6] http://ymcui.com/cmrc2018/
[7] We strictly follow the settings provided by Cui et al. (2020) and report the best scores over three individual runs for each baseline.

For Other Languages
Although ESPReader is specifically designed for Chinese MRC, we also test it on English MRC benchmarks, including SQuAD2.0 (Rajpurkar et al., 2018); the results are shown in Table 5. Even though our model is not designed for English tasks, it still achieves obvious improvements over the two English MRC baseline models[8][9] (1.8% and 1.6% Avg score, respectively), which indicates our model's potential for tasks in other languages. To validate this further, we conduct a zero-shot test on the Japanese and French datasets provided by Asai et al. (2018),[10] as shown in Table 6. On both languages, our method achieves significant improvements, especially on the former.

[8] bert-base-uncased
[9] electra-base-discriminator
[10] Since Asai et al. (2018) only provide test sets for both languages, we fine-tune models on CMRC 2018 (from MacBERT base) and SQuAD1.1 (from ELECTRA base) and then directly evaluate on Japanese and French, respectively.
The above results indicate that location unit ambiguity is a common issue across languages, with different degrees of severity. As the English MRC example in Figure 3 shows, even a strictly defined clause containing the answer span can sometimes be totally unrelated to the question.

Effect of Each Module
To track the sources of ESPReader's improvement, we conduct thorough ablation studies by adding the proposed modules one by one to the baseline setting (MacBERT base). The results are in Table 7. Adding only the ESP task improves the Avg score on the CMRC 2018 Dev set by 1.3%. Further adding the sentence-level self-attention layer or the fusion cross-attention layer separately does not significantly increase the Avg score (0.1% and 0.3%, respectively). However, when both are included, another visible improvement (0.9%) is obtained. Considering that we additionally introduce the sentence position embedding ($E_s$) into BERT's embedding layer, we also compare the performance of ESPReader with and without $E_s$. As shown in Table 7, adding $E_s$ brings a marginal improvement (0.2% Avg score).

Scoring Function
We keep all other settings unchanged and adopt the two scoring functions, Score_raw and Score_ILF, respectively. The results are listed in Table 8.
The proposed scoring function Score_ILF makes a nontrivial contribution to the performance of our model on both EM and F1, especially the former (2.9% on BERT base and 3.1% on MacBERT base). This observation is in line with our assumption that ESPReader, forced by the explicit span-sentence predication to pay more attention to whole answer-related sentences, tends to lengthen its predicted spans. Note that our model improves both EM and F1 to varying degrees even when the ILF is not applied. This indicates that the model benefits from the location guidance of the ESP task more than it suffers from the resulting length growth of the predicted span, which the ILF can effectively lessen.

Error Analysis
To look deeper into the sources of the precision growth, we further study the distribution of each error type (EM = 0) after applying our method; the overall percentage distribution is shown in Table 9, and the actual counts of each error type are shown in Figure 4. Combining them, we find that with ESP our model decreases both the counts and percentages of all error types (except the surplus error); in particular, the count of F1 = 0 errors (meaning the predicted span is totally unrelated) drops by 26.3% (114 → 84).
This indicates that ESP effectively corrects the location unit ambiguity issue in Chinese MRC. Note that using the ILF for scoring helps reduce surplus errors but causes more omitting errors. To gain a concrete insight into how our method resolves location unit ambiguity, we draw the attention heatmaps of the last encoder layer of MacBERT and of our proposed model, as shown in Figure 5. The baseline model attends heavily to an answer-unrelated sub-sentence, whereas the attention distribution of our model is clearly more focused on the span-sentence, thanks to our ESP mechanism.

Conclusion
This paper addresses the newly discovered difficulty of boundary ambiguity between sentences and sub-sentences, which exists in many languages to different extents and essentially limits the performance of span extraction MRC models, especially for Chinese. We apply explicit span-sentence predication (ESP) to enhance the model's ability to precisely locate the sentences containing the target span. Our proposed model is evaluated on the Chinese span extraction MRC benchmark CMRC 2018. The experimental results show that our model significantly improves both EM and F1 scores over strong baselines and achieves new state-of-the-art performance. Our method also shows generality and potential for other languages. This work highlights the research line of improving challenging MRC by analyzing specific linguistic phenomena.

A Details about Sentence Position Embedding
Sentence position embedding ($E_s$) indicates the sentence offset of each token. For normal tokens, the sentence position is the offset of the segmented sentence they belong to. For special tokens, the sentence position is assigned as follows:
• [CLS]: set to 0.
• [SEP]: equal to that of the nearest preceding normal token.
• [PAD]: set to that of the last normal token plus 1.
In this way, every token is assigned a sentence position, and a lookup table then maps these positions to vectors, forming the sentence position embedding.
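A direct transcription of these assignment rules into Python might look as follows; `sent_index_of` is a hypothetical helper that maps a normal token's offset to the index of the segmented sentence it belongs to.

```python
def sentence_position_ids(tokens, sent_index_of):
    """Assign a sentence position to every token following Appendix A:
    [CLS] -> 0, [SEP] -> position of the nearest preceding normal token,
    [PAD] -> last normal token's position + 1."""
    ids, last = [], 0
    for offset, tok in enumerate(tokens):
        if tok == "[CLS]":
            ids.append(0)
        elif tok == "[SEP]":
            ids.append(last)          # inherit from nearest normal token
        elif tok == "[PAD]":
            ids.append(last + 1)      # one past the last normal token
        else:
            last = sent_index_of(offset)
            ids.append(last)
    return ids
```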

B ILF Curves
To concretely show how the ILF responds to different predicted span lengths ($j - i$), we plot the ILF value curves in Figure 6. The ILF reaches its minimum when the span length equals the average length of all candidate spans. Besides, the ILF has a more pronounced inhibitory effect on longer spans than on shorter ones.

C Settings of Hyperparameters
For fine-tuning the adopted PrLMs on our tasks, we set the initial learning rate in {3e-5, 5e-5}. The warm-up rate is 0.1, with an L2 weight decay of 0. The batch size is 24 for base models and 64 for large models. The number of epochs is 2 in all experiments. Texts are tokenized into wordpieces, with a maximum length of 512 and a doc stride of 128. The manual weights are α = 0.5, β = 0.8, and µ = 0.1.
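For convenience, these hyperparameters can be collected into a single configuration, sketched below with our own key names.

```python
# Hyperparameters from Appendix C as a config dict (key names are ours).
CONFIG = {
    "learning_rate": [3e-5, 5e-5],   # searched per PrLM
    "warmup_rate": 0.1,
    "weight_decay": 0.0,
    "batch_size": {"base": 24, "large": 64},
    "epochs": 2,
    "max_seq_length": 512,
    "doc_stride": 128,
    "alpha": 0.5,   # fusion aggregation weight
    "beta": 0.8,    # ESP loss weight
    "mu": 0.1,      # ILF weight
}
```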

D Details of datasets: DRCD and CJRC
DRCD and CJRC are two types of Chinese MRC tasks different from CMRC 2018.
• DRCD: This is also a span-extraction MRC task, but in Traditional Chinese. Compared with CMRC 2018, DRCD contains many more simple questions with short answers, and the overall average answer length is 4.9.
• CJRC: This is a more complex MRC task, containing yes/no, no-answer, and span-extraction questions, collected in judicial scenarios. Note that we only use 50% of the samples in big-train-data.json for training, for a fair comparison with previous work.