Building a Japanese Corpus of Temporal-Causal-Discourse Structures Based on SDRT for Extracting Causal Relations

This paper proposes a methodology for generating specialized Japanese data sets for the extraction of causal relations, in which temporal, causal and discourse relations are annotated at both the fact level and the epistemic level. We applied our methodology to a number of text fragments taken from the Balanced Corpus of Contemporary Written Japanese and evaluated its feasibility in terms of agreement and frequencies.


Introduction
In recent years, considerable attention has been paid to deep semantic processing (Bethard et al., 2008; Inui et al., 2003; Inui et al., 2007; Riaz and Girju, 2013), and causal relation extraction (CRE) is one of its specific tasks. Research on CRE is still developing, and many obstacles remain to be overcome. Inui et al. (2003) acquired cause and effect pairs from text, taking antecedent events as causes and consequent events as effects on the basis of Japanese keywords such as kara and node. In (1), for example, the antecedent ame-ga hutta ('it rained') and the consequent mizutamari-ga dekita ('puddles emerged') are acquired as a pair of cause and effect.
(1) Ame-ga    hutta-node         mizutamari-ga  dekita.
    rain-NOM  fall-past-because  puddles-NOM    emerge-past
    'Because it rained, puddles emerged.'
However, antecedents are not always causes or reasons for consequents in Japanese, as illustrated by the following example.
(2) Zinsinziko-ga        okita-kara           densya-ga   tiensita    to-iu-wake-dewanai.
    injury.accident-NOM  happen-past-because  trains-NOM  delay-past  it.is.not.the.case.that
    'It is not the case that the trains were delayed because an injury accident happened.'
In example (2), the antecedent zinsinziko-ga okita ('an injury accident happened') is not the cause of the consequent densya-ga tiensita ('the trains were delayed'). Although such sentences contain causal expressions without any causal relation between antecedent and consequent, existing studies such as Inui et al. (2003) extracted every sentence containing a causal expression as knowledge representing cause and effect. It is difficult for computers to recognize and exclude such cases automatically.
In this paper, we report on the analysis of necessary information for acquiring more accurate cause-effect knowledge and propose a methodology for creating a Japanese corpus for CRE. First, we introduce previous studies and describe information that should be used to annotate data sets. Next, we describe our methodology based on Segmented Discourse Representation Theory (SDRT) (Asher et al., 2003). Finally, we evaluate the validity of our methodology in terms of agreement and frequency, and analyze the results.

Previous Studies
In this section, we introduce previous studies on the annotation of temporal, causal and other types of relations and present a linguistic analysis of temporal and causal relations. Bethard et al. (2008) generated English data sets annotated with temporal and causal relations and analyzed the interactions between the two types of relations. These specialized data sets were also evaluated in terms of agreement and accuracy. Relations were classified into two causal categories (CAUSAL, NO-REL) and three temporal categories (BEFORE, AFTER, NO-REL). Regarding the evaluation, Bethard et al. pointed out that the classification was coarse-grained and that the analysis would have to be repeated with more fine-grained relations. Inui et al. (2005) characterized causal expressions in Japanese text and built a Japanese corpus tagged with causal relations. However, usages such as that illustrated in (2) and the interactions between temporal and causal relations were not analyzed. Tamura (2012) linguistically analyzed temporal and causal relations and pointed out that, in reason/purpose constructions in Japanese, the event time indicated by the tense sometimes contradicts the actual event time. Tamura also pointed out that the information needed to recognize the order of events lies in the choice between the fact and the epistemic levels (we return to these notions in Section 3.4) and in the explicit or implicit meaning of the sentence containing the causal expression. Furthermore, some causal expressions in Japanese are free from the absolute and relative tense systems, and both the past and non-past forms can be freely used in main and subordinate clauses (Chin, 1984) (an example is given in the next section). In other words, temporal relations are not always resolved before causal relations, and therefore temporal and causal relations should be resolved simultaneously.
Asher et al. (2003) proposed SDRT in order to account for cases where discourse relations affect the truth conditions of sentences. Because temporal relations constrain causal relations, and because the explicit or implicit meaning of a sentence and the epistemic level information affect the temporal order of antecedent and consequent in Japanese causal expressions, recognizing this information also affects causal relations. Therefore, the annotation of both causal relations and discourse relations in corpora is expected to be useful for CRE. Moreover, which characteristics (such as tense, actual event time, the time when the event is recognized, the meaning and structure of the sentence, and causal relations) serve as input and which serve as output varies from case to case. Therefore, discourse relations should also be taken into account together with temporal and causal relations. By annotating text with discourse, temporal and causal relations, we can create specialized data sets for evaluating these types of information together.
However, SDRT does not separate discourse relations from temporal relations, and as a result the classification of labels becomes unnecessarily complex. It is therefore necessary to rearrange the discourse relations, as the following example shows.
(3) 'The dog ran in the garden. The cat was curled up in the kotatsu heater.'
The two sentences stand in contrast, so we annotate the pair with the "Contrast" label in SDRT. At the same time, the situation described in the first sentence overlaps with that of the second sentence, so we annotate the pair with the "Background" label as well. Although many sentence pairs can receive more than one discourse relation in this way, separating temporal relations from discourse relations, as in this study, allows us to avoid overlapping discourse relations.
This study aims to rearrange the relations of SDRT into separate discourse, temporal and causal relations, and we generated specialized data sets according to our methodology. In addition, it is occasionally necessary to handle the actual event time and the time when the event was recognized separately. An example is given below.
Before we evaluate the consequent kyoo-wa benkyoo-suru-koto-ni sita ('I decided to study today'), we should recognize the fact of the antecedent Asu tesuto-ga aru ('there will be an exam tomorrow'). Whether we deal with the actual In other words, event A temporally subsumes event B.

Methodology
We extended and refined SDRT and developed our own methodology for annotating main and subordinate clauses, phrases located between main and subordinate clauses (e.g., continuative conjuncts in Japanese), two consecutive sentences and two adjoining nodes with a discourse relation. We also defined our own method for annotating propositions with causal and temporal relations. The result of tagging example (5a) is shown in (5b).
The remainder of this section is structured as follows. Sections 3.1 and 3.2 deal with temporal and causal relations, respectively. Section 3.3 covers discourse relations, and Section 3.4 describes the fact level and the epistemic level.

Temporal Relations
We consider the following three temporal relations (Table 1). We assume that they represent relations between the two events described in a pair of propositions, where each event has a start time and an end time. We also assume that (start time of e) ≤ (end time of e) for every event e. Under these assumptions, the temporal placement of any two events is limited to the three relations in Table 1.
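The interval view described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the annotation tool used in the study: the names `Event` and `temporal_relation` are hypothetical, the label "Overlap" and the exact boundary conditions (strict versus non-strict inequalities) are assumptions, and only "Precedence" and "Subsumption" are labels that appear in the running text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An event as a closed interval [start, end] on an arbitrary time axis."""
    start: float
    end: float

    def __post_init__(self):
        # The text assumes (start time of e) <= (end time of e) for every event.
        if self.start > self.end:
            raise ValueError("start time must not exceed end time")


def temporal_relation(a: Event, b: Event) -> str:
    """Classify the placement of two events (assumed definitions).

    The argument order in the returned label records which event comes
    first, or which event contains the other.
    """
    if a.end < b.start:
        return "Precedence(a, b)"   # a ends before b starts
    if b.end < a.start:
        return "Precedence(b, a)"   # b ends before a starts
    if a.start <= b.start and b.end <= a.end:
        return "Subsumption(a, b)"  # a temporally subsumes b
    if b.start <= a.start and a.end <= b.end:
        return "Subsumption(b, a)"  # b temporally subsumes a
    return "Overlap"                # partial overlap; this label is an assumption
```

With made-up times, `temporal_relation(Event(0, 5), Event(6, 9))` yields "Precedence(a, b)", and `temporal_relation(Event(0, 10), Event(2, 3))` yields "Subsumption(a, b)".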
Japanese non-past predicates occasionally express habitually repeating events, which have to be distinguished from events occurring later than the reference point. In this paper, we annotate the scope of the repetition and describe habitually repeating events as in the following example.

Causal Relations
We tag pairs of clauses with the following relation (Table 2).

Discourse Relations
We consider the following discourse relations based on SDRT (Table 3). Some relations also impose limitations on temporal and causal relations (Table 4). The way temporal, causal and discourse relations affect each other is described below together with their correspondence to the relations in SDRT. Bold-faced entries represent relations integrated in SDRT in our study. Such limitations on temporal relations provide information for deciding temporal order and cause/effect in the "de-tensed" sentence structure (Chin, 1984) in Japanese. According to Chin (1984), "de-tensed" describes a relation in which the phrase has lost the meaning contributed by tense; that is, the logical aspect of the semantic relation between an antecedent and a consequent has eliminated the temporal aspect of the relation between them. An example is given below.
In this sentence, the subordinate clause is in the non-past tense and the main clause is in the past tense. We might therefore mistakenly interpret the event in the subordinate clause as occurring after the event in the main clause. However, the rule imposed by the "Cause" relation lets us determine that it in fact occurred before the event in the main clause.
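The interplay described here, in which an annotated relation overrides a misleading tense cue, can be sketched as follows. The function names `order_from_tense` and `resolve_order` are hypothetical, and the sketch assumes that under "Cause" the subordinate clause describes the cause and that a cause is placed before its effect. It covers only this single rule and does not reproduce the full set of limitations in Table 4.

```python
from typing import Optional


def order_from_tense(sub_tense: str, main_tense: str) -> Optional[str]:
    """Naive guess from tense alone: which clause's event comes first.

    Tenses are "past" or "non-past". Returns "subordinate", "main",
    or None when tense gives no ordering cue.
    """
    if sub_tense == "past" and main_tense == "non-past":
        return "subordinate"
    if sub_tense == "non-past" and main_tense == "past":
        return "main"  # the misleading reading discussed in the text
    return None


def resolve_order(sub_tense: str, main_tense: str,
                  relation: Optional[str] = None) -> Optional[str]:
    """Let an annotated relation override the tense-based guess.

    Assumption: under "Cause", the subordinate clause is the cause,
    so its event is placed before the event of the main clause.
    """
    if relation == "Cause":
        return "subordinate"
    return order_from_tense(sub_tense, main_tense)
```

For a non-past subordinate clause with a past main clause, `resolve_order("non-past", "past")` returns "main", whereas `resolve_order("non-past", "past", "Cause")` returns "subordinate".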

Fact Level and Epistemic Level
A fact level proposition refers to an event and its states, while an epistemic level proposition refers to the speaker's act of recognizing the described event. In Japanese, the latter is often marked by the suffix noda, which attaches to all kinds of predicates and may also be omitted. Both overt and covert noda introduce embedded structures, and we annotate them in such a way that a fact level proposition is embedded in an epistemic level proposition. Semantically, the most notable difference between the two levels is that tense at the fact level represents the time at which an event takes place, while tense at the epistemic level represents the time at which the speaker recognizes the event.
This distinction between the two types of propositions is carried over to the distinction between the fact level and the epistemic level causal relations. We annotate the former by the tag "Cause" and the latter by the tag "Explanation".
In Japanese, causal markers such as node (a continuation form of noda) and kara are used at both the fact level and the epistemic level. Fact level causality is a causal relation between two events, while epistemic level causality is a causal relation between the speaker's acts of recognizing those two events. Consequently, in a causal construction the precedence relations between the subordinate and the matrix clauses at the fact level and at the epistemic level do not always coincide, as in the following example.
At the fact level, the temporal relation is that π3 precedes π1. At the epistemic level, by contrast, π2 precedes π4. By describing the relation between π1 and π3 and that between π2 and π4 separately, we can capture the relationship at both levels.
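One possible way to store the two levels side by side is sketched below. The class and function names are hypothetical. The text states that π1 and π3 are related at the fact level and π2 and π4 at the epistemic level; which epistemic proposition embeds which fact level proposition is not stated, so the pairing used here (π2 embeds π1, π4 embeds π3) is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Proposition:
    pid: str                      # segment identifier, e.g. "π1"
    level: str                    # "fact" or "epistemic"
    embeds: Optional[str] = None  # embedded fact level proposition, if any


@dataclass(frozen=True)
class Relation:
    label: str                    # e.g. "Precedence"
    first: str
    second: str


def level_of(rel: Relation, props: Dict[str, Proposition]) -> str:
    """Return the level shared by both arguments of a relation."""
    levels = {props[rel.first].level, props[rel.second].level}
    if len(levels) != 1:
        raise ValueError("relation mixes fact and epistemic levels")
    return levels.pop()


def projected(rel: Relation, props: Dict[str, Proposition]) -> Tuple[str, str]:
    """Map an epistemic level relation onto the fact level propositions it embeds."""
    return (props[rel.first].embeds, props[rel.second].embeds)


props = {p.pid: p for p in [
    Proposition("π1", "fact"),
    Proposition("π2", "epistemic", embeds="π1"),  # embedding is an assumption
    Proposition("π3", "fact"),
    Proposition("π4", "epistemic", embeds="π3"),  # embedding is an assumption
]}

relations = [
    Relation("Precedence", "π3", "π1"),  # fact level: π3 precedes π1
    Relation("Precedence", "π2", "π4"),  # epistemic level: π2 precedes π4
]

by_level = {level_of(r, props): r for r in relations}
```

Under the assumed embedding, projecting the epistemic relation onto the fact level gives the order (π1, π3), the reverse of the fact level order (π3, π1), which is the kind of mismatch that annotating the two levels separately is meant to preserve.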

Merits
We defined our methodology for annotating text fragments at both the fact and epistemic levels in parallel with temporal, causal and discourse relations. We can therefore generate specialized data sets that enable causality at the fact and epistemic levels to be estimated from various cues (such as known causal relations, truth conditions, conjunctions and temporal relations between sentences or clauses).
In addition, by annotating text with both discourse and causal relations, we can state explicitly that causal expressions without causation are not in a causal relation (and vice versa).

Results
We applied our methodology to 66 sentences from the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa, 2008). The sentences were decomposed into segments by one annotator, and labels were assigned to the decomposed segments by two annotators using the labels presented in Section 3. Our methodology was developed based on 96 segments (38 sentences). Using the other 100 segments (28 sentences), we evaluated the inter-annotator agreement as well as the frequencies of decomposition and times of annotation. The agreement for the 100 segments generated from the 28 sentences amounted to 0.68 (the kappa coefficient amounted to 0.79) and was computed as follows.

Agreement = Agreed labels / Total labels
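The ratio above, together with a kappa coefficient, can be computed from two annotators' label sequences as in the following sketch. The paper does not state which kappa variant was used, so Cohen's kappa for two annotators is an assumption here, and the function names are hypothetical. The short label lists in the usage note are toy data for illustration only, not corpus annotations.

```python
from collections import Counter
from typing import Sequence


def observed_agreement(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Agreement = agreed labels / total labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    agreed = sum(x == y for x, y in zip(labels_a, labels_b))
    return agreed / len(labels_a)


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement corrected for chance (assumed variant)."""
    p_o = observed_agreement(labels_a, labels_b)
    n = len(labels_a)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each draws independently from their own label distribution.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

For the toy sequences `["Narration", "Narration", "Cause", "Explanation"]` and `["Narration", "Cause", "Cause", "Explanation"]`, the observed agreement is 3/4 = 0.75, the chance agreement is 5/16, and Cohen's kappa is 7/11, or about 0.64.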
Analyzing more segments in actual text and refining our methodology may further improve agreement. Table 5 shows the distribution of labels over segments in our study. As Table 5 shows, "Narration" was the most frequent label, while "Alternation" never appeared. We therefore expect that frequent relations can be separated from infrequent ones. So far, every relation is either frequent or infrequent, and the data should be re-analyzed with more samples.
When the methodology was applied to the 28 sentences, a total of 100 segments were derived, an average of 3.57 segments per sentence. This figure counts segments at both the fact and epistemic levels. Without dividing the fact and epistemic levels, an average of 1.79 segments per sentence were derived.
On average, 11 segments per hour were tagged in our study. Although the validity of the methodology should be evaluated only after the average decomposition times have also been computed, we consider the methodology valid as far as labeling alone is concerned.

Discussion
We analyzed the errors in this annotation exercise. The annotators often had difficulty judging temporal relations in two cases: (i) when it was difficult to determine the scope of the segments to be paired and (ii) when the lexical meaning was difficult to formalize.
Regarding the first case, the way segments are divided sometimes affects temporal relations. In the following example, consider the temporal relation between the first and the second sentences.
(9) Marason-ni    syutuzyoo-sita.    sonohi-wa     6zi-ni   kisyoo-si,
    marathon-DAT  participate-past.  that.day-TOP  6:00-at  get.up-past,
    10zi-ni   totyoo-kara                   syuppatu-site,  12zi-ni   kansoo-sita.
    10:00-at  Metropolitan.Government-from  leave-past,     12:00-at  finish.running-past.
    'I participated in a marathon. I got up at 6:00 on that day, left the Metropolitan Government at 10:00 and finished running at 12:00.'
When we focus on the first segment of the second sentence ('I got up at 6:00'), its relation to the first sentence appears to be "Precedence". However, if we consider the second and the third segments as the same segment, their relation to the first sentence appears to be "Subsumption". Therefore, we should establish clear criteria for segmentation. Although we currently adopt the criterion of choosing the smaller segment in unclear cases, 9 unclear cases still remain (temporal: 5, discourse: 4).
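The effect of segmentation on the temporal label in example (9) can be illustrated with a small sketch. The clock times come from the example; everything else is an assumption made for illustration: the extent of the marathon event (10:00 to 12:00), the treatment of getting up, leaving and finishing as instantaneous points, the interval definitions of the relations, and the hypothetical helper names `merge` and `relation`.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in hours


def merge(*segments: Interval) -> Interval:
    """Treat several segments as one, spanning earliest start to latest end."""
    return (min(s[0] for s in segments), max(s[1] for s in segments))


def relation(a: Interval, b: Interval) -> str:
    """Assumed interval definitions of the temporal relations."""
    if a[1] < b[0]:
        return "Precedence(a, b)"
    if b[1] < a[0]:
        return "Precedence(b, a)"
    if a[0] <= b[0] and b[1] <= a[1]:
        return "Subsumption(a, b)"
    if b[0] <= a[0] and a[1] <= b[1]:
        return "Subsumption(b, a)"
    return "Overlap"  # label is an assumption


marathon = (10.0, 12.0)  # assumed extent of 'participated in a marathon'
got_up = (6.0, 6.0)
left = (10.0, 10.0)
finished = (12.0, 12.0)

# Fine-grained pairing: first segment of the second sentence vs. first sentence.
fine = relation(got_up, marathon)

# Coarse pairing: second and third segments merged into one segment.
coarse = relation(marathon, merge(left, finished))
```

Under these assumptions, `fine` is "Precedence(a, b)" (getting up precedes the marathon) and `coarse` is "Subsumption(a, b)" (the marathon subsumes the merged segment), so the label depends on which segments are paired.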
One reason why the kappa coefficient is relatively high is that we compare only the labels and ignore differences in segmentation. Criteria for deciding the segment scope when pairing segments will improve our methodology.
The second case is exemplified by the temporal relation between the subordinate clause and the main clause in the following sentence.
(10) Migawari-no    tomo-o      sukuu-tame-ni  hasiru-noda.
     scapegoat-GEN  friend-ACC  to.save        run-noda.
     'I run to save my friend who is my scapegoat.'
If we consider that the saving event spans only the very moment of saving, the relation between the clauses appears to be "Precedence". However, if we consider that the running event is a part of the saving event, the relation between the clauses is "Subsumption".
Thus, judging from lexical meaning when events start and end is difficult, and this difficulty yields delicate cases in judging temporal relations.
These two problems are related: the first arises when the components of a lexical meaning are expressed explicitly in the sentence, and the second arises when they are implicit.

Conclusions
We proposed and analyzed a methodology based on SDRT for building a more precise Japanese corpus for CRE. In addition, we annotated 196 segments (66 sentences) in BCCWJ with temporal relations, discourse relations, causal relations and fact level and epistemic level propositions, and evaluated the annotations of 100 segments (28 sentences) in terms of agreement, frequencies and times for decomposition. We reported and analyzed the results and discussed the problems of our methodology.
Discrepancies in decomposition patterns have not yet been compared empirically in the present study and will be investigated in future work.