Evaluation Scheme of Focal Translation for Japanese Partially Amended Statutes

For updating the translations of Japanese statutes based on their amendments, we need to consider the translation "focality": that is, we should modify only the expressions that are relevant to the amendment and retain the others, to avoid misconstruing the amendment's contents. In this paper, we introduce an evaluation metric and a corpus to improve focality evaluations. Our metric is called the Inclusive Score for DIfferential Translation (ISDIT). ISDIT consists of two factors: (1) the n-gram recall of expressions unaffected by the amendment and (2) the n-gram precision of the output against the reference. This metric supersedes an existing focality metric by simultaneously evaluating the translation quality of the changed expressions in addition to that of the unchanged expressions. We also newly compile a corpus for Japanese partial amendment translation that secures the focality of the post-amendment translations, which an existing evaluation corpus does not. With the metric and the corpus, we examine the performance of existing translation methods for Japanese partial amendment translation.


Introduction
In today's globalized society, governments must quickly announce their statutes worldwide to facilitate international trade, economic investment, legislation support, and so on. The Japanese government addressed this issue in April 2009 by launching the Japanese Law Translation Database System (JLT) (Toyama et al., 2011), where it publishes English translations of Japanese statutes.
However, as of January 2020, only 23.4% (163/697) of the translated statutes in JLT correspond to their latest versions (Yamakoshi et al., 2020). After amending a statute, its translation must be promptly updated to avoid creating confusion among international readers. Unfortunately, statutory sentences are much tougher to translate than ordinary sentences because the former are highly technical, complex, and long.
Furthermore, when translating statutory sentences that are partially modified by an amendment, we must consider focal translations. That is, we should modify only the expressions that are changed by the amendment, without changing the others. For example, consider the sentence translated as "The request shall be made in a document stating the facts of the accident." Its amendment rewrote "jiko" (accident) to "kainan" (marine accident). The revision "The request shall be made in a document stating the facts of the marine accident" satisfies the focality requirement because it contains the minimum modifications. On the other hand, although "The petition shall be made in a document describing the facts of the marine accident" is fluent and adequate, it is unsuitable as a revision from the focality perspective because "moshitate" (request) and "shimeshite" (stating), which are irrelevant to the amendment, were changed in translation.

Yamakoshi et al. (2020) proposed a machine translation method for Japanese partial amendment translation that generates translation candidates with a Transformer (Vaswani et al., 2017)-based neural machine translation (NMT) model. It selects the best candidate by comparing the candidates with the output of a template-aware statistical machine translation (SMT) model (e.g., Koehn and Senellart, 2010; Kozakai et al., 2017) that changes only the affected expressions. They also proposed an evaluation metric for the focality of the translations.
However, we argue that two matters from their study must be improved: the evaluation metric and the dataset.

[Figure 1: Amendment sentence in an amendment act (Act No. 34 of 2019) and its translation.
第百六十四条 ① 第四項を削り、② 第三項後段を削り、③ 同項第一号中「の父母」を「(十五歳以上のものに限る。)」に改め、④ 同項第二号中「前号に掲げる」を「…に対し親権を行う」に改め、⑤ 同項第三号を削り、⑥ 同項を同条第六項とし、⑦ 同項の次に次の一項を加える。 ７ 特別養子適格の…(省略)
In Article 164, ① delete paragraph 4, ② delete the latter part of paragraph 3, ③ replace "the parents of" with "(limited to a child of 15 years of age or older)" in item (i) of the same paragraph, ④ replace "set forth in the preceding item" with "who exercises parental authority over …" in item (ii) of the same paragraph, ⑤ delete item (iii) of the same paragraph, ⑥ regard the same paragraph as paragraph 6 of the same Article, ⑦ add the following paragraph next to the same paragraph: 7 … of special adoption eligibility … (omitted)]

The first is the evaluation metric. Their metric consists of two factors: (1) the n-gram recall of expressions unaffected by amendments and (2) a redundancy penalty for lengthy outputs. Although this metric can evaluate how completely a method retained expressions irrelevant to the amendment, it cannot evaluate how adequately the method translated expressions relevant to the amendment. The second is the dataset they used for their experiments.
Their translation examples of partially amended statutory sentences are taken from amendment-version-controlled bilingual statutes in JLT. However, translations in JLT are not always focal. Therefore, their reported scores do not seem accurate.
In this paper, we address these two matters. For the first, we introduce another metric for focality, the Inclusive Score for DIfferential Translation (ISDIT), which incorporates n-gram precision between the output and the reference instead of a redundancy penalty. With this modification, the metric simultaneously evaluates the translation quality of both the changed and the unchanged expressions, which together indicate the quality of a focal translation. For the second, we compile a corpus that secures focality between pre- and post-amendment translations, which we achieve by asking professional human translators to produce focal post-amendment translations.
This paper makes the following contributions to amended statutory sentence translation tasks:
• it introduces a new metric that more adequately reflects the focality of translations;
• it compiles a translation corpus that ensures the focality of post-amendment translations;
• it examines the translation performance of relevant methods with the metric and the corpus.
This paper is organized as follows. In Section 2, we clarify the background of our study. In Section 3, we explain related work. In Section 4, we describe our proposal, and we present our evaluation experiments and discussions in Section 5. Finally, we summarize and conclude in Section 6.

Background
In this section, we clarify the background of our study. First, we introduce the partial amendment process in Japanese legislation from the viewpoint of document modification and then we identify our study objective in the process.

Partial Amendments in Japanese Legislation
In Japanese legislation, a partial amendment is created by "patching" modifications to a target statute. Such modifications are prescribed as amendment sentences in an amendment statute.
Based on their functions, Ogawa et al. (2008) categorized such modifications into three broad groups, each with subcategories: replacing expressions within a sentence (category 1), adding or deleting elements such as paragraphs, sentences, and items (category 2), and renumbering elements (category 3). For modifying part of a sentence, Japanese legislation rules (Hoseishitsumu-Kenkyukai, 2018) mandate that the target expressions be unique and form a chunk of meaning. Figure 1 shows an example of an amendment sentence prescribed by an amendment act. Each of the seven modifications in the sentence can be assigned to one of the categories above: modifications ①, ②, and ⑤ belong to category 2(c) (deletion of a paragraph, a sentence, or an item); modifications ③ and ④ belong to category 1(a); modification ⑥ belongs to category 3(c); and modification ⑦ belongs to category 2(b).

Most statutes enacted in recent years are amendment statutes. According to Nihon Horei Sakuin (Index of Japanese Statutes), 78% (73/94) of the acts enacted in 2019 were amendment acts.

[Figure 2: Differential translations in an amended statutory sentence.
Pre-amendment translated sentence: "the request set forth in the preceding paragraph shall be made in a document stating the facts of the marine accident."
Post-amendment translated sentence (our objective): "the request set forth in the preceding paragraph shall be made in a document stating the facts of the marine accident and the details of the intentional or negligent act committed in the course of duties of the examinee."]

After amending statutes, we should promptly update their translations provided in JLT. However, as discussed in the introduction, many statutes available in JLT are out of date, which can convey wrong legal facts to international readers.

Objective
To solve the problem discussed in the previous section, our study focuses on automatically translating partially amended statutes. More specifically, it addresses the task defined by Yamakoshi et al. (2020). Among the categories described in the previous section, the task focuses on those that modify parts of an existing statutory sentence (i.e., category 1); in Fig. 1, modifications ③ and ④ are the targets. It also targets category 2, especially modifications that insert an additional sentence (e.g., a proviso) into an existing element or delete such a sentence, since such additions and deletions affect the main sentence. Modification ② in Fig. 1, which deletes the latter part of a paragraph, is one such case.
The task takes a triple of sentences (a pre-amendment original sentence, a post-amendment original sentence, and a pre-amendment translated sentence) as input and generates a translation for the post-amendment original sentence, called a post-amendment translated sentence. The pre- and post-amendment original sentences are the statutory sentences in a statute before and after an amendment, respectively. The pre-amendment translated sentence is a translation of the pre-amendment original sentence. Figure 2 illustrates this task.
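Concretely, one task instance can be represented as a simple data structure. The sketch below uses our own illustrative field names together with the example sentences from the introduction; it is not an interface defined in the paper:

```python
from dataclasses import dataclass

@dataclass
class DifferentialTranslationExample:
    """One task instance; field names are illustrative, not from the paper."""
    pre_original: str      # pre-amendment Japanese statutory sentence
    post_original: str     # post-amendment Japanese statutory sentence
    pre_translation: str   # existing English translation of pre_original
    post_translation: str  # target output: a focal post-amendment translation

ex = DifferentialTranslationExample(
    pre_original="(pre-amendment Japanese sentence)",
    post_original="(post-amendment Japanese sentence)",
    pre_translation="The request shall be made in a document "
                    "stating the facts of the accident.",
    post_translation="The request shall be made in a document "
                     "stating the facts of the marine accident.",
)
```

A system receives the first three fields and must produce the fourth, reusing as much of `pre_translation` as possible.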
In generating post-amendment translated sentences, Yamakoshi et al. advocated the focality of translations. This idea argues for modifying only the expressions that are changed by the amendment, without changing the others, for two reasons from the viewpoint of precise publicization. First, such sentences clearly represent the amendment contents, which helps international readers understand them; non-focal translations, in contrast, contain unnecessary modifications that blur the amendment contents. Second, since the expressions in the pre-amendment translated sentences are assumed to be reliable, reusing them ensures translation quality.
For example, assume that an amendment statute instructs us to replace "kainan no jijitsu" (the facts of the marine accident) with "kainan no jijitsu oyobi … no naiyo" (the facts of the marine accident and the details of …), as depicted in Figure 2. In this case, we should replace "the facts of the marine accident" in the pre-amendment translated sentence with "the facts of the marine accident and the details of …" and retain the other expressions to comply with focality.
We define our task as follows:
Input: a pre-amendment original sentence, a post-amendment original sentence, and a pre-amendment translated sentence;
Output: a post-amendment translated sentence.

Related Work
We describe related work in this section. We overview machine translation methods suitable for partially amended sentences in Section 3.1 and discuss evaluation metrics and data in Sections 3.2 and 3.3, respectively.

Method
We consider the focality of translations, which is uncommon in ordinary machine translation tasks. To achieve focal translations, the unchanged expressions must be retained exactly as they appear in the pre-amendment translation. One solution is a template-aware SMT method. Koehn and Senellart (2010)'s method is one choice: it retains the unchanged expressions in the pre-amendment translations by copying them into the post-amendment translations. Kozakai et al. (2017) adapted this method to Japanese partial amendment translation with two modifications. First, they used pre-amendment original sentences and their translations instead of a relevant pair retrieved from a translation memory. Second, to determine the target expressions, they used the underlined information in a comparative table instead of the edit distance; such underlined information is a more reasonable translation unit than edit distance, since sentence modifications are made in chunks of meaning in Japanese legislation.

Both methods can meet the focality requirement by copying the unchanged expressions in the pre-amendment translated sentences. However, the translation quality, especially fluency, suffers for three reasons. First, they use SMT as the translation model, which is typically outperformed by NMT. Second, they completely lock the unchanged expressions, which may overly restrict the translations. Third, they use word alignment to find the English expressions that correspond to Japanese ones, which can weaken performance due to alignment errors.

Yamakoshi et al. (2020)'s method addresses these problems by incorporating NMT with template-aware SMT. Their method, which uses an NMT model and a template-aware SMT model, lets the former output n-best translations as candidates, applying Monte Carlo (MC) dropout (Gal and Ghahramani, 2016) to improve output diversity.
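This generate-then-select scheme can be sketched as follows. This is our own simplification: the NMT and SMT models are stubbed out as token lists, and candidates are compared with the interim reference by a simple clipped n-gram overlap, which may differ from the exact similarity measure used in the paper:

```python
from collections import Counter

def pooled_ngrams(tokens, n=4):
    """All i-grams for i = 1..n, with counts."""
    return Counter(tuple(tokens[j:j + i])
                   for i in range(1, n + 1)
                   for j in range(len(tokens) - i + 1))

def ngram_overlap(a, b, n=4):
    """Symmetric clipped n-gram overlap between two token lists."""
    ca, cb = pooled_ngrams(a, n), pooled_ngrams(b, n)
    inter = sum((ca & cb).values())  # Counter & = element-wise min
    denom = max(sum(ca.values()), sum(cb.values()))
    return inter / denom if denom else 0.0

def select_candidate(nmt_candidates, interim_reference):
    """Pick the fluent NMT candidate closest to the focality-preserving
    template-aware SMT output (the interim reference)."""
    return max(nmt_candidates,
               key=lambda c: ngram_overlap(c, interim_reference))

candidates = [
    "the petition shall be made in writing".split(),
    "the request shall be made in a document".split(),
]
interim = "the request shall be made in a document".split()
best = select_candidate(candidates, interim)  # → the second candidate
```

The SMT output anchors the selection toward focality, while the chosen candidate itself comes from the more fluent NMT n-best list.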
It then chooses the candidate that most resembles the interim reference translation generated by the template-aware SMT model.

Metrics

Kozakai et al. (2017) used BLEU (Papineni et al., 2002) and RIBES (Hirao et al., 2014) as automatic evaluation metrics in their experiments. BLEU is calculated from the n-gram precision of the system output against references; RIBES is calculated from word-order correlation and is therefore more sensitive to drastic structural modifications. However, both metrics are indifferent to whether an expression in the system output belongs to a part changed by the amendment, and thus both fail to indicate the quality of the focality.

Yamakoshi et al. (2020) proposed focality scores to solve this issue. A focality score quantifies the focality of the system output by calculating the recall of the n-grams shared by the pre- and post-amendment translations. With pre-amendment translated sentence W_PrT and actual post-amendment translated sentence W_PoT written by humans, the focality score of a generated sentence Ŵ_PoT is

Foc(Ŵ_PoT; W_PrT, W_PoT) = RP · Rec(Ŵ_PoT; W_PrT, W_PoT),

where RP is a redundancy penalty that avoids overestimating the scores of redundant (overly long) sentences, computed from the word counts |Ŵ_PoT| and |W_PoT| (|W| denotes the word count of W). Rec is the recall of the n-grams shared by W_PrT and W_PoT:

Rec(Ŵ_PoT; W_PrT, W_PoT) = Σ_{s ∈ CN({W_PrT, W_PoT})} min(c_{Ŵ_PoT}(s), c_CN(s)) / Σ_{s ∈ CN({W_PrT, W_PoT})} c_CN(s),

where c_W(s) is the number of occurrences of the n-gram s in W, c_CN(s) = min(c_{W_PrT}(s), c_{W_PoT}(s)), and CN(W), where W = {W_1, W_2, …, W_m}, returns the common n-grams of W_1, W_2, …, W_m:

CN(W) = ngrams(W_1) ∩ ⋯ ∩ ngrams(W_m),

where ngrams(W) returns all n-grams in W for a given n. We use multiple lengths of n-grams:

ngrams(W) = ∪_{i=1}^{n} i-gram(W),

where i-gram(W) returns the i-grams of W.
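The focality score can be sketched in a few lines of Python. Two assumptions are ours: i-grams for i = 1..n are pooled into one multiset, and the redundancy penalty RP is given a BLEU-style exponential form that penalizes outputs longer than the reference (the paper's exact RP may differ):

```python
from collections import Counter
from math import exp

def pooled_ngrams(tokens, n=4):
    # all i-grams for i = 1..n, with counts
    return Counter(tuple(tokens[j:j + i])
                   for i in range(1, n + 1)
                   for j in range(len(tokens) - i + 1))

def focality(hyp, pre_trans, post_trans, n=4):
    """Foc = RP * Rec: recall of the n-grams shared by the pre- and
    post-amendment reference translations, times a redundancy penalty."""
    # CN with clipped counts: Counter & = element-wise min
    cn = pooled_ngrams(pre_trans, n) & pooled_ngrams(post_trans, n)
    total = sum(cn.values())
    rec = sum((pooled_ngrams(hyp, n) & cn).values()) / total if total else 1.0
    # assumed RP form: < 1 only when the output is longer than the reference
    rp = min(1.0, exp(1.0 - len(hyp) / len(post_trans)))
    return rp * rec

pre = "a b c d".split()
post = "a b x d".split()
score_focal = focality(post, pre, post)                  # → 1.0
score_unfocal = focality("a b x z".split(), pre, post)   # → 0.75
```

An output that reproduces the reference scores 1.0; dropping the shared token "d" loses one unigram out of the four shared n-grams (a, b, d, "a b"), giving 0.75.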

Data

Kozakai et al. (2017) compiled a differential translation corpus of partially amended statutory sentences from amendment-version-controlled bilingual statutes in JLT. However, some of these examples are not focal because they contain modifications irrelevant to the amendment. Table 1 describes such an example: the straight underlines in its sentences depict modifications that correspond to the amendment, and the wavy underlines depict modifications irrelevant to it. "Request," "a document," and "stating" in W_PrT are replaced with "petition," "writing," and "describing" in W_PoT, respectively, although the corresponding Japanese expressions "moshitate," "shomen," and "shimeshite" were retained throughout the amendment. An ideal translation for W_PoT, shown in the table's last row, retains all the expressions irrelevant to the amendment.

[Table 1: A non-focal translation example in the existing corpus.
W_PoT (non-focal): "The petition set forth in the preceding paragraph shall be made in writing describing the facts of the marine accident and the details of the intentional or negligent act committed in the course of duties of the examinee."
Focal W_PoT: "The request set forth in the preceding paragraph shall be made in a document stating the facts of the marine accident and the details of the intentional or negligent act committed in the course of duties of the examinee."]

Proposal
In this section, we propose an evaluation scheme for Japanese partial amendment translations. Our evaluation scheme comprises a new evaluation metric, ISDIT, and a differential translation corpus that secures the focality of its examples.

ISDIT Scores
The focality score in Section 3.2 assesses only the retention rate of the unchanged expressions in W_PrT; that is, it is unaware of the adequacy of the expressions that are relevant to the amendment. We therefore update the focality score so that it assesses both factors. Our metric, the Inclusive Score for DIfferential Translation (ISDIT), combines two factors: Rec, the recall defined in Eq. 3, and Pre, the precision of the system output Ŵ_PoT against the reference W_PoT:

Pre(Ŵ_PoT; W_PoT) = Σ_{s ∈ ngrams(Ŵ_PoT)} min(c_{Ŵ_PoT}(s), c_{W_PoT}(s)) / Σ_{s ∈ ngrams(Ŵ_PoT)} c_{Ŵ_PoT}(s).

For example, consider the examples shown in Table 2. Case 1 contains an unnecessary modification, and Case 2 fails to translate "yonjuman" (four hundred thousand), which is relevant to the amendment. The focality score penalizes the first case but not the second; ISDIT penalizes both. From the viewpoint of focal translations, which should reflect the amendment contents, penalizing both unnecessary-modification errors and amended-phrase translation errors is preferable.
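ISDIT can likewise be sketched compactly. Two points below are our assumptions rather than the paper's specification: i-grams for i = 1..n are pooled, and the two factors are combined by a simple product (the paper leaves the best weighting of the two factors to future work):

```python
from collections import Counter

def pooled_ngrams(tokens, n=4):
    # all i-grams for i = 1..n, with counts
    return Counter(tuple(tokens[j:j + i])
                   for i in range(1, n + 1)
                   for j in range(len(tokens) - i + 1))

def isdit(hyp, pre_trans, post_trans, n=4):
    """Recall of the n-grams kept across the amendment, combined with the
    n-gram precision of the output against the post-amendment reference."""
    hyp_c = pooled_ngrams(hyp, n)
    post_c = pooled_ngrams(post_trans, n)
    cn = pooled_ngrams(pre_trans, n) & post_c  # n-grams kept by the amendment
    rec_den = sum(cn.values())
    rec = sum((hyp_c & cn).values()) / rec_den if rec_den else 1.0
    prec_den = sum(hyp_c.values())
    prec = sum((hyp_c & post_c).values()) / prec_den if prec_den else 0.0
    return rec * prec  # assumed equal-weight combination

pre = "a request requires eight hundred thousand signatures".split()
post = "a request requires four hundred thousand signatures".split()
case1 = "a petition requires four hundred thousand signatures".split()  # unfocal
case2 = "a request requires forty hundred thousand signatures".split()  # mistranslated
```

Both toy cases score below 1.0 under `isdit`, whereas a recall-only focality score would not penalize `case2`, mirroring the Table 2 discussion.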

Focal Differential Translation Corpus
As discussed in Section 3.3, the differential translation corpus compiled by Kozakai et al. (2017) includes non-focal examples. To provide a fairer evaluation, we compiled a new corpus that secures the focality of every translation example. We applied the following instructions for the corpus compilation:
1. Compile the versions of statutes provided in JLT;
2. Compile those provided in e-LAWS;
3. Compile statutes whose JLT version lags behind their e-LAWS version, and map the old and new statutes.

[Table 2: Examples with their ISDIT and focality (Foc.) scores.
W_PrT: "A request for recall requires joint signatures of more than eight hundred thousand people."
W_PoT: "A request for recall requires joint signatures of more than four hundred thousand people."
Case 1 (ISDIT 0.82, Foc. 0.70): "A petition for recall requires joint signatures of more than four hundred thousand people."
Case 2 (ISDIT 0.85, Foc. 1.00): "A request for recall requires joint signatures of more than forty hundred thousand people."]


Experiment
We experimentally evaluated the machine translation methods with our new resources.

Outline
For the training data, we mixed two bilingual statutory-sentence corpora. One was made by Kozakai et al. (2017) from JLT and consists of 158,928 sentence pairs from 407 statutes. We compiled the other from the statutes in JLT collected in Step 1 in Section 4.2; it consists of 232,830 sentence pairs from 462 statutes. We split our differential translation corpus into development data and test data by statute. The development and test data consisted of 745 examples from 30 amendments and 738 examples from 32 amendments, respectively.
We used a Transformer (Vaswani et al., 2017) for the NMT model under the following settings: six encoder/decoder hidden layers, eight self-attention heads, 512 hidden vectors, a batch size of eight, and an input sequence length of 256. We implemented the training and prediction codes based on the TensorFlow official model. We used SentencePiece (Kudo and Richardson, 2018) as the tokenizer and set the vocabulary size to 8,192. We chose a dropout rate of 0.1 for training, which is the default setting of the official Transformer implementation. In the prediction phase, we executed the model with two dropout rates, 0.0 and 0.1, where a 0.0 dropout means that no dropout was applied. We investigated the optimal number of iterations from {100k, 200k, …, 2,000k} using the development data.
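The reported hyperparameters, collected into a single config sketch for reference (the key names are ours and do not correspond to a specific framework's API; the values are those stated above):

```python
# Transformer NMT settings as reported in this section; key names are
# illustrative, not tied to any particular framework.
TRANSFORMER_CONFIG = {
    "num_hidden_layers": 6,          # encoder/decoder hidden layers
    "num_attention_heads": 8,        # self-attention heads
    "hidden_size": 512,              # hidden vectors
    "batch_size": 8,
    "max_input_length": 256,         # input sequence length
    "vocab_size": 8192,              # SentencePiece vocabulary
    "train_dropout": 0.1,
    "predict_dropouts": (0.0, 0.1),  # 0.1 enables MC dropout at prediction
}
```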
The template-aware SMT settings were as follows: GIZA++ (Och and Ney, 2005) for word alignment, SRILM (Stolcke, 2002) for language model generation, and Moses (Koehn et al., 2007) for decoding. We used MeCab (Kudo et al., 2004) as the Japanese tokenizer.

Results
[Table 6: Translation outputs for an example whose replaced expression is divided in translation.
Reference: "The systems listed below and a system to ensure the appropriateness of the operations of a group consisting of NHK and its subsidiary companies"
Yamakoshi: "The following systems and any other system to ensure the appropriateness of the operations of the group comprised of NHK and its subsidiary company:"
Kozakai: "A system to ensure the appropriateness of the operations of the group forming the following systems and any other association and its subsidiary company"]

Template-aware SMT and MC dropout were also both effective. One finding that differs from their report is that using the Koehn model generally worked more effectively than the Kozakai model. For our ISDIT metric, the combination methods of Yamakoshi et al. (2020) outperformed the naive template-aware SMT methods.

Discussion
First, we identified the characteristics of ISDIT. The plots in Fig. 4 show the focality and ISDIT scores of the Transformer + Kozakai model + MC dropout method (hereinafter, the "Yamakoshi method") for each translation example. The focality score of every example is higher than or equal to its ISDIT score. This result is natural because the two metrics share the n-gram recall calculation, and ISDIT introduces an n-gram precision factor that is more severe than the redundancy penalty in the focality score. We can observe many examples that have high focality scores but low ISDIT scores. Table 5 shows such an example, for which the Yamakoshi method's output received a focality score of 100.0 and an ISDIT score of 39.74. In this example, the system failed to translate the Japanese expression denoting a "fire-related disaster." This mistake greatly changed the system output from the reference, which led to the low ISDIT score. On the other hand, since the expressions shared by W_PrT and W_PoT were retained in the system output with no redundant generation, it received the maximum focality score. Table 4 shows the correlation coefficients among the evaluation metrics. ISDIT and the focality score have a high correlation coefficient of 0.927. ISDIT also has a strong relationship with BLEU, at 0.960. The high coefficients among them seem to come from a shared calculation strategy that utilizes the n-gram match rate.
Next, we conducted a short qualitative analysis of our corpus. Table 6 shows a translation example in which "kyokai" is replaced with "tsugi ni kakageru taisei sonotano kyokai." Its translation is divided into two parts, "the systems listed below and" and "NHK," which commonly happens in Japanese partial amendments. The Kozakai method (and likewise the Koehn method) cannot cope with this kind of example: they put the entire translation of the changed expression in W_PoO at the position where that expression appears in W_PrT.
Another tricky point in this case is the translation of "kyokai," which generally means "association" but here denotes NHK (Japan Broadcasting Corporation). The Kozakai method failed to translate this word appropriately, possibly because it did not use the context of the translation target. On the other hand, the Yamakoshi method successfully placed the new expression and adequately translated "kyokai." Its success reflects its use of the whole sentence in the translation.

Summary
We proposed a better evaluation scheme for Japanese partial amendment translations: we developed a new metric, ISDIT, that assesses the translation quality of both the changed and unchanged expressions, and we compiled a corpus that secures the focality of its translations. Using our corpus, we observed the characteristics of the translation methods and of ISDIT.
In future work, we will first increase the size of our corpus so that it can be used for neural network training, considering the publicization of the corpus. Second, we will identify the best weighting of the two factors in ISDIT. Third, we will consider applications of ISDIT to other domains of version-controlled documents, such as contracts, technical documents, and product manuals.