Japanese Realistic Textual Entailment Corpus

Yuta Hayashibe


Abstract
We construct a textual entailment (TE) corpus for the Japanese language with the following three characteristics. First, the corpus consists of realistic sentences; that is, all sentences are spontaneous or nearly so. It requires no manual writing, which causes hidden biases. Second, the corpus contains adversarial examples: we collect challenging examples that cannot be solved by a recent pre-trained language model. Third, the corpus contains explanations for some of the non-entailment labels. We perform reasoning annotation, in which annotators are asked to mark which tokens in the hypotheses are the reason for the assigned label. This makes it easy to validate the annotation and to analyze system errors. The resulting corpus consists of 48,000 realistic Japanese examples. It is the largest among publicly available Japanese TE corpora. Additionally, to our knowledge, it is the first Japanese TE corpus that includes reasons for the annotation. We plan to distribute this corpus to the NLP community at the time of publication.
Anthology ID:
2020.lrec-1.843
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6827–6834
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.843
Cite (ACL):
Yuta Hayashibe. 2020. Japanese Realistic Textual Entailment Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6827–6834, Marseille, France. European Language Resources Association.
Cite (Informal):
Japanese Realistic Textual Entailment Corpus (Hayashibe, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.843.pdf