Developing a Dataset of Overridden Information in Wikipedia

Masatoshi Tsuchiya, Yasutaka Yokoi


Abstract
This paper proposes a new task: detecting information override. Because not all information on the Web is updated in a timely manner, information that has been overridden by another information source needs to be discarded. The task is formalized as a binary classification problem: determining whether a reference sentence has overridden a target sentence. To investigate this task, this paper describes a procedure for constructing a dataset of overridden information by collecting sentence pairs from the differences between two versions of Wikipedia. The dataset under development shows that the old version of Wikipedia contains much overridden information and that detecting information override is necessary.
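As a rough illustration of collecting sentence pairs from the difference between two versions of a text, the sketch below aligns two sentence lists with Python's standard `difflib` and keeps the replaced sentences as candidate (target, reference) pairs. This is only a minimal sketch under stated assumptions, not the authors' construction procedure: the function name `candidate_pairs`, the positional pairing of replaced sentences, and the example sentences are all hypothetical, and deciding whether a pair is actually an override would still require labeling or a classifier.

```python
import difflib


def candidate_pairs(old_sentences, new_sentences):
    """Return (target, reference) pairs for sentences replaced between versions.

    Hypothetical helper: the target comes from the old version and the
    reference from the new version. Pure insertions and deletions are ignored.
    """
    matcher = difflib.SequenceMatcher(
        a=old_sentences, b=new_sentences, autojunk=False
    )
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "replace":
            continue
        # Assumption: replaced sentences are paired positionally; unmatched
        # extras on the longer side are dropped by zip.
        for target, reference in zip(old_sentences[i1:i2], new_sentences[j1:j2]):
            pairs.append((target, reference))
    return pairs


# Invented example sentences, for illustration only.
old = [
    "The city has a population of 1,000.",
    "It lies on a river.",
    "The mayor is A.",
]
new = [
    "The city has a population of 1,200.",
    "It lies on a river.",
    "The mayor is B.",
]
pairs = candidate_pairs(old, new)
```

Each resulting pair would then be a candidate instance for the binary classification task: does the reference sentence override the target sentence or not.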
Anthology ID:
2022.lrec-1.601
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5601–5608
URL:
https://aclanthology.org/2022.lrec-1.601
Cite (ACL):
Masatoshi Tsuchiya and Yasutaka Yokoi. 2022. Developing a Dataset of Overridden Information in Wikipedia. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5601–5608, Marseille, France. European Language Resources Association.
Cite (Informal):
Developing a Dataset of Overridden Information in Wikipedia (Tsuchiya & Yokoi, LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.601.pdf