Wikification for Scriptio Continua

Yugo Murawaki; Shinsuke Mori

Abstract

Japanese employs scriptio continua, a writing system without spaces, which complicates the first step of an NLP pipeline. Word segmentation is widely used in Japanese language processing, and lexical knowledge is crucial for reliably identifying words in text. Although external lexical resources such as Wikipedia are potentially useful, segmentation mismatch prevents them from being straightforwardly incorporated into the word segmentation task. If we intentionally violate segmentation standards by incorporating such resources directly, quantitative evaluation will no longer be feasible. To address this problem, we propose to define a separate task that directly links given texts to an external resource, that is, wikification in the case of Wikipedia. By doing so, we can circumvent segmentation mismatch, which may not necessarily be important for downstream applications. As a first step toward realizing this idea, we design the task of Japanese wikification and construct wikification corpora. We annotated subsets of the Balanced Corpus of Contemporary Written Japanese plus Twitter short messages. We also implement a simple wikifier and investigate its performance on these corpora.

Anthology ID: L16-1214
Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month: May
Year: 2016
Address: Portorož, Slovenia
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 1346–1351
URL: https://aclanthology.org/L16-1214/
Cite (ACL): Yugo Murawaki and Shinsuke Mori. 2016. Wikification for Scriptio Continua. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1346–1351, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal): Wikification for Scriptio Continua (Murawaki & Mori, LREC 2016)
PDF: https://aclanthology.org/L16-1214.pdf
