fugashi, a Tool for Tokenizing Japanese in Python

Paul McCann


Abstract
Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, so tokenization is non-trivial, and while high-quality open source tokenizers exist, they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.
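To illustrate why text written without spaces makes tokenization non-trivial, the following is a minimal sketch of a greedy longest-match segmenter. It is a deliberately simplified stand-in and not the method used by MeCab or fugashi, which rely on a full dictionary and choose a lowest-cost path through a lattice of candidate words. The function name `longest_match_tokenize` and the toy vocabulary are hypothetical and exist only for this illustration.

```python
def longest_match_tokenize(text, vocabulary):
    """Greedy left-to-right segmentation (illustrative only).

    At each position, take the longest vocabulary entry that matches;
    if nothing matches, fall back to a single character.
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical toy vocabulary; a real dictionary has hundreds of thousands of entries.
TOY_VOCAB = {"東京", "京都", "東京都", "に", "住む"}

print(longest_match_tokenize("東京都に住む", TOY_VOCAB))
# With a smaller vocabulary the same characters are split differently.
print(longest_match_tokenize("東京都", {"東京", "京都"}))
```

The same three characters 東京都 come out as one token or two depending on what the dictionary contains, which is one reason practical tokenizers score competing segmentations rather than matching greedily.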
Anthology ID:
2020.nlposs-1.7
Volume:
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | NLPOSS
Publisher:
Association for Computational Linguistics
Pages:
44–51
URL:
https://aclanthology.org/2020.nlposs-1.7
DOI:
10.18653/v1/2020.nlposs-1.7
Cite (ACL):
Paul McCann. 2020. fugashi, a Tool for Tokenizing Japanese in Python. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 44–51, Online. Association for Computational Linguistics.
Cite (Informal):
fugashi, a Tool for Tokenizing Japanese in Python (McCann, NLPOSS 2020)
PDF:
https://aclanthology.org/2020.nlposs-1.7.pdf
Video:
https://slideslive.com/38939744
Code:
polm/fugashi