2024
Long Unit Word Tokenization and Bunsetsu Segmentation of Historical Japanese
Hiroaki Ozaki | Kanako Komiya | Masayuki Asahara | Toshinobu Ogiso
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)
In Japanese, the natural minimal phrase of a sentence is the “bunsetsu”, which serves native speakers as a more natural boundary than the word; grammatical analysis in Japanese linguistics therefore commonly operates on bunsetsu units. In contrast, because Japanese does not have delimiters between words, there are two major categories of word definition, namely, Short Unit Words (SUWs) and Long Unit Words (LUWs). Though a SUW dictionary is available, an LUW dictionary is not. Hence, this study focuses on providing a deep learning-based (or LLM-based) bunsetsu and Long Unit Word analyzer for the Heian period (AD 794-1185) and evaluating its performance. We model the parser as a transformer-based joint sequential labeling model, which combines a bunsetsu BI tag, an LUW BI tag, and an LUW Part-of-Speech (POS) tag for each SUW token. We train our models on corpora of each period, including contemporary and historical Japanese. The results range from 0.976 to 0.996 in F1 value for both bunsetsu and LUW reconstruction, indicating that our models achieve performance comparable to models for a contemporary Japanese corpus. Statistical analysis and a diachronic case study suggest that the estimation of bunsetsu could be influenced by the grammaticalization of morphemes.
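To make the joint labeling scheme concrete, the following minimal Python sketch shows how bunsetsu and LUW units could be rebuilt from per-SUW labels. The tuple layout `(bunsetsu BI tag, LUW BI tag, LUW POS)`, the function names, and the contemporary example sentence are illustrative assumptions, not the authors' implementation or data format.

```python
from typing import List, Tuple

# Assumed label layout per SUW token: (bunsetsu BI tag, LUW BI tag, LUW POS).
JointLabel = Tuple[str, str, str]


def merge_by_bi_tags(tokens: List[str], tags: List[str]) -> List[Tuple[int, str]]:
    """Concatenate SUW surfaces into larger units; return (start index, surface)."""
    units: List[Tuple[int, str]] = []
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag == "B" or not units:
            # "B" opens a new unit; a leading "I" is also treated as a start.
            units.append((i, token))
        else:
            start, surface = units[-1]
            units[-1] = (start, surface + token)
    return units


def decode(tokens: List[str], labels: List[JointLabel]):
    """Rebuild bunsetsu strings and (LUW surface, POS) pairs from joint labels."""
    bunsetsu = [s for _, s in merge_by_bi_tags(tokens, [l[0] for l in labels])]
    # Hypothetical convention: an LUW takes the POS stored on its first SUW.
    luws = [(s, labels[i][2])
            for i, s in merge_by_bi_tags(tokens, [l[1] for l in labels])]
    return bunsetsu, luws


# Illustrative contemporary example (not taken from the paper's corpora).
tokens = ["国立", "国語", "研究", "所", "で", "働く"]
labels = [("B", "B", "名詞"), ("I", "I", "名詞"), ("I", "I", "名詞"),
          ("I", "I", "名詞"), ("I", "B", "助詞"), ("B", "B", "動詞")]
bunsetsu, luws = decode(tokens, labels)
print(bunsetsu)  # ['国立国語研究所で', '働く']
print(luws)      # [('国立国語研究所', '名詞'), ('で', '助詞'), ('働く', '動詞')]
```

Because both tag sequences are read over the same SUW tokens, a single pass of a tagger that predicts the combined label is enough to recover both segmentations.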
Evaluating Language Models in Location Referring Expression Extraction from Early Modern and Contemporary Japanese Texts
Ayuki Katayama | Yusuke Sakai | Shohei Higashiyama | Hiroki Ouchi | Ayano Takeuchi | Ryo Bando | Yuta Hashimoto | Toshinobu Ogiso | Taro Watanabe
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Automatic extraction of geographic information, including Location Referring Expressions (LREs), can aid humanities research in analyzing large collections of historical texts. In this study, to investigate how accurately pretrained Transformer language models (LMs) can extract LREs from historical texts, we evaluate two representative types of LMs, namely, masked language models and causal language models, using early modern and contemporary Japanese datasets. Our experimental results demonstrate the potential of contemporary LMs for historical texts, but also suggest the need for further model enhancement, such as pretraining on historical texts.
2022
Infinite SCAN: An Infinite Model of Diachronic Semantic Change
Seiichi Inoue | Mamoru Komachi | Toshinobu Ogiso | Hiroya Takamura | Daichi Mochihashi
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
In this study, we propose a Bayesian model that can jointly estimate the number of senses of words and their changes through time. The model combines a dynamic topic model on Gaussian Markov random fields with a logistic stick-breaking process that realizes a Dirichlet process. In the experiments, we evaluated the proposed model in terms of interpretability, accuracy in estimating the number of senses, and tracking their changes using both artificial data and real data. We quantitatively verified that the model behaves as expected through evaluation using artificial data. Using the CCOHA corpus, we showed that our model outperforms the baseline model and investigated the semantic changes of several well-known target words.
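As a rough illustration of the logistic stick-breaking idea mentioned in the abstract, the sketch below turns a sequence of real values into sense probabilities by repeatedly breaking off a sigmoid-sized fraction of the remaining "stick". It is a finite truncation for illustration only: the function name and the truncation are assumptions, and how the paper models the real-valued inputs over time is not shown here.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_stick_breaking(etas: List[float]) -> List[float]:
    """Map K-1 real values to K probabilities that sum to 1.

    Weight k is sigmoid(eta_k) times the stick left over by earlier breaks;
    the final weight absorbs whatever remains (assumed finite truncation).
    """
    weights: List[float] = []
    remaining = 1.0
    for eta in etas:
        fraction = sigmoid(eta)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)
    return weights


weights = logistic_stick_breaking([0.0, 0.0])
print(weights)  # [0.5, 0.25, 0.25]
```

Since the number of breaks is not fixed in advance in the untruncated process, the number of senses with non-negligible weight can be inferred from data rather than set by hand.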
2021
A Comprehensive Analysis of PMI-based Models for Measuring Semantic Differences
Taichi Aida | Mamoru Komachi | Toshinobu Ogiso | Hiroya Takamura | Daichi Mochihashi
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation
2020
KOTONOHA: A Corpus Concordance System for Skewer-Searching NINJAL Corpora
Teruaki Oka | Yuichi Ishimoto | Yutaka Yagi | Takenori Nakamura | Masayuki Asahara | Kikuo Maekawa | Toshinobu Ogiso | Hanae Koiso | Kumiko Sakoda | Nobuko Kibe
Proceedings of the Twelfth Language Resources and Evaluation Conference
The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan), has developed several types of corpora. For each corpus NINJAL provided an online search environment, ‘Chunagon’, which is a morphological-information-annotation-based concordance system made publicly available in 2011. NINJAL has now provided a skewer-search system ‘Kotonoha’ based on the ‘Chunagon’ systems. This system enables querying of multiple corpora by certain categories, such as register type and period.
2012
UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese
Toshinobu Ogiso | Mamoru Komachi | Yasuharu Den | Yuji Matsumoto
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
In order to construct an annotated diachronic corpus of Japanese, we propose to create a new dictionary for morphological analysis of Early Middle Japanese (Classical Japanese) based on UniDic, a dictionary for Contemporary Japanese. Differences between the Early Middle Japanese and Contemporary Japanese, which prevent a naïve adaptation of UniDic to Early Middle Japanese, are found at the levels of lexicon, morphology, grammar, orthography and pronunciation. In order to overcome these problems, we extended dictionary entries and created a training corpus of Early Middle Japanese to adapt UniDic for Contemporary Japanese to Early Middle Japanese. Experimental results show that the proposed UniDic-EMJ, a new dictionary for Early Middle Japanese, achieves as high accuracy (97%) as needed for the linguistic research on lexicon and grammar in Japanese classical text analysis.
2011
Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Teruaki Oka | Mamoru Komachi | Toshinobu Ogiso | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing
2010
Design, Compilation, and Preliminary Analyses of Balanced Corpus of Contemporary Written Japanese
Kikuo Maekawa | Makoto Yamazaki | Takehiko Maruyama | Masaya Yamaguchi | Hideki Ogura | Wakako Kashino | Toshinobu Ogiso | Hanae Koiso | Yasuharu Den
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Compilation of a 100-million-word balanced corpus called the Balanced Corpus of Contemporary Written Japanese (BCCWJ) is underway at the National Institute for Japanese Language and Linguistics. The corpus covers a wide range of text genres, including books, magazines, newspapers, governmental white papers, textbooks, minutes of the National Diet, internet text (bulletin boards and blogs) and so forth, and, when possible, samples are drawn from rigidly defined statistical populations by means of random sampling. All texts are dually POS-analyzed based upon two different, but mutually related, definitions of word. Currently, more than 90 million words have been sampled and XML-annotated with respect to text structure and lexical and character information. A preliminary linear discriminant analysis of text genres using POS frequencies and sentence length showed that the genres could be classified with a correct identification rate of 88% for the samples of books, newspapers, white papers, and internet bulletin boards. When the samples of blogs were included in this data set, however, the identification rate went down to 68%, suggesting considerable variance in the blog texts in terms of textual register and style.
2008
A Proper Approach to Japanese Morphological Analysis: Dictionary, Model, and Evaluation
Yasuharu Den | Junpei Nakamura | Toshinobu Ogiso | Hideki Ogura
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper, we discuss lemma identification in Japanese morphological analysis, which is crucial for a proper formulation of morphological analysis that benefits not only NLP researchers but also corpus linguists. Since Japanese words often have variation in orthography and the vocabulary of Japanese consists of words of several different origins, it sometimes happens that more than one writing form corresponds to the same lemma and that a single writing form corresponds to two or more lemmas with different readings and/or meanings. The mapping from a writing form onto a lemma is important in linguistic analysis of corpora. The current study focuses on disambiguation of heteronyms, words with the same writing form but with different word forms. To resolve heteronym ambiguity, we make use of goshu information, the classification of words based on their origin. Founded on the fact that words of some goshu classes are more likely to combine into compound words than words of other classes, we employ a statistical model based on CRFs using goshu information. Experimental results show that the use of goshu information considerably improves the performance of heteronym disambiguation and lemma identification, suggesting that goshu information solves the lemma identification task very effectively.