Yo Sato

2024

Disambiguating Homographs and Homophones Simultaneously: A Regrouping Method for Japanese
Yo Sato
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present a method that re-groups surface forms into clusters representing synonyms, and helps disambiguate homographs as well as homophones. The method is applied post hoc to trained contextual word embeddings. It is beneficial for languages where both homographs and homophones abound, which compromise the efficiency of language models and cause the underestimation problem in evaluation. Taking Japanese as an example, we evaluate how accurate such disambiguation can be, and how much the underestimation can be mitigated.

2020


Homonym normalisation by word sense clustering: a case in Japanese
Yo Sato | Kevin Heffernan
Proceedings of the 28th International Conference on Computational Linguistics

This work presents a method of word sense clustering that differentiates homonyms and merges homophones, taking Japanese as an example, where orthographical variation causes problems for language processing. It uses contextualised embeddings (BERT) to cluster tokens into distinct sense groups, and we use these groups to normalise synonymous instances to a single representative form. We see the benefit of this normalisation in language modelling, as well as in transliteration.


Dialect Clustering with Character-Based Metrics: in Search of the Boundary of Language and Dialect
Yo Sato | Kevin Heffernan
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present in this work a universal, character-based method for representing sentences so that one can calculate the distance between any two sentences. With a small alphabet, it can function as a proxy for phonemes, and as one of its main uses, we carry out dialect clustering: we cluster a mixed corpus of dialects and sub-languages into sub-groups and see if they coincide with the conventional boundaries of dialects and sub-languages. Using data from multiple Japanese dialects and multiple Slavic languages, we report how well each group clusters, as a partial response to the question of what separates languages from dialects.

Lexicalising Word Order Constraints for Implemented Linearisation Grammar
Yo Sato
Student Research Workshop

Co-authors

Yusuke Miyao 1

Wai Lok Tam 1

Jun’ichi Tsujii 1
