So Miyagawa

2025

RAG-Enhanced Neural Machine Translation of Ancient Egyptian Text: A Case Study of THOTH AI
So Miyagawa
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

This paper demonstrates how Retrieval-Augmented Generation (RAG) significantly improves translation accuracy for Middle Egyptian, a historically rich but low-resource language. We integrate a vectorized Coptic-Egyptian lexicon and morphological database into a specialized tool called THOTH AI. By supplying domain-specific linguistic knowledge to Large Language Models (LLMs) like Claude 3.5 Sonnet, our system yields translations that are more contextually grounded and semantically precise. We compare THOTH AI against various mainstream models, including Gemini 2.0, DeepSeek R1, and GPT variants, evaluating performance with BLEU, SacreBLEU, METEOR, ROUGE, and chrF. Experimental results on the coronation decree of Thutmose I (18th Dynasty) show that THOTH AI’s RAG approach provides the most accurate translations, highlighting the critical value of domain knowledge in natural language processing for ancient, specialized corpora. Furthermore, we discuss how our method benefits e-learning, digital humanities, and language revitalization efforts, bridging the gap between purely data-driven approaches and expert-driven resources in historical linguistics.

Automatic Detection of Coptic Text Reuse: Applying Coptic Wordnet to Intertextuality Studies in Selected Coptic Monastic Writings
So Miyagawa | Luis Morgado da Costa | Laura Slaughter | Heike Behlmer
Proceedings of the 13th Global Wordnet Conference

Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Emily Öhman | Yuri Bizzoni | So Miyagawa | Khalid Alnajjar
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

2024

Enhancing Neural Machine Translation for Ainu-Japanese: A Comprehensive Study on the Impact of Domain and Dialect Integration
Ryo Igarashi | So Miyagawa
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Neural Machine Translation (NMT) has revolutionized language translation, yet significant challenges persist for low-resource languages, particularly those with high dialectal variation and limited standardization. This comprehensive study focuses on the Ainu language, a critically endangered indigenous language of northern Japan, which epitomizes these challenges. We address the limitations of previous research through two primary strategies: (1) extensive corpus expansion encompassing diverse domains and dialects, and (2) development of innovative methods to incorporate dialect and domain information directly into the translation process. Our approach yielded substantial improvements in translation quality, with BLEU scores increasing from 32.90 to 39.06 (+6.16) for Japanese → Ainu and from 10.45 to 31.83 (+21.38) for Ainu → Japanese. Through rigorous experimentation and analysis, we demonstrate the crucial importance of integrating linguistic variation information in NMT systems for languages characterized by high diversity and limited resources. Our findings have broad implications for improving machine translation for other low-resource languages, potentially advancing preservation and revitalization efforts for endangered languages worldwide.

Language Atlas of Japanese and Ryukyuan (LAJaR): A Linguistic Typology Database for Endangered Japonic Languages
Kanji Kato | So Miyagawa | Natsuko Nakagawa
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

LAJaR (Language Atlas of Japanese and Ryukyuan) is a linguistic typology database focusing on micro-variation in the Japonic (Japanese and Ryukyuan) languages. This paper reports on the design and progress of this ongoing database project and presents a case study that uses the database to examine zero copulas across the Japonic languages.

Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Emily Öhman | So Miyagawa | Khalid Alnajjar | Yuri Bizzoni
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Exploring Similarity Measures and Intertextuality in Vedic Sanskrit Literature
So Miyagawa | Yuki Kyogoku | Yuzuki Tsukagoshi | Kyoko Amano
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

This paper examines semantic similarity and intertextuality in selected texts from the Vedic Sanskrit corpus, specifically the Maitrāyaṇī Saṃhitā (MS) and Kāṭhaka-Saṃhitā (KS). Three computational methods are employed: Word2Vec for word embeddings, the stylo package for stylometric analysis, and TRACER for text reuse detection. By comparing various sections of the texts at different granularities, patterns of similarity and structural alignment are uncovered, providing insights into textual relationships and chronology. Word embeddings capture semantic similarities, while stylometric analysis reveals clusters and components that differentiate the texts. TRACER identifies parallel passages, indicating probable instances of text reuse. The computational analysis corroborates previous philological studies, suggesting a shared period of composition between MS.1.9 and MS.1.7. This research highlights the potential of computational methods in studying ancient Sanskrit literature, complementing traditional approaches. The agreement among the methods strengthens the validity of the findings, and the visualizations offer a nuanced understanding of textual connections. The study demonstrates that smaller chunk sizes are more effective for detecting intertextual parallels, showcasing the power of these techniques in unraveling the complexities of ancient texts.

Assessing Large Language Models in Translating Coptic and Ancient Greek Ostraca
Audric-Charles Wannaz | So Miyagawa
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

The advent of Large Language Models (LLMs) substantially raised the quality and lowered the cost of Machine Translation (MT). Can scholars working with ancient languages draw benefits from this new technology? More specifically, can current MT facilitate multilingual digital papyrology? To answer this question, we evaluate 9 LLMs in the task of MT with 4 Coptic and 4 Ancient Greek ostraca into English using 6 NLP metrics. We argue that some models have already reached a performance apt to assist human experts. As can be expected from the difference in training corpus size, all models seem to perform better with Ancient Greek than with Coptic, where hallucinations are markedly more common. In the Coptic texts, the specialised Coptic Translator (CT) competes closely with Claude 3 Opus for the rank of most promising tool, while Claude 3 Opus and GPT-4o compete for the same position in the Ancient Greek texts. We argue that MT now substantially heightens the incentive to work on multilingual corpora. This could have a positive and long-lasting effect on Classics and Egyptology and help reduce the historical bias in translation availability. In closing, we reflect upon the need to meet AI-generated translations with an adequate critical stance.

2023

Machine Translation for Highly Low-Resource Language: A Case Study of Ainu, a Critically Endangered Indigenous Language in Northern Japan
So Miyagawa
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

This paper explores the potential of Machine Translation (MT) in preserving and revitalizing Ainu, an indigenous language of Japan classified as critically endangered by UNESCO. Through leveraging Marian MT, an open-source Neural Machine Translation framework, this study addresses the challenging linguistic features of Ainu and the limitations of available resources. The research implemented a meticulous methodology involving rigorous preprocessing of data, prudent training of the model, and robust evaluation using the SacreBLEU metric. The findings underscore the system’s efficacy, achieving a SacreBLEU score of 32.90 for Japanese to Ainu translation. This promising result highlights the capacity of MT systems to support language preservation and aligns with recent research emphasizing the potential of computational techniques for low-resource languages. The paper concludes by affirming the significant role of MT in the broader context of language preservation, serving as a crucial tool in the fight against language extinction. The study paves the way for future research to harness advanced MT techniques and develop more sophisticated models for endangered languages.

Building Okinawan Lexicon Resource for Language Reclamation/Revitalization and Natural Language Processing Tasks such as Universal Dependencies Treebanking
So Miyagawa | Kanji Kato | Miho Zlazli | Salvatore Carlino | Seira Machida
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

The Open Multilingual Online Lexicon of Okinawan (OMOLO) project aims to create an accessible, user-friendly digital lexicon for the endangered Okinawan language using digital humanities tools and methodologies. The multilingual web application, available in Japanese, English, Portuguese, and Spanish, will benefit language learners, researchers, and the Okinawan community in Japan and diaspora countries such as the U.S., Brazil, and Peru. The project also lays the foundation for an Okinawan UD Treebank, which will support computational analysis and the development of language technology tools such as parsers, machine translation systems, and speech recognition software. The OMOLO project demonstrates the potential of computational linguistics in preserving and revitalizing endangered languages and can serve as a blueprint for similar initiatives.

Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Emily Öhman | Flammie Pirinen | Khalid Alnajjar | So Miyagawa | Yuri Bizzoni | Niko Partanen | Jack Rueter
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

2019

Despite the increasing availability of wordnets for ancient languages, such as Ancient Greek and Latin, gaps remain in the coverage of less studied languages of antiquity. This paper reports on the construction and evaluation of a new wordnet for Coptic, the language of Late Roman, Byzantine and Early Islamic Egypt in the first millennium CE. We present our approach to constructing the wordnet, which uses multilingual Coptic dictionaries and wordnets for five different languages. We further discuss the results of this effort and outline our ongoing and future work.
