@inproceedings{kobayashi-etal-2025-leveraging,
title = "Leveraging High-Resource {E}nglish Corpora for Cross-lingual Domain Adaptation in Low-Resource {J}apanese Medicine via Continued Pre-training",
author = "Kobayashi, Kazuma and
Wan, Zhen and
Cheng, Fei and
Tsuta, Yuma and
Zhao, Xin and
Jiang, Junfeng and
Huang, Jiahao and
Huang, Zhiyi and
Oda, Yusuke and
Yokota, Rio and
Arase, Yuki and
Kawahara, Daisuke and
Aizawa, Akiko and
Kurohashi, Sadao",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.615/",
pages = "11469--11488",
ISBN = "979-8-89176-335-7",
abstract = "Limited low-resource language corpora in professional domains like medicine hinder cross-lingual domain adaptation of pre-trained large language models (PLMs). While abundant English medical corpora could complement this scarcity, the effective mixture of English and target language, including machine-translated content, remains underexplored. We examined how linguistic features (e.g., token sizes and language proportions) affect performance on a Japanese{--}English medical knowledge benchmark. Through continued pre-training of a bilingual PLM on multilingual corpora with varying proportions of English and Japanese texts (both original and machine-translated), we analyzed correlations between linguistic features and fine-grained task performance. Our findings suggest a practical approach to optimizing multilingual corpora for cross-lingual domain adaptation, which requires leveraging specialized knowledge from English corpora while ensuring sufficient coverage of language-specific expressions in a target language (Japanese). Such insights will contribute to the development of multilingual models that effectively leverage English-language resources in various professional domains with low-resource languages."
}
Markdown (Informal):
[Leveraging High-Resource English Corpora for Cross-lingual Domain Adaptation in Low-Resource Japanese Medicine via Continued Pre-training](https://aclanthology.org/2025.findings-emnlp.615/) (Kobayashi et al., Findings 2025)

ACL:
Kazuma Kobayashi, Zhen Wan, Fei Cheng, Yuma Tsuta, Xin Zhao, Junfeng Jiang, Jiahao Huang, Zhiyi Huang, Yusuke Oda, Rio Yokota, Yuki Arase, Daisuke Kawahara, Akiko Aizawa, and Sadao Kurohashi. 2025. Leveraging High-Resource English Corpora for Cross-lingual Domain Adaptation in Low-Resource Japanese Medicine via Continued Pre-training. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11469–11488, Suzhou, China. Association for Computational Linguistics.
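
The abstract's core analysis — correlating linguistic features of each continued-pre-training mixture (e.g., language proportions) with fine-grained benchmark performance — can be illustrated with a minimal sketch. All numbers below are hypothetical placeholders, not results from the paper; only the analysis pattern (a Pearson correlation between one mixture feature and task accuracy across training runs) follows the abstract.

```python
# Minimal sketch of the correlation analysis described in the abstract:
# relate a linguistic feature of each continued-pre-training corpus mixture
# (here, the proportion of Japanese text) to downstream accuracy on a
# Japanese-English medical knowledge benchmark. Values are hypothetical.
from scipy.stats import pearsonr

# One entry per continued-pre-training run: (ja_proportion, accuracy).
runs = [
    (0.0, 0.41),   # English-only mixture
    (0.2, 0.47),
    (0.5, 0.52),
    (0.8, 0.50),
    (1.0, 0.44),   # Japanese-only mixture
]

ja_proportion = [r[0] for r in runs]
accuracy = [r[1] for r in runs]

# Pearson correlation between the mixture feature and task performance.
r, p = pearsonr(ja_proportion, accuracy)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```

The same pattern extends to other features the abstract mentions, such as total token counts or the share of machine-translated text, by swapping in a different feature column per run.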