SungHo Kim
2024
SEED: Semantic Knowledge Transfer for Language Model Adaptation to Materials Science
Yeachan Kim
|
Jun-Hyung Park
|
SungHo Kim
|
Juhyeong Park
|
Sangyun Kim
|
SangKeun Lee
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Materials science is an interdisciplinary field focused on studying and discovering materials around us. However, due to the vast space of materials, datasets in this field are typically scarce and have limited coverage. This inherent limitation makes current adaptation methods less effective when adapting pre-trained language models (PLMs) to materials science, as these methods rely heavily on the frequency information from limited downstream datasets. In this paper, we propose Semantic Knowledge Transfer (SEED), a novel vocabulary expansion method to adapt the pre-trained language models for materials science. The core strategy of SEED is to transfer the materials knowledge of lightweight embeddings into the PLMs. To this end, we introduce knowledge bridge networks, which learn to transfer the latent knowledge of the materials embeddings into ones compatible with PLMs. By expanding the embedding layer of PLMs with these transformed embeddings, PLMs can comprehensively understand the complex terminology associated with materials science. We conduct extensive experiments across a broad range of materials-related benchmarks. Comprehensive evaluation results convincingly demonstrate that SEED mitigates the mentioned limitations of previous adaptation methods, showcasing the efficacy of transferring embedding knowledge into PLMs.
KOMBO: Korean Character Representations Based on the Combination Rules of Subcharacters
SungHo Kim
|
Juhyeong Park
|
Yeachan Kim
|
SangKeun Lee
Findings of the Association for Computational Linguistics: ACL 2024
The Korean writing system, Hangeul, has a unique character representation rigidly following the invention principles recorded in Hunminjeongeum. However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which firstly brings the invention principles of Hangeul to represent character. Our proposed method, KOMBO, exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state-of-the-art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. Furthermore, extensive experiments demonstrate that our proposed method is suitable for comprehending the linguistic features of the Korean language. Consequently, we shed light on the superiority of using subcharacters over the typical subword-based approach for Korean PLMs. Our code is available at: https://github.com/SungHo3268/KOMBO.