Morpheme Induction for Emergent Language

Brendon Boldt; David R. Mortensen

doi:10.18653/v1/2025.emnlp-main.1284

Abstract

We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR’s performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
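
The four steps named in the abstract (Count, Select, Ablate, Repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `induce_morphemes`, the restriction to single-token forms and single-atom meanings, the weight (joint probability times pointwise mutual information) and the stopping rule are all assumptions made for illustration.

```python
import math
from collections import Counter


def induce_morphemes(corpus, max_morphemes=10):
    """Greedy count/select/ablate/repeat loop (illustrative sketch only).

    corpus: list of (tokens, atoms) pairs, where tokens is a list of form
    tokens and atoms is a set of meaning atoms. Both the data layout and
    the weighting below are assumptions, not the paper's definitions.
    """
    # Work on copies so the caller's corpus is left untouched.
    records = [(list(tokens), set(atoms)) for tokens, atoms in corpus]
    n = len(records)
    morphemes = []

    for _ in range(max_morphemes):
        # Count: per-record occurrence counts of forms, meanings, and pairs.
        form_counts, meaning_counts, pair_counts = Counter(), Counter(), Counter()
        for tokens, atoms in records:
            forms = set(tokens)
            form_counts.update(forms)
            meaning_counts.update(atoms)
            pair_counts.update((f, m) for f in forms for m in atoms)

        # Select: highest-weighted pair, with weight p(f, m) * PMI(f, m).
        # Sorting makes tie-breaking deterministic.
        best, best_weight = None, 0.0
        for (f, m), c in sorted(pair_counts.items()):
            pmi = math.log(c * n / (form_counts[f] * meaning_counts[m]))
            weight = (c / n) * pmi
            if weight > best_weight:
                best, best_weight = (f, m), weight
        if best is None:  # no positively associated pair remains
            break
        morphemes.append((best[0], best[1], best_weight))

        # Ablate: remove the selected form and meaning wherever they co-occur.
        f, m = best
        for tokens, atoms in records:
            if f in tokens and m in atoms:
                tokens[:] = [t for t in tokens if t != f]
                atoms.discard(m)
        # Repeat: the loop recounts on the ablated corpus.

    return morphemes


# Hypothetical toy corpus: two-token utterances describing color and shape.
corpus = [
    (["a", "x"], {"red", "circle"}),
    (["a", "y"], {"red", "square"}),
    (["b", "x"], {"blue", "circle"}),
    (["b", "y"], {"blue", "square"}),
]
pairs = [(form, meaning) for form, meaning, _ in induce_morphemes(corpus)]
print(pairs)
```

On this toy corpus the sketch pairs `a` with `red`, `b` with `blue`, `x` with `circle` and `y` with `square`. Ablation is what keeps later rounds from re-selecting a form or meaning that an earlier morpheme has already explained.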

Anthology ID: 2025.emnlp-main.1284
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 25264–25279
URL: https://aclanthology.org/2025.emnlp-main.1284/
DOI: 10.18653/v1/2025.emnlp-main.1284
Cite (ACL): Brendon Boldt and David R. Mortensen. 2025. Morpheme Induction for Emergent Language. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25264–25279, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Morpheme Induction for Emergent Language (Boldt & Mortensen, EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.1284.pdf
Checklist: 2025.emnlp-main.1284.checklist.pdf