Johann-Mattis List - ACL Anthology

Johann-Mattis List

2026

Using Correspondence Patterns to Identify Irregular Words in Cognate Sets Through Leave-One-Out Validation
Frederic Blum | Johann-Mattis List
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)

Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.

2025

Using Cross-Linguistic Data Formats to Enhance the Annotation of Ancient Chinese Documents Written on Bamboo Slips
Michele Pulini | Johann-Mattis List
Proceedings of the Second Workshop on Ancient Language Processing

Ancient Chinese documents written on bam-boo slips more than 2000 years ago offer a rich resource for research in linguistics, paleogra-phy, and historiography. However, since most documents are only available in the form of scans, additional steps of analysis are needed to turn them into interactive digital editions, amenable both for manual and computational exploration. Here, we present a first attempt to establish a workflow for the annotation of an-cient bamboo slips. Based on a recently redis-covered dialogue on warfare, we illustrate how a digital edition amenable for manual and com-putational exploration can be created by inte-grating standards originally designed for cross-linguistic data collections.

From Isolates to Families: Using Neural Networks for Automated Language Affiliation
Frederic Blum | Steffen Herbold | Johann-Mattis List
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In historical linguistics, the affiliation of languages to a common language family is traditionally carried out using a complex workflow that relies on manually comparing individual languages. Large-scale standardized collections of multilingual wordlists and grammatical language structures might help to improve this and open new avenues for developing automated language affiliation workflows. Here, we present neural network models that use lexical and grammatical data from a worldwide sample of more than 1,200 languages with known affiliations to classify individual languages into families. In line with the traditional assumption of most linguists, our results show that models trained on lexical data alone outperform models solely based on grammatical data, whereas combining both types of data yields even better performance. In additional experiments, we show how our models can identify long-ranging relations between entire subgroups, how they can be employed to investigate potential relatives of linguistic isolates, and how they can help us to obtain first hints on the affiliation of so far unaffiliated languages. We conclude that models for automated language affiliation trained on lexical and grammatical data provide comparative linguists with a valuable tool for evaluating hypotheses about deep and unknown language relations.

Annotating and Inferring Compositional Structures in Numeral Systems Across Languages
Arne Rubehn | Christoph Rzymski | Luca Ciucci | Katja Bocklage | Alžběta Kučerová | David Snee | Abishek Stephen | Kellen Parker van Dam | Johann-Mattis List
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Numeral systems across the world’s languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved in their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy as the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.

Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data
Annika Tjuka | Robert Forkel | Christoph Rzymski | Johann-Mattis List
Proceedings of the 16th International Conference on Computational Semantics

Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.

Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists
David Snee | Luca Ciucci | Arne Rubehn | Kellen Parker van Dam | Johann-Mattis List
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Multilingual wordlists play a crucial role in comparative linguistics. While many studies have been carried out to test the power of computational methods for language subgrouping or divergence time estimation, few studies have put the data upon which these studies are based to a rigorous test. Here, we conduct a first experiment that tests the robustness of concept translation as an integral part of the compilation of multilingual wordlists. Investigating the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families, we find that on average, only 83% of all translations yield the same word form, while identical forms in terms of phonetic transcriptions can only be found in 23% of all cases. Our findings can prove important when trying to assess the uncertainty of phylogenetic studies and the conclusions derived from them.

Partial Colexifications Improve Concept Embeddings
Arne Rubehn | Johann-Mattis List
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

While the embedding of words has revolutionized the field of Natural Language Processing, the embedding of concepts has received much less attention so far. A dense and meaningful representation of concepts, however, could prove useful for several tasks in computational linguistics, especially those involving cross-linguistic data or sparse data from low resource languages. First methods that have been proposed so far embed concepts from automatically constructed colexification networks. While these approaches depart from automatically inferred polysemies, attested across a larger number of languages, they are restricted to the word level, ignoring lexical relations that would only hold for parts of the words in a given language. Building on recently introduced methods for the inference of partial colexifications, we show how they can be used to improve concept embeddings in meaningful ways. The learned embeddings are evaluated against lexical similarity ratings, recorded instances of semantic shift, and word association data. We show that in all evaluation tasks, the inclusion of partial colexifications lead to improved concept representations and better results. Our results further show that the learned embeddings are able to capture and represent different semantic relationships between concepts.

Everybody Likes to Sleep: A Computer-Assisted Comparison of Object Naming Data from 30 Languages
Alžběta Kučerová | Johann-Mattis List
Proceedings of the 13th Global Wordnet Conference

2024

Invited paper: Computer-Assisted Language Comparison with EDICTOR 3
Johann-Mattis List | Kellen van Dam
Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change

Generating Feature Vectors from Phonetic Transcriptions in Cross-Linguistic Data Formats
Arne Rubehn | Jessica Nieder | Robert Forkel | Johann-Mattis List
Proceedings of the Society for Computation in Linguistics 2024

Resource Acquisition for Understudied Languages: Extracting Wordlists from Dictionaries for Computer-assisted Language Comparison
Frederic Blum | Johannes Englisch | Alba Hermida Rodriguez | Rik van Gijn | Johann-Mattis List
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024

Comparative wordlists play a crucial role for historical language comparison. They are regularly used for the identification of related words and languages, or for the reconstruction of language phylogenies and proto-languages. While automated solutions exist for the majority of methods used for this purpose, no standardized computational or computer-assisted approaches for the compilation of comparative wordlists have been proposed so far. Up to today, scholars compile wordlists by sifting manually through dictionaries or similar language resources and typing them into spreadsheets. In this study we present a semi-automatic approach to extract wordlists from machine-readable dictionaries. The transparent workflow allows to build user-defined wordlists for individual languages in a standardized format. By automating the search for translation equivalents in dictionaries, our approach greatly facilitates the aggregation of individual resources into multilingual comparative wordlists that can be used for a variety of purposes.

First Steps Towards the Integration of Resources on Historical Glossing Traditions in the History of Chinese: A Collection of Standardized Fǎnqiè Spellings from the Guǎngyùn
Michele Pulini | Johann-Mattis List
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Due to the peculiar nature of the Chinese writing system, it is difficult to assess the pronunciation of historical varieties of Chinese. In order to reconstruct ancient pronunciations, historical glossing practices play a crucial role. However, although studied thoroughly by numerous scholars, most research has been carried out in a qualitative manner, and no attempt at providing integrated resources of historical glossing practices has been made so far. Here, we present a first step towards the integration of resources on historical glossing traditions in the history of Chinese. Our starting point are so-called fǎnqiè spellings in the Guǎngyùn, one of the early rhyme books in the history of Chinese, providing pronunciations for more than 20000 Chinese characters. By standardizing digital versions of the resource using tools from computational historical linguistics, we show that we can predict historical spellings with high precision and at the same time shed light on the precision of ancient glossing practices. Although a considerably small first step, our resource could be the starting point for an integrated, standardized collection that could ultimately shed new light on the history of Chinese.

A Computational Model for the Assessment of Mutual Intelligibility Among Closely Related Languages
Jessica Nieder | Johann-Mattis List
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Closely related languages show linguistic similarities that allow speakers of one language to understand speakers of another language without having actively learned it. Mutual intelligibility varies in degree and is typically tested in psycholinguistic experiments. To study mutual intelligibility computationally, we propose a computer-assisted method using the Linear Discriminative Learner, a computational model developed to approximate the cognitive processes by which humans learn languages, which we expand with multilingual semantic vectors and multilingual sound classes. We test the model on cognate data from German, Dutch, and English, three closely related Germanic languages. We find that our model’s comprehension accuracy depends on 1) the automatic trimming of inflections and 2) the language pair for which comprehension is tested. Our multilingual modelling approach does not only offer new methodological findings for automatic testing of mutual intelligibility across languages but also extends the use of Linear Discriminative Learning to multilingual settings.

Linguistic Survey of India and Polyglotta Africana: Two Retrostandardized Digital Editions of Large Historical Collections of Multilingual Wordlists
Robert Forkel | Johann-Mattis List | Christoph Rzymski | Guillaume Segerer
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The Linguistic Survey of India (LSI) and the Polyglotta Africana (PA) are two of the largest historical collections of multilingual wordlists. While the originally printed editions have long since been digitized and shared in various forms, no editions in which the original data is presented in standardized form, comparable with contemporary wordlist collections, have been produced so far. Here we present digital retro-standardized editions of both sources. For maximal interoperability with datasets such as Lexibank the two datasets have been converted to CLDF, the standard proposed by the Cross-Linguistic Data Formats initiative. In this way, an unambiguous identification of the three main constituents of wordlist data – language, concept and segments used for transcription – is ensured through links to the respective reference catalogs, Glottolog, Concepticon and CLTS. At this level of interoperability, legacy material such as LSI and PA may provide a reasonable complementary source for language documentation, filling in gaps where original documentation is not possible anymore.

Are Sounds Sound for Phylogenetic Reconstruction?
Luise Häuser | Gerhard Jäger | Johann-Mattis List | Taraka Rama | Alexandros Stamatakis
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average, to the gold standard phylogenies than phylogenies reconstructed from sound correspondences.

2023

Detecting Lexical Borrowings from Dominant Languages in Multilingual Wordlists
John E. Miller | Johann-Mattis List
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Language contact is a pervasive phenomenon reflected in the borrowing of words from donor to recipient languages. Most computational approaches to borrowing detection treat all languages under study as equally important, even though dominant languages have a stronger impact on heritage languages than vice versa. We test new methods for lexical borrowing detection in contact situations where dominant languages play an important role, applying two classical sequence comparison methods and one machine learning method to a sample of seven Latin American languages which have all borrowed extensively from Spanish. All systems perform well, with the supervised machine learning system outperforming the classical systems. A review of detection errors shows that borrowing detection could be substantially improved by taking into account donor words with divergent meanings from recipient words.

Information-Theoretic Characterization of Vowel Harmony: A Cross-Linguistic Study on Word Lists
Julius Steuer | Johann-Mattis List | Badr M. Abdullah | Dietrich Klakow
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

We present a cross-linguistic study of vowel harmony that aims to quantifies this phenomenon using data-driven computational modeling. Concretely, we define an information-theoretic measure of harmonicity based on the predictability of vowels in a natural language lexicon, which we estimate using phoneme-level language models (PLMs). Prior quantitative studies have heavily relied on inflected word-forms in the analysis on vowel harmony. On the contrary, we train our models using cross-linguistically comparable lemma forms with little or no inflection, which enables us to cover more under-studied languages. Training data for our PLMs consists of word lists offering a maximum of 1000 entries per language. Despite the fact that the data we employ are substantially smaller than previously used corpora, our experiments demonstrate the neural PLMs capture vowel harmony patterns in a set of languages that exhibit this phenomenon. Our work also demonstrates that word lists are a valuable resource for typological research, and offers new possibilities for future studies on low-resource, under-studied languages.

Representing and Computing Uncertainty in Phonological Reconstruction
Johann-Mattis List | Nathan Hill | Robert Forkel | Frederic Blum
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

Despite the inherently fuzzy nature of reconstructions in historical linguistics, most scholars do not represent their uncertainty when proposing proto-forms. With the increasing success of recently proposed approaches to automating certain aspects of the traditional comparative method, the formal representation of proto-forms has also improved. This formalization makes it possible to address both the representation and the computation of uncertainty. Building on recent advances in supervised phonological reconstruction, during which an algorithm learns how to reconstruct words in a given proto-language relying on previously annotated data, and inspired by improved methods for automated word prediction from cognate sets, we present a new framework that allows for the representation of uncertainty in linguistic reconstruction and also includes a workflow for the computation of fuzzy reconstructions from linguistic data.

Trimming Phonetic Alignments Improves the Inference of Sound Correspondence Patterns from Multilingual Wordlists
Frederic Blum | Johann-Mattis List
Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Sound correspondence patterns form the basis of cognate detection and phonological reconstruction in historical language comparison. Methods for the automatic inference of correspondence patterns from phonetically aligned cognate sets have been proposed, but their application to multilingual wordlists requires extremely well annotated datasets. Since annotation is tedious and time consuming, it would be desirable to find ways to improve aligned cognate data automatically. Taking inspiration from trimming techniques in evolutionary biology, which improve alignments by excluding problematic sites, we propose a workflow that trims phonetic alignments in comparative linguistics prior to the inference of correspondence patterns. Testing these techniques on a large standardized collection of ten datasets with expert annotations from different language families, we find that the best trimming technique substantially improves the overall consistency of the alignments, showing a clear increase in the proportion of frequent correspondence patterns and words exhibiting regular cognate relations.

2022

A New Framework for Fast Automated Phonological Reconstruction Using Trimmed Alignments and Sound Correspondence Patterns
Johann-Mattis List | Robert Forkel | Nathan Hill
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

Computational approaches in historical linguistics have been increasingly applied during the past decade and many new methods that implement parts of the traditional comparative method have been proposed. Despite these increased efforts, there are not many easy-to-use and fast approaches for the task of phonological reconstruction. Here we present a new framework that combines state-of-the-art techniques for automated sequence comparison with novel techniques for phonetic alignment analysis and sound correspondence pattern detection to allow for the supervised reconstruction of word forms in ancestral languages. We test the method on a new dataset covering six groups from three different language families. The results show that our method yields promising results while at the same time being not only fast but also easy to apply and expand.

The SIGTYP 2022 Shared Task on the Prediction of Cognate Reflexes
Johann-Mattis List | Ekaterina Vylomova | Robert Forkel | Nathan Hill | Ryan Cotterell
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

This study describes the structure and the results of the SIGTYP 2022 shared task on the prediction of cognate reflexes from multilingual wordlists. We asked participants to submit systems that would predict words in individual languages with the help of cognate words from related languages. Training and surprise data were based on standardized multilingual wordlists from several language families. Four teams submitted a total of eight systems, including both neural and non-neural systems, as well as systems adjusted to the task and systems using more general settings. While all systems showed a rather promising performance, reflecting the overwhelming regularity of sound change, the best performance throughout was achieved by a system based on convolutional networks originally designed for image restoration.

2020

CLDFBench: Give Your Cross-Linguistic Data a Lift
Robert Forkel | Johann-Mattis List
Proceedings of the Twelfth Language Resources and Evaluation Conference

While the amount of cross-linguistic data is constantly increasing, most datasets produced today and in the past cannot be considered FAIR (findable, accessible, interoperable, and reproducible). To remedy this and to increase the comparability of cross-linguistic resources, it is not enough to set up standards and best practices for data to be collected in the future. We also need consistent workflows for the “retro-standardization” of data that has been published during the past decades and centuries. With the Cross-Linguistic Data Formats initiative, first standards for cross-linguistic data have been presented and successfully tested. So far, however, CLDF creation was hampered by the fact that it required a considerable degree of computational proficiency. With cldfbench, we introduce a framework for the retro-standardization of legacy data and the curation of new datasets that drastically simplifies the creation of CLDF by providing a consistent, reproducible workflow that rigorously supports version control and long term archiving of research data and code. The framework is distributed in form of a Python package along with usage information and examples for best practice. This study introduces the new framework and illustrates how it can be applied by showing how a resource containing structural and lexical data for Sinitic languages can be efficiently retro-standardized and analyzed.

2019

An Automated Framework for Fast Cognate Detection and Bayesian Phylogenetic Inference in Computational Historical Linguistics
Taraka Rama | Johann-Mattis List
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present a fully automated workflow for phylogenetic reconstruction on large datasets, consisting of two novel methods, one for fast detection of cognates and one for fast Bayesian phylogenetic inference. Our results show that the methods take less than a few minutes to process language families that have so far required large amounts of time and computational power. Moreover, the cognates and the trees inferred from the method are quite close, both to gold standard cognate judgments and to expert language family trees. Given its speed and ease of application, our framework is specifically useful for the exploration of very large datasets in historical linguistics.

Automatic Inference of Sound Correspondence Patterns across Multiple Languages
Johann-Mattis List
Computational Linguistics, Volume 45, Issue 1 - March 2019

Sound correspondence patterns play a crucial role for linguistic reconstruction. Linguists use them to prove language relationship, to reconstruct proto-forms, and for classical phylogenetic reconstruction based on shared innovations. Cognate words that fail to conform with expected patterns can further point to various kinds of exceptions in sound change, such as analogy or assimilation of frequent words. Here I present an automatic method for the inference of sound correspondence patterns across multiple languages based on a network approach. The core idea is to represent all columns in aligned cognate sets as nodes in a network with edges representing the degree of compatibility between the nodes. The task of inferring all compatible correspondence sets can then be handled as the well-known minimum clique cover problem in graph theory, which essentially seeks to split the graph into the smallest number of cliques in which each node is represented by exactly one clique. The resulting partitions represent all correspondence patterns that can be inferred for a given data set. By excluding those patterns that occur in only a few cognate sets, the core of regularly recurring sound correspondences can be inferred. Based on this idea, the article presents a method for automatic correspondence pattern recognition, which is implemented as part of a Python library which supplements the article. To illustrate the usefulness of the method, I present how the inferred patterns can be used to predict words that have not been observed before.

2018

Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics?
Taraka Rama | Johann-Mattis List | Johannes Wahle | Gerhard Jäger
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We evaluate the performance of state-of-the-art algorithms for automatic cognate detection by comparing how useful automatically inferred cognates are for the task of phylogenetic inference compared to classical manually annotated cognate sets. Our findings suggest that phylogenies inferred from automated cognate sets come close to phylogenies inferred from expert-annotated ones, although on average, the latter are still superior. We conclude that future work on phylogenetic reconstruction can profit much from automatic cognate detection. Especially where scholars are merely interested in exploring the bigger picture of a language family’s phylogeny, algorithms for automatic cognate detection are a useful complement for current research on language phylogenies.

2017

Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists
Gerhard Jäger | Johann-Mattis List | Pavel Sofroniev
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Most current approaches in phylogenetic linguistics require as input multilingual word lists partitioned into sets of etymologically related words (cognates). Cognate identification is so far done manually by experts, which is time consuming and as of yet only available for a small number of well-studied language families. Automatizing this step will greatly expand the empirical scope of phylogenetic methods in linguistics, as raw wordlists (in phonetic transcription) are much easier to obtain than wordlists in which cognate words have been fully identified and annotated, even for under-studied languages. A couple of different methods have been proposed in the past, but they are either disappointing regarding their performance or not applicable to larger datasets. Here we present a new approach that uses support vector machines to unify different state-of-the-art methods for phonetic alignment and cognate detection within a single framework. Training and evaluating these method on a typologically broad collection of gold-standard data shows it to be superior to the existing state of the art.

A Web-Based Interactive Tool for Creating, Inspecting, Editing, and Publishing Etymological Datasets
Johann-Mattis List
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

The paper presents the Etymological DICtionary ediTOR (EDICTOR), a free, interactive, web-based tool designed to aid historical linguists in creating, editing, analysing, and publishing etymological datasets. The EDICTOR offers interactive solutions for important tasks in historical linguistics, including facilitated input and segmentation of phonetic transcriptions, quantitative and qualitative analyses of phonetic and morphological data, enhanced interfaces for cognate class assignment and multiple word alignment, and automated evaluation of regular sound correspondences. As a web-based tool written in JavaScript, the EDICTOR can be used in standard web browsers across all major platforms.

2016

Using Sequence Similarity Networks to Identify Partial Cognates in Multilingual Wordlists
Johann-Mattis List | Philippe Lopez | Eric Bapteste
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Concepticon: A Resource for the Linking of Concept Lists
Johann-Mattis List | Michael Cysouw | Robert Forkel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present an attempt to link the large amount of different concept lists which are used in the linguistic literature, ranging from Swadesh lists in historical linguistics to naming tests in clinical studies and psycholinguistics. This resource, our Concepticon, links 30 222 concept labels from 160 conceptlists to 2495 concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts. The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it allows researchers a quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations.

2014

A Benchmark Database of Phonetic Alignments in Historical Linguistics and Dialectology
Johann-Mattis List | Jelena Prokić
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In the last two decades, alignment analyses have become an important technique in quantitative historical linguistics and dialectology. Phonetic alignment plays a crucial role in the identification of regular sound correspondences and deeper genealogical relations between and within languages and language families. Surprisingly, up to today, there are no easily accessible benchmark data sets for phonetic alignment analyses. Here we present a publicly available database of manually edited phonetic alignments which can serve as a platform for testing and improving the performance of automatic alignment algorithms. The database consists of a great variety of alignments drawn from a large number of different sources. The data is arranged in a such way that typical problems encountered in phonetic alignment analyses (metathesis, diversity of phonetic sequences) are represented and can be directly tested.

2013

An Open Source Toolkit for Quantitative Historical Linguistics
Johann-Mattis List | Steven Moran
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Using Network Approaches to Enhance the Analysis of Cross-Linguistic Polysemies
Johann-Mattis List | Anselm Terhalle | Matthias Urban
Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Short Papers

2012

LexStat: Automatic Detection of Cognates in Multilingual Wordlists
Johann-Mattis List
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Venues