Daniel G. Swanson

Also published as: Daniel Swanson

2025

pdf bib abs
Towards Natural Language Explanations of Constraint Grammar Rules
Daniel Swanson
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP

This paper presents a general-purpose parser for static analysis of Constraint Grammar rules (that is, examining only the rules, not potential inputs and outputs) and applies it to the task of translating rules into comprehensible explanations of behavior. An interactive interface for exploring how individual components of each rule contribute to these translations is also presented.

2024

pdf bib abs
Computational Language Documentation: Designing a Modular Annotation and Data Management Tool for Cross-cultural Applicability
Alexandra O’Neil | Daniel Swanson | Shobhana Chelliah
Proceedings of the 2nd Workshop on Cross-Cultural Considerations in NLP

While developing computational language documentation tools, researchers must center the role of language communities in the process by carefully reflecting on and designing tools to support the varying needs and priorities of different language communities. This paper provides an example of how cross-cultural considerations discussed in literature about language documentation, data sovereignty, and community-led documentation projects can motivate the design of a computational language documentation tool by reflecting on our design process as we work towards developing an annotation and data management tool. We identify three recurring themes for cross-cultural consideration in the literature - Linguistic Sovereignty, Cultural Specificity, and Reciprocity - and present eight essential features for an annotation and data management tool that reflect these themes.

pdf bib abs
Converting Legacy Data to CLDF: A FAIR Exit Strategy for Linguistic Web Apps
Robert Forkel | Daniel Swanson | Steven Moran
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In the mid 2000s, there were several large-scale US National Science Foundation (NSF) grants awarded to projects aiming at developing digital infrastructure and standards for different forms of linguistics data. For example, MultiTree encoded language family trees as phylogenies in XML and LL-MAP converted detailed geographic maps of endangered languages into KML. As early stand-alone website applications, these projects allowed researchers interested in comparative linguistics to explore language genealogies and areality, respectively. However as time passed, the technologies that supported these web apps became deprecated, unsupported, and inaccessible. Here we take a future-oriented approach to digital obsolescence and illustrate how to convert legacy linguistic resources into FAIR data via the Cross-Linguistic Data Formats (CLDF). CLDF is built on the W3C recommendations Model for Tabular Data and Metadata on the Web and Metadata Vocabulary for Tabular Data developed by the CSVW (CSV on the Web) working group. Thus, each dataset is modeled as a set of tabular data files described by metadata in JSON. These standards and the tools built to validate and manipulate them provide an accessible and extensible format for converting legacy linguistic web apps into FAIR datasets.

pdf bib abs
Producing a Parallel Universal Dependencies Treebank of Ancient Hebrew and Ancient Greek via Cross-Lingual Projection
Daniel G. Swanson | Bryce D. Bussert | Francis Tyers
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper we present the initial construction of a treebank of Ancient Greek containing portions of the Septuagint, a translation of the Hebrew Scriptures (1576 sentences, 39K tokens, roughly 7% of the total corpus). We construct the treebank by word-aligning and projecting from the parallel text in Ancient Hebrew before automatically correcting systematic syntactic mismatches and manually correcting other errors.

pdf bib abs
Towards Named-Entity and Coreference Annotation of the Hebrew Bible
Daniel G. Swanson | Bryce D. Bussert | Francis Tyers
Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024

Named-entity annotation refers to the process of specifying what real-world (or, at least, external-to-the-text) entities various names and descriptions within a text refer to. Coreference annotation, meanwhile, specifies what context-dependent words or phrases, such as pronouns refer to. This paper describes an ongoing project to apply both of these to the Hebrew Bible, so far covering most of the book of Genesis, fully marking every person, place, object, and point in time which occurs in the text. The annotation process and possible future uses for the data are covered, along with the challenges involved in applying existing annotation guidelines to the Hebrew text.

2023

pdf bib abs
WITH Context: Adding Rule-Grouping to VISL CG-3
Daniel Swanson | Tino Didriksen | Francis M. Tyers
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications

This paper presents an extension to the VISL CG-3 compiler and processor which enables complex contexts to be shared between rules. This sharing substantially improves the readability and maintainability of sets of rules performing multi-step operations.

pdf bib abs
Comparing methods of orthographic conversion for Bàsàá, a language of Cameroon
Alexandra O’neil | Daniel Swanson | Robert Pugh | Francis Tyers | Emmanuel Ngue Um
Proceedings of the Fourth workshop on Resources for African Indigenous Languages (RAIL 2023)

Orthographical standardization is a milestone in a language’s documentation and the development of its resources. However, texts written in former orthographies remain relevant to the language’s history and development and therefore must be converted to the standardized orthography. Ensuring a language has access to the orthographically standardized version of all of its recorded texts is important in the development of resources as it provides additional textual resources for training, supports contribution of authors using former writing systems, and provides information about the development of the language. This paper evaluates the performance of natural language processing methods, specifically Finite State Transducers and Long Short-term Memory networks, for the orthographical conversion of Bàsàá texts from the Protestant missionary orthography to the now-standard AGLC orthography, with the conclusion that LSTMs are somewhat more effective in the absence of explicit lexical information.

2022

pdf bib abs
A Free/Open-Source Morphological Transducer for Western Armenian
Hossep Dolatian | Daniel Swanson | Jonathan Washington
Proceedings of the Workshop on Processing Language Variation: Digital Armenian (DigitAm) within the 13th Language Resources and Evaluation Conference

We present a free/open-source morphological transducer for Western Armenian, an endangered and low-resource Indo-European language. The transducer has virtually complete coverage of the language’s inflectional morphology. We built the lexicon by scraping online dictionaries. As of submission, the transducer has a lexicon of 75K words. It has over 90% naive coverage on different Western Armenian corpora, and high precision.

pdf bib abs
A Universal Dependencies Treebank of Ancient Hebrew
Daniel G. Swanson | Francis M. Tyers
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper we present the initial construction of a Universal Dependencies treebank with morphological annotations of Ancient Hebrew containing portions of the Hebrew Scriptures (1579 sentences, 27K tokens) for use in comparative study with ancient translations and for analysis of the development of Hebrew syntax. We construct this treebank by applying a rule-based parser (300 rules) to an existing morphologically-annotated corpus with minimal constituency structure and manually verifying the output and present the results of this semi-automated annotation process and some of the annotation decisions made in the process of applying the UD guidelines to a new language.

pdf bib abs
Handling Stress in Finite-State Morphological Analyzers for Ancient Greek and Ancient Hebrew
Daniel G. Swanson | Francis M. Tyers
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

Modeling stress placement has historically been a challenge for computational morphological analysis, especially in finite-state systems because lexically conditioned stress cannot be modeled using only rewrite rules on the phonological form of a word. However, these phenomena can be modeled fairly easily if the lexicon’s internal representation is allowed to contain more information than the pure phonological form. In this paper we describe the stress systems of Ancient Greek and Ancient Hebrew and we present two prototype finite-state morphological analyzers, one for each language, which successfully implement these stress systems by inserting a small number of control characters into the phonological form, thus conclusively refuting the claim that finite-state systems are not powerful enough to model such stress systems and arguing in favor of the continued relevance of finite-state systems as an appropriate tool for modeling the morphology of historical languages.

Co-authors

Jonathan Washington 1

Venues

ws4
lrec3
cgmta2
coling2
lt4hala2
show all...