The Classical Language Toolkit: An NLP Framework for Pre-Modern Languages

This paper announces version 1.0 of the Classical Language Toolkit (CLTK), an NLP framework for pre-modern languages. The vast majority of NLP, its algorithms and software, is created with assumptions particular to living languages, thus neglecting certain important characteristics of largely non-spoken historical languages. Further, scholars of pre-modern languages often have different goals than those of living-language researchers. To fill this void, the CLTK adapts ideas from several leading NLP frameworks to create a novel software architecture that satisfies the unique needs of pre-modern languages and their researchers. Its centerpiece is a modular processing pipeline that balances the competing demands of algorithmic diversity with pre-configured defaults. The CLTK currently provides pipelines, including models, for almost 20 languages.


Introduction
Pre-modern (or historical) languages are linguistically no different from those with speakers living today. Differences, however, manifest in how pre-modern languages are preserved, to what extent they are preserved, how they may be analyzed, and the ends to which they are studied. NLP consists of "computational techniques for the purpose of learning, understanding, and producing human language content" (Hirschberg and Manning, 2015, 261). In principle, such techniques may be applied to pre-modern languages. But because NLP, its algorithms and software, presumes living languages, there remains a significant void in NLP for pre-modern languages.
The Classical Language Toolkit (CLTK) is a Python library that borrows ideas from state-of-the-art NLP software in order to cater to the particular needs of pre-modern languages and their researchers.1 Its centerpiece is a modular processing pipeline that balances the competing demands of algorithmic diversity with pre-configured defaults. The CLTK currently provides pipelines, including models, for almost 20 languages. This architecture allows for relatively easy customization of currently available pipelines, as well as their extension to new languages.

NLP for Pre-modern Languages
The authors adopt the term pre-modern to encompass the ISO 639-3 definitions of ancient (whose speakers died over 1,000 years ago), extinct (whose speakers died within the last 200-300 years), and historic (distinct antecedents to living languages) (SIL International). The CLTK aims to treat all such languages, as they survive in written texts, from the 33rd century B.C. (Sumerian) up until the start of the A.D. 19th century.2 Pre-modern languages have traits distinguishing them from living languages, including:

• A finite corpus: Since native speakers no longer generate new texts, corpora may be too small for some machine learning algorithms, thus requiring rules-based or hybrid approaches. In some cases, a language's corpus may be small enough that it can be fully annotated.3
• Variation: Corpora of pre-modern languages are likely to demonstrate greater variation than those of living languages. This may include non-standardized orthography, regional dialects, and temporal language change (over spans of hundreds and even thousands of years).4
• Limited resources: Interest in pre-modern languages is largely scholarly or religious, meaning less funding from government and industry for the creation of resources such as text corpora, treebanks, and lexica.

1 http://cltk.org. Begun in 2014, v. 0.1 was a collection of user-submitted NLP algorithms, plus models, for about a dozen pre-modern languages. In this 1.0 release, the CLTK offers a standard API and pre-configured processing pipelines. contains some earlier history and concepts behind v. 0.1. The MIT-licensed code is available in version control (https://github.com/cltk/cltk) and packaged on PyPI (with pip install cltk).
2 This cutoff date need not be absolute, as the date of introduction of the printing press may be taken into consideration. The press, which spread asynchronously, normalizes orthography and reduces copyist errors (Eisenstein, 1979, 181-225), thus obviating the need for some of the CLTK's tools. As orthography stabilizes, coming closer to contemporary usage, living-language NLP becomes increasingly tractable. The Chinese movable-type press (A.D. 11th century) could be considered an exception, though modern metal typefaces, with attendant productivity gains, were not applied to Chinese texts until the mid-19th century (Wilkinson, 2000, 451-453). The Sumerian date comes from Michalowski (2004, 19).
These three differences spur the need for NLP specific to pre-modern languages.

Researchers of Pre-modern Languages
Researchers of pre-modern languages have concerns that are likely philological, linguistic, or pedagogical. Philology is an approach to pre-modern writing that focuses on the historical origins of texts; it is comparative as well as genealogical in nature (Turner, 2014, x). Historical linguists study diachronic change in a language itself, as opposed to philologists' focus upon written language.5 Educators have unique concerns, too, including foremost that students generally do not learn by speaking and that they begin studying difficult, original texts within a year of study. In the classroom, a high premium is put upon sight translation, which is accomplished by the sub-tasks of identifying words' parts of speech, grammatical constructions, and lexical headwords.6 These three objectives may find some representation among users of living-language NLP;7 however, they are not significant stimuli to industrial and governmental research.

3 As with Gothic, for which the only sizable surviving evidence is a 6th-century manuscript containing a 4th-century translation of the Bible (Miller, 2019, 1, 8-15), most of which the PROIEL project has annotated (Haug and Jøhndal, 2008).
4 Sumerian, for example, survived 3,000 years (Michalowski, 2004, 19). Piotrowski (2012, 14-22) introduces the categories of difference (diachronic spelling variation), variance (synchronic spelling variation), and uncertainty (information loss during digital transcription).
5 On linguists' focus on spoken language change: Hock (1991, 1-10) and Campbell (2013, 1-5); on the contrast with philology: Hock (1991, 3-5) and Campbell (2013, 373, 391-392). Philology is fundamentally "interpretation of textual data" (Hock, 1991, 5).
6 See Adams (2016) on the origins of this pedagogy in the English-speaking world.
7 E.g., for second language acquisition (Inniss et al., 2006).

Previous Work
Two software architectural patterns, the framework and the pipeline, are most relevant to the CLTK's design.
As NLP matured in the early 2000s, frameworks (or toolkits) emerged with the purpose of making the technology easier for non-specialists to use. To this end, these frameworks generally have beginner-friendly documentation, value diversity in algorithms, treat multiple languages, provide data sets, help with text preprocessing, and provide pre-trained models.8 Of these characteristics, the CLTK especially values multilingual and multi-algorithmic NLP, the latter being necessary to accommodate the varying state of data sets for pre-modern languages. The CLTK bears particular similarity to the quanteda library for the R language (Benoit et al., 2018), as it contains novel algorithms yet also "wraps" other NLP libraries.
Several NLP frameworks have popularized the pipeline processing architecture, in which default algorithms (tokenization, POS tagging, dependency parsing, etc.) are run in series upon input text. Algorithms may be added or removed from a default pipeline. Increasingly, frameworks use identical algorithms for every language, without special consideration for a language's nuances.
Aside from the CLTK, NLP tools for pre-modern languages have been uncommon,9 despite a steady growth of language resources.10 Pre-modern languages are often low-resource. Low-resource software applications, however, have tended toward transcription11 and, in the case of endangered languages, language preservation.12 An interesting exception may be UralicNLP (Hämäläinen, 2019), which provides algorithms intended for relatively small data sets in Finnish and related languages.

System Design
An NLP pipeline within a framework architecture standardizes I/O while preserving algorithmic diversity. The CLTK should provide:

• Modular processing pipelines: Each language should come with a pre-configured pipeline set to defaults expected by most users. A user should be able to modify, replace, and add processes to a pipeline. Pipelines may be adjusted for new languages.
• Diversity of algorithms: When there are several popular ways researchers perform a particular process (e.g., tagging entities with a word list or a neural network), the CLTK should support them both. Due to limited language resources, such as digitized texts and treebanks, machine learning at times may not be tractable (and if so, then only certain algorithms).13 While rules-based approaches often do not adapt to the dynamism of living languages, they can perform well in restricted tasks within narrow domains.14
• Standard I/O: To optimize user productivity and facilitate scholarly communication, an API should accept standard input for all human languages. Likewise, when linguistically justified, outputs should be expressed using data structures and representations that are shared across languages.

Architecture and Usage
The CLTK has one primary interface, NLP(), and five custom data types. When a user calls NLP.analyze(), it outputs a Doc, which contains all processed information. Doc.words holds a list of Word objects, each of which contains token-level information added by each Process. A Pipeline contains a list of Process objects for a given language.

NLP()
The CLTK's NLP() class offers a common interface for all languages, for which a pipeline of NLP algorithms is called. Calling analyze(), the class's only public method, triggers each Process in succession. The CLTK executes the algorithms and returns a Doc object. Code Block 1 illustrates its use. 15
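Since Code Block 1 is not reproduced here, the flow it illustrates can be approximated in plain Python. The class and attribute names (NLP, Pipeline, Process, Doc, Word) follow the description above, but the bodies are an illustrative sketch, not the CLTK's actual implementation; the whitespace tokenizer and the sample Latin sentence are stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    """Token-level container; each Process adds attributes to it."""
    string: str
    stop: Optional[bool] = None


@dataclass
class Doc:
    """Returned by NLP.analyze(); accumulates all processed information."""
    raw: str
    words: List[Word] = field(default_factory=list)


class Process:
    """Base class: a Process reads a Doc and returns an updated Doc."""
    def run(self, doc: Doc) -> Doc:
        raise NotImplementedError


class TokenizationProcess(Process):
    """Toy whitespace tokenizer; sets Word.string for each token."""
    def run(self, doc: Doc) -> Doc:
        doc.words = [Word(string=tok) for tok in doc.raw.split()]
        return doc


@dataclass
class Pipeline:
    """A language's pre-configured list of Processes."""
    language: str
    processes: List[Process]


class NLP:
    """Common interface: analyze() triggers each Process in succession."""
    def __init__(self, pipeline: Pipeline):
        self.pipeline = pipeline

    def analyze(self, text: str) -> Doc:
        doc = Doc(raw=text)
        for process in self.pipeline.processes:
            doc = process.run(doc)
        return doc


nlp = NLP(Pipeline(language="lat", processes=[TokenizationProcess()]))
doc = nlp.analyze("Gallia est omnis divisa in partes tres")
```

The design keeps each algorithm behind the same run(Doc) interface, which is what lets a user swap or reorder Processes without touching the NLP() entry point.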

Process
An algorithm in the CLTK may be called directly or wrapped in a Process that is incorporated into a Pipeline. Each of the following classes, which inherit from Process, keeps the project's algorithms organized according to the kind of NLP they contain (Figure 1).16

• NormalizeProcess: Reads Doc.raw, then performs Unicode normalization and other text transformation as required per language; outputs to Doc.normalized_text.
• TokenizationProcess: Normally the first Process run; splits the input string into word tokens; sets the string value at Word.string.
• StopsProcess: Checks whether a token is contained within a stopword list; adds a Boolean value at Word.stop.

19 At time of publication, the CLTK uses the Stanza project's pretrained models with StanzaProcess. In the future, custom-trained models (e.g., with spaCy or Stanza) will be wrapped by DependencyProcess. See also section 3.4.4 for post-processing the flat Doc.words into a tree.
20 Using fastText embeddings (Bojanowski et al., 2016) for seven languages: Arabic, Aramaic, Gothic, Latin, Old English, Pali, and Sanskrit (see ft. 18).
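The dual calling convention, either invoking the bare algorithm or running it through a Process wrapper, can be sketched as follows. The names mirror the description above, but the implementation and the three-word stopword list are illustrative, not the CLTK's.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Word:
    string: str
    stop: bool = False


@dataclass
class Doc:
    raw: str
    words: List[Word] = field(default_factory=list)


def is_stop(token: str, stops: Set[str]) -> bool:
    """The bare algorithm: a membership test against a stopword list."""
    return token.lower() in stops


class StopsProcess:
    """Process-style wrapper: checks each token, records the result at Word.stop."""
    def __init__(self, stops: Set[str]):
        self.stops = stops

    def run(self, doc: Doc) -> Doc:
        for word in doc.words:
            word.stop = is_stop(word.string, self.stops)
        return doc


# The algorithm may be called directly...
direct = is_stop("et", {"et", "in", "est"})

# ...or via the wrapper inside a pipeline-style run.
doc = Doc(raw="Gallia est omnis",
          words=[Word("Gallia"), Word("est"), Word("omnis")])
doc = StopsProcess(stops={"et", "in", "est"}).run(doc)
```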

Pipeline
A language has one Pipeline defining a list of Process objects, as illustrated in Code Block 2. The objects within Pipeline.processes are looped over when called by NLP.analyze(). Each time, a Doc is sent into the Process and a new Doc, now with an updated Doc.words, is produced. These algorithms are invoked by default, though a user may override them by declaring a custom Pipeline and passing it to NLP(). At time of publication, 19 languages have pre-configured pipelines.25
Doc

A Pipeline's Processes populate the Word objects at Doc.words, which may be accessed directly or by helper methods, such as Doc.tokens (returning a list of token strings) and Doc.embeddings (a list of arrays). When these access methods are not enough, a user may post-process the Doc and add attributes to it or the Word objects within.
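The helper accessors can be sketched as read-only properties derived from Doc.words, with post-processing as plain attribute assignment. This is an illustrative sketch: the embedding values and the bigram step are assumptions, not CLTK behavior.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    string: str
    embedding: List[float] = field(default_factory=list)


@dataclass
class Doc:
    words: List[Word] = field(default_factory=list)

    @property
    def tokens(self) -> List[str]:
        """Helper accessor: token strings pulled from Doc.words."""
        return [w.string for w in self.words]

    @property
    def embeddings(self) -> List[List[float]]:
        """Helper accessor: one embedding array per token."""
        return [w.embedding for w in self.words]


doc = Doc(words=[Word("arma", [0.1, 0.2]), Word("virumque", [0.3, 0.4])])

# Post-processing: when the accessors are not enough, a user may
# attach new attributes to the Doc itself (hypothetical example).
doc.bigrams = list(zip(doc.tokens, doc.tokens[1:]))
```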

Word
Word stores all token information. Code Block 3 shows some of what a Word object may contain.

Language
A Pipeline references these Language classes (see Code Block 2).

MorphosyntacticFeature and MorphosyntacticFeatureBundle
Beyond the categorical information at Word.pos, a language's Pipeline adds complete morphology at the Word.features accessor (see Code Block 5). The sometimes arbitrary output strings of morphological taggers ("indicative," "Indic.," etc.) are mapped to these specific CLTK classes (inheriting from MorphosyntacticFeature) that represent all features defined by version 2 of the Universal Dependencies project. 27 Hence, different taggers resolve to a common annotation schema.
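The resolution of tagger-specific spellings to one canonical class can be sketched as below. The Mood enum and the tagger output strings in the table are hypothetical examples, not the CLTK's actual feature classes or an exhaustive mapping.

```python
from enum import Enum


class Mood(Enum):
    """Canonical values for the UD v2 Mood feature (subset, for illustration)."""
    indicative = "Ind"
    subjunctive = "Sub"


# Different taggers emit different strings for the same feature;
# all are resolved to one canonical class.
TAGGER_TO_MOOD = {
    "indicative": Mood.indicative,
    "Indic.": Mood.indicative,
    "IND": Mood.indicative,
    "subjunctive": Mood.subjunctive,
    "Subj.": Mood.subjunctive,
}


def resolve_mood(raw_tag: str) -> Mood:
    """Map one tagger's arbitrary output string to the shared annotation schema."""
    return TAGGER_TO_MOOD[raw_tag]
```

Because every tagger's output passes through the same table, downstream code compares features by identity (Mood.indicative) rather than by fragile string matching.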

DependencyTree
The CLTK uses the "built-in" xml library to make trees for modeling dependency parses. A Word is mapped into a Form, then ElementTree is used to organize these into a DependencyTree (see Code Block 6).
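Using only the standard library's xml.etree.ElementTree, the Word-to-Form-to-tree organization can be sketched as below. The element name "form", the toy parse, and the convention that head index 0 marks the root are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# A toy dependency parse: (token index, form, head index); head 0 = root.
parse = [(1, "Gallia", 2), (2, "est", 0), (3, "divisa", 2)]


def build_tree(rows):
    """Map each token to a Form-like element, then nest children under heads."""
    forms = {i: ET.Element("form", {"string": s, "id": str(i)})
             for i, s, _ in rows}
    root = None
    for i, _, head in rows:
        if head == 0:
            root = forms[i]          # the sentence's root form
        else:
            forms[head].append(forms[i])  # attach dependent under its head
    return ET.ElementTree(root)


tree = build_tree(parse)
root = tree.getroot()
```

Reusing ElementTree means the resulting tree inherits traversal and serialization (e.g., iter(), ET.tostring()) without any custom tree code.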

FetchCorpus
Git repositories host models developed by CLTK contributors.28 When the software cannot find a required model, FetchCorpus is invoked to download the required dependency and put it within the appropriate directory at ~/cltk_data/.29

27 Annotation guidelines at Universal Dependencies (2016) and CLTK objects at cltk/morphology/universal_dependencies_annotations.py.
28 All CLTK models are stored on GitHub at: https://github.com/cltk/?q=model.
29 A language-specific Git repository is available for most languages, e.g., "lat_models_cltk" at the URI ht
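The cache-miss logic can be sketched as follows. The directory layout and function names are assumptions for illustration, and the fetch callback stands in for the Git download that FetchCorpus performs; a scratch directory stands in for ~/cltk_data/.

```python
import tempfile
from pathlib import Path


def model_dir(language: str, base: Path) -> Path:
    """Illustrative layout: a language's models live under <base>/<language>/model."""
    return base / language / "model"


def ensure_model(language: str, base: Path, fetch) -> Path:
    """Invoke the fetch callback (stand-in for FetchCorpus) only on a cache miss."""
    target = model_dir(language, base)
    if not target.exists():
        fetch(target)
    return target


base = Path(tempfile.mkdtemp())
fetched = []


def fake_fetch(target: Path) -> None:
    fetched.append(target)         # record the download request
    target.mkdir(parents=True)     # pretend the Git repository was cloned here


ensure_model("lat", base, fake_fetch)  # miss: triggers a fetch
ensure_model("lat", base, fake_fetch)  # hit: model already on disk, no fetch
```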

Conclusion and Future Work
The architecture of the CLTK v. 1.0 has the engineering rigor necessary to model the world's several hundred pre-modern languages. Currently, it serves the basic, and several more advanced, needs of researchers for 19 languages. Software alone, however, is not sufficient. The CLTK lacks formal evaluations of its models' accuracies. At time of publication, most Process definitions wrap models trained by upstream projects (e.g., Stanza). While these projects report accuracies respective to their training sets (i.e., with cross-validation), they do not provide evaluations against outside benchmarks. Unfortunately, such benchmarks do not yet exist for pre-modern languages, with the exception of the recent Sprugnoli et al.
Future work includes:
• to make a TrainingPipeline, similar to the inference Pipeline, that would standardize the training of new models;
• to normalize duplicative treebanks;30
• and to develop Internet infrastructure for training and hosting models.
These efforts will improve scientific procedure for pre-modern NLP.
Another initiative involves experimentation with transfer learning, along the lines of Multilingual BERT (Pires et al., 2019), training on all surviving pre-modern texts. Because languages are related and because texts, even in different languages, often share entities, information sharing may prove felicitous.31 The pre-modern world, its languages and peoples, was deeply networked.32 The CLTK is a comprehensive collection of NLP technologies to support the study of this history.

Acknowledgments

… the Natural Language Toolkit (NLTK), on which v. 0.1 heavily relied. The CLTK logo of a Phoenician aleph (or ʾālep), being the first letter of the first alphabet, was created by Pierre-Marie Pédrot.34