Compositionality of Complex Graphemes in the Undeciphered Proto-Elamite Script using Image and Text Embedding Models

We introduce a language modeling architecture which operates over sequences of images, or over multimodal sequences of images with associated labels. We use this architecture alongside other embedding models to investigate a category of signs called complex graphemes (CGs) in the undeciphered proto-Elamite script. We argue that CGs have meanings which are at least partly compositional, and we discover novel rules governing the construction of CGs. We find that a language model over sign images produces more interpretable results than a model over text or over sign images and text, which suggests that the names given to signs may be obscuring signals in the corpus. Our results reveal previously unknown regularities in proto-Elamite sign use that can inform future decipherment efforts, and our image-aware language model provides a novel way to abstract away from biases introduced by human annotators.


Introduction
This work sets out to understand a category of signs called complex graphemes (CGs) in the undeciphered proto-Elamite (PE) script, a writing system from ancient Iran dating to approximately 3300-2900 BC (Dahl et al., 2013). PE is partly contemporaneous with the world's other two earliest writing systems, Egyptian hieroglyphs and proto-cuneiform, and is the least deciphered of the three, with the underlying language(s) remaining unknown. PE was used exclusively as an accounting technology, employing several numerical systems whose bundling principles are known. Although written in continuous lines, PE, like proto-cuneiform, is most comparable to an accountant's spreadsheet; some structures and rules governing sign use have been identified (Hawkins, 2015; Dahl et al., 2018; Englund, 2004). Our code, data, and trained models are available at https://github.com/sfu-natlang/pe-compositionality.
The corpus consists of approximately 1500 published clay tablets from excavations in Iran, almost all of which exist in electronic transliteration following the conventions of a work-in-progress sign list (Dahl, 2006). As with other decipherments, understanding the nature of signs and the nuances of sign use is as important as identifying the underlying language(s). Meaningful information can be recovered and the texts partly "read" even if the language remains unknown.
To better understand sign usage in PE, this work proposes an architecture for image-aware language modeling, which permits sharing information between visually similar signs much as sub-word units share information between words. We use sign embeddings to demonstrate patterns which are not readily apparent due to the complexity of the accounting system and the large number of sign shapes found in the script. Our analysis offers insights on complex graphemes that can aid in future hypothesis generation. We confirm that some transliteration choices by PE specialists capture meaningful semantic divisions in the script; this is not a trivial fact, due to the large number of similar-looking signs. By using image-aware models, we also observe that some signs with distinct names receive very similar embeddings, implying a functional equivalence that could be exploited by merging signs to create a less sparse corpus that is more amenable to analysis by NLP methods.

Methodology
As described by Dahl (2005), CGs in proto-Elamite are signs that consist of one sign inscribed within another (transliterated |S1+S2|), or of one sign framed by two instances of another (|S1+S2+S1|). Rarely, S1 and S2 occur connected at the side, as in |M296+M296|. We refer to S1 as the outer sign and S2 as the inner sign, though we acknowledge this terminology is not quite appropriate in cases like |M296+M296|. Most signs which occur as part of a CG can also occur as standalone signs. Exceptions to this are rare, such as M600, which only ever occurs in the hapax |M362+M600|.
Although these signs are orthographically compositional, it is not known whether they are also semantically compositional. Similar constructions exist in proto-cuneiform (PC), including "containers" with signs inscribed to indicate specific products (Wagensonner, 2015). Some PC compounds survive into later cuneiform, and sometimes have idiomatic meanings, e.g. cuneiform GU7 "eat", a combination of "head" and "bowl". Chinese characters likewise exhibit varying degrees of visual and semantic compositionality (Sproat, 2006).
Past work (Mikolov et al., 2013b; Salehi et al., 2015; Cordeiro et al., 2016) suggests that embedding models capture semantic compositionality in noun compounds and multiword expressions. Often, these models assign a compound a representation which is similar to the sum of the representations of the words in the compound. Thus we predict that if CGs are semantically compositional, their embeddings will be additively compositional at a higher rate than expected by chance. Their embeddings may also exhibit other signs of internal structure, such as the ability to model proportional analogy between CGs with shared components: |M136+M365| : M136 :: |M327+M365| : M327. If this analogy holds in the embedding space (which is to say that the 3CosAdd formula |M136+M365| - M136 + M327 ≈ |M327+M365| holds between the signs' embeddings), this would give further evidence that the CGs involved have some degree of semantic compositionality.
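As a minimal sketch of the 3CosAdd check described above, the following Python snippet tests the example analogy on toy vectors. The embedding values are invented purely for illustration, and the helper names `cosine` and `three_cos_add` are hypothetical; neither is taken from the released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def three_cos_add(emb, a, b, c):
    """Return the vector e_a - e_b + e_c used by the 3CosAdd formula."""
    return [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]

# Toy embeddings (made-up values, NOT learned from the PE corpus).
toy = {
    "M136": [1.0, 0.0, 0.0],
    "M327": [0.0, 1.0, 0.0],
    "|M136+M365|": [1.0, 0.0, 1.0],
    "|M327+M365|": [0.0, 1.0, 1.0],
}

# |M136+M365| - M136 + M327 should land near |M327+M365| if the analogy holds.
predicted = three_cos_add(toy, "|M136+M365|", "M136", "M327")
similarity = cosine(predicted, toy["|M327+M365|"])
print(round(similarity, 3))
```

In a real embedding space the similarity would rarely be exact, which is why the evaluation reported later uses a nearest-neighbor threshold rather than equality.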
Unfortunately, most PE signs are rare, which impedes a model's ability to learn meaningful information about their distributions. Yet many signs with distinct names have striking visual resemblances, and it is usually not known whether they have different meanings. Visual information may therefore help an embedding model by allowing it to share distributional information across graphically similar signs. To this end, we propose an architecture for multimodal language modeling in Figure 1.
This architecture uses two separate embedding components. On the left of Figure 1, in red, is a standard embedding layer which replaces a one-hot input with a small, learnable representation. On the right, in blue, a lookup function retrieves an image of the sign represented by the input. A CNN extracts a feature vector from the image, which is max-pooled, flattened, and passed through a dense layer to produce a low-dimensional embedding. Both embeddings are concatenated and fed to a BiLSTM (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) which attempts to predict the name of the next sign in the text. All timesteps share the same weights for the CNN and embedding layers. By omitting the blue image-embedding component we can obtain a normal BiLSTM language model. By omitting the red text-based component, we can obtain an image-only model which never directly sees the labels assigned to the signs in the corpus.
To verify that this architecture captures distributional properties of signs, and not just visual properties, we train a separate image recognition model to predict a sign's name given only its image. This model uses the blue image embedding component from Figure 1 to produce a representation of an input image; a dense layer predicts the name of the sign from this embedding. This model only sees signs in isolation, meaning it will not learn from distributional information.
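The two embedding branches and their concatenation can be illustrated with a small plain-Python sketch. This is only a schematic of the data flow under simplifying assumptions: the CNN is replaced by precomputed toy feature maps, the BiLSTM is omitted entirely, and every name (`embed_sign`, `cnn_features`, `use_text`, `use_image`) and numeric value is hypothetical rather than drawn from the released implementation.

```python
def global_max_pool(feature_maps):
    """Keep the largest activation of each 2-D feature map."""
    return [max(max(row) for row in fmap) for fmap in feature_maps]

def dense(vec, weights, bias):
    """Plain affine layer: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def embed_sign(sign, text_table, cnn_features, dense_w, dense_b,
               use_text=True, use_image=True):
    """Concatenate the label-based and image-based embeddings of one sign.

    Switching off one branch mimics the text-only or image-only variants.
    """
    parts = []
    if use_text:
        # Label branch: an ordinary embedding lookup keyed by sign name.
        parts.extend(text_table[sign])
    if use_image:
        # Image branch: pooled (stand-in) CNN features through a dense layer.
        pooled = global_max_pool(cnn_features[sign])
        parts.extend(dense(pooled, dense_w, dense_b))
    return parts

# Toy parameters (made-up values standing in for learned weights).
text_table = {"M288": [0.1, 0.2]}
cnn_features = {"M288": [[[0.0, 0.5], [0.3, 0.1]],
                         [[1.0, 0.2], [0.4, 0.9]]]}
dense_w = [[1.0, 0.0], [0.0, 1.0]]
dense_b = [0.0, 0.0]

multimodal = embed_sign("M288", text_table, cnn_features, dense_w, dense_b)
image_only = embed_sign("M288", text_table, cnn_features, dense_w, dense_b,
                        use_text=False)
print(multimodal, image_only)
```

In the image-only configuration the sign name is used solely to look up the picture, so the resulting vector carries no direct label information.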
If a result holds for the multimodal LM but not for this image recognition model, this implies that the result arises from contextual information in the text, and not simply from visual resemblances between signs. We also train CBoW and skipgram models with FastText (Bojanowski et al., 2017) and word2vec (Mikolov et al., 2013a), as well as GloVe embeddings (Pennington et al., 2014). Table 1 summarizes all of the models used in this work and important hyperparameters.
We train these models on the PE corpus from Born et al. (2019), which is a cleaned version of texts originally published by the Cuneiform Digital Library Initiative (CDLI). This contains digitized transliterations from 1399 tablets comprising 11013 lines in total, or 33778 tokens. 7508 tokens represent broken or unreadable signs, and another 11364 represent numerals, leaving only 14906 non-numerical tokens. 1107 tokens (comprising nearly half the sign types in our cleaned data) are labeled as CGs. We treat each entry of a tablet as a single input sentence for training LMs, and set aside 500 lines as a validation set.
Prior to training, we replace all signs occurring 3 or fewer times with UNK. We replace rare signs wherever they occur, including inside of CGs. The tokens X and ... represent broken or unreadable signs, so we also replace these with UNK.

Experimental Results

Additive Composition
We predict that if a CG is semantically compositional, its embedding will approximately equal the sum of the embeddings of the signs it comprises.
Given a sign $s$, let $e_s$ denote the embedding of $s$. If $s$ is a CG, let $\sigma(s)$ denote the list of signs which make up $s$. For every CG $s$ in the signary, we check whether $\sum_{t \in \sigma(s)} e_t \approx e_s$. If $\sum_{t \in \sigma(s)} e_t$ is within the $k$ nearest neighbors of $e_s$ for some threshold $k$, we say that $s$ appears to have a compositional representation.
For different thresholds k, we measure how many CGs have compositional representations. Since many PE signs have low frequency, we predict that noise may drown out any signal when k is small. However, when k is large enough to overcome this noise, we predict that the number of CGs with compositional representations will be greater than expected by chance, as we expect that some CGs have meanings which are semantically compositional rather than idiomatic. Table 2 shows the results from this evaluation.
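The evaluation just described can be sketched in Python as follows. The reading of "within the k nearest neighbors" used here is an assumption: the summed vector counts as a neighbor if fewer than k other signs are more cosine-similar to the CG's embedding than the sum is. The function names and the toy embedding values are hypothetical and serve only to illustrate the procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def components(cg):
    """Split a CG name such as '|M327+M348|' into its component sign names."""
    return cg.strip("|").split("+")

def is_compositional(cg, emb, k):
    """Assumed test: is the sum of the parts among the k nearest neighbors?"""
    target = emb[cg]
    parts = [emb[name] for name in components(cg)]
    summed = [sum(vals) for vals in zip(*parts)]
    sum_sim = cosine(summed, target)
    # Count signs that sit closer to the CG than the summed vector does.
    closer = sum(1 for name, vec in emb.items()
                 if name != cg and cosine(vec, target) > sum_sim)
    return closer < k

# Toy embeddings (made-up values, NOT learned from the PE corpus).
toy = {
    "M327": [1.0, 0.0],
    "M348": [0.0, 1.0],
    "M004": [1.0, 0.1],
    "|M327+M348|": [0.9, 1.1],
    "|M348+M004|": [1.0, -1.0],
}
cgs = [name for name in toy if name.startswith("|")]

for k in (1, 3):
    count = sum(is_compositional(cg, toy, k) for cg in cgs)
    print(f"k={k}: {count} of {len(cgs)} CGs look compositional")
```

Because `components` returns a list, a framed CG of the form |S1+S2+S1| would contribute the outer sign's embedding twice; whether that matches the intended definition of σ(s) is left open here.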
In text-only models, when k is small the number of CGs with compositional representations is no higher than expected by chance. However, for image-aware models, and for text-only models with large k, the number of CGs which are close to the sum of their components is significant. Even for k = 15, the signs identified as compositional by lm.image.64 average >0.97 cosine similarity to the sum of their parts, suggesting this is not too generous a threshold.
Notably, the number of compositional CGs in lm.image.64 is always larger than the number in any of the other models, including the image recognition model. This has the important implication that compositionality in the embeddings is not solely a consequence of visual compositionality. If that were the case, the contextual information available to the LM would not be useful for this task, and the image LM would not be expected to find more compositional CGs than the image recognition model. Moreover we would not expect to find a significant amount of compositionality in any of the text-only models for any k. Table 3 shows examples of signs which appear to be compositional in the image LM but not the image recognition model. These are signs for which contextual information plays a deciding role in making them appear semantically compositional, and which may therefore be of interest to analyze in future work. We emphasize that the text-only models have no information about sub-words (such as CG components), so any compositionality in these models must exclusively reflect distributional properties. From these results we conclude that there is legitimate evidence for some CGs having semantically compositional meanings in PE.

Pairing Consistency
To assess the contribution of a sign to the CGs it occurs in, we consider the pairing consistency score (PCS) from Fournier et al. (2020). This metric measures whether the offsets between pairs of words are more parallel than expected by chance. If a sign s always contributes the same meaning to the CGs in which it occurs, then the offset between the pair of signs (t, |t + s|) is expected to be roughly parallel to the offset between the pair (u, |u + s|) for most choices of t and u. If CGs containing s have idiomatic meanings (so the contribution of s is not consistent), the offsets between such pairs are not likely to be parallel. Thus PCS serves as a proxy for compositionality, and allows us to investigate the impact of individual signs on the representations of CGs in which they occur. This is distinct from a measure like mutual information which depends on raw sign counts and does not account for the internal structure of sign embeddings.
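The intuition about parallel offsets can be illustrated with a short Python sketch. Note that this is not the PCS of Fournier et al. (2020), which additionally calibrates against chance; it is a simplified stand-in that only averages the pairwise cosine similarity of offsets within one relation. The sign names `t`, `u`, `s`, `r`, the helper names, and all vector values are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def offset(emb, base, compound):
    """Vector pointing from a standalone sign to a CG that contains it."""
    return [c - b for b, c in zip(emb[base], emb[compound])]

def mean_offset_parallelism(emb, pairs):
    """Average cosine similarity over all pairs of offsets in a relation."""
    offsets = [offset(emb, base, cg) for base, cg in pairs]
    sims = [cosine(a, b) for a, b in combinations(offsets, 2)]
    return sum(sims) / len(sims)

# Toy embeddings (made-up values, NOT learned from the PE corpus).
toy = {
    "t": [1.0, 0.0, 0.0], "|t+s|": [1.0, 0.0, 1.0],
    "u": [0.0, 1.0, 0.0], "|u+s|": [0.0, 1.0, 2.0],
    "|t+r|": [1.0, 0.0, 1.0], "|u+r|": [1.0, 1.0, 0.0],
}

# s shifts both outer signs in the same direction; r does not.
consistent = [("t", "|t+s|"), ("u", "|u+s|")]
inconsistent = [("t", "|t+r|"), ("u", "|u+r|")]

print(mean_offset_parallelism(toy, consistent))
print(mean_offset_parallelism(toy, inconsistent))
```

A sign whose offsets all point the same way would score near 1 under this proxy, while a sign contributing idiomatic, unrelated meanings would score near 0.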
For each sign $s$ we construct two relations. $R_{s,\mathrm{in}}$ contains all CGs with $s$ as the inner sign, paired with whichever sign forms the outer part of the CG. $R_{s,\mathrm{out}}$ contains all CGs with $s$ as the outer sign, paired with whichever sign forms the inner part of the CG. Formally, given a CG $c$ containing a sign $s$, let $\delta(c, s)$ denote the element of $c$ which is not $s$. Further, let $I(s)$ be the set of all CGs with $s$ as the inner element and $O(s)$ be the set of all CGs with $s$ as the outer element. Then $R_{s,\mathrm{in}} = \{(\delta(c, s), c) : c \in I(s)\}$ and $R_{s,\mathrm{out}} = \{(\delta(c, s), c) : c \in O(s)\}$.
Table 4 reports the average PCS of $R_{s,\mathrm{in}}$ and $R_{s,\mathrm{out}}$ for each model, averaged across all signs $s$. On average, we find that inner signs have higher PCS than outer signs. This difference is statistically significant in the image-aware LMs, the image recognition model, and FastText. This implies that inner signs have a more consistent and predictable impact on the representation of compounds in which they occur. The fact that this holds for some text-only models as well as for the image-aware LMs implies that it is due to distributional properties of signs and not simply their appearance.
Fournier et al. (2020) note that different categories of relations in English have different average PCS. They find that relations involving inflectional morphology (for example, between a verb and its gerund) have high PCS, relations involving derivational morphology (as between heat and reheat) have lower PCS, and other semantic relations (as between hot and cold) have the lowest PCS of the relations they examine.
We expect that absolute PCS values will not be comparable between PE and English, owing to the very different nature of the two writing systems. However, it may be possible to draw broad comparisons between different categories. As the category with the highest PCS, inner signs appear to pattern with inflectional morphology, while outer signs pattern more closely with regular lexical items. This does not imply that inner signs actually encode inflectional morphology: most PE signs likely correspond to objects or ideograms, and most types of morphological marking were absent in the earliest phases of Near Eastern writing (Nissen et al., 1993). Rather, we suggest that inner signs may offer minor refinements to the meaning of an outer sign without fundamentally changing its value, parallel to the way that inflecting a verb refines its role in a sentence but does not change its basic meaning.

Analogy
Our PCS results measure sign behaviour in aggregate, but do not provide specific examples of relations between signs. We augment these results by searching for concrete analogies which hold in the embedding models.
Given two CGs $s$ and $t$, let $s - t$ denote the signs that are in $s$ but not $t$, and let $s \cap t$ denote the signs both CGs have in common. Consider the vector
$$A(s, t) = e_s - \sum_{u \in s - t} e_u + \sum_{u \in t - s} e_u.$$
This vector represents the analogical formula $s : (s - t) :: t : (t - s)$. If $A(s, t) \approx e_t$ in a particular embedding model, then this analogy appears to hold true according to that model.
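A minimal Python sketch of one way to build such an analogy vector is given below. It assumes the 3CosAdd reading of the formula s : (s − t) :: t : (t − s), namely that the embeddings of signs found only in s are subtracted from e_s and the embeddings of signs found only in t are added. The function names and toy values are hypothetical.

```python
def components(cg):
    """Split a CG name such as '|M157+M342|' into its component sign names."""
    return cg.strip("|").split("+")

def shared_signs(s, t):
    """Signs that the two CGs have in common (s ∩ t)."""
    return set(components(s)) & set(components(t))

def analogy_vector(s, t, emb):
    """Assumed form: e_s minus signs only in s, plus signs only in t."""
    only_s = set(components(s)) - set(components(t))
    only_t = set(components(t)) - set(components(s))
    vec = list(emb[s])
    for name in sorted(only_s):
        vec = [v - e for v, e in zip(vec, emb[name])]
    for name in sorted(only_t):
        vec = [v + e for v, e in zip(vec, emb[name])]
    return vec

# Toy embeddings (made-up values, NOT learned from the PE corpus).
toy = {
    "M157": [2.0, 0.0],
    "M304": [0.0, 2.0],
    "M342": [0.5, 0.5],
    "|M157+M342|": [2.5, 0.5],
    "|M304+M342|": [0.5, 2.5],
}

s, t = "|M157+M342|", "|M304+M342|"
print(shared_signs(s, t))
print(analogy_vector(s, t, toy), toy[t])
```

The resulting vector can then be compared against the embedding of t with whatever nearest-neighbor criterion the evaluation uses.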
We compute how often $A(s, t)$ is within the $k$ nearest neighbors of $e_t$ for different thresholds $k$ when $s \cap t \neq \emptyset$. We also compute how often $A(s, t)$ is close to $e_t$ when $s$ and $t$ are randomly chosen CGs. We predict that CGs which have signs in common also have some meaning in common, and consequently that the former value will be significantly larger than the latter value. Table 5 shows the results of this evaluation. As in the compositionality task, more analogies hold between CGs with shared components in image-aware models than in text-only models, and the largest number by far occur in the image LM. Once again, in lm.image.64 the target vector averages >0.97 similarity to the computed vector even when k = 15. Bold numbers in the table represent cases where analogies are significantly more likely to hold between CGs with shared components than between random pairs of CGs. We see that the number of analogies is larger than expected by chance even in some text-only models, suggesting that there is a meaningful relationship between some CGs which have elements in common. The fact that the image LM outperforms the image recognition model further implies that these analogies reflect legitimate distributional properties and are not purely due to visual resemblance.
Figure 2: Containment hierarchy for a subset of the signs which can occur in CGs. Directed edges point from outer signs to the inner signs they can contain. Note that (excluding self-loops) the graph is acyclic and all edges point from higher nodes to lower ones. Thicker edges represent CGs which are more strongly compositional. Nodes are colored according to modularity class (Blondel et al., 2008) such that nodes are most strongly connected to like-colored nodes. Full hierarchy, showing all signs which occur in CGs, is available in supplemental material.
As was the case for additive compositionality, the image+text LM underperforms the image-only LM, and on this task the difference is much more pronounced. This suggests that sign names act as distractors: if sign names conveyed information which was helpful to the analogy task, their inclusion would be expected to improve performance. This fact has implications about the labeling of the data which we return to in Section 4.
Taken altogether, the results suggest that many CGs have compositional meanings which can be understood by comparison to the meanings of their component parts and the other CGs with which they share components. We next consider which pairs of signs are able to combine into CGs and which pairings are never observed.

CG Containment Rules
Some signs which occur as the inner part of one CG may also occur as the outer part of another, as with M348 in |M327+M348| and |M348+M004|. We might therefore expect to find pairs of signs where either one can contain the other, yet no such pairs actually exist. In fact, we find that CGs appear to be constructed according to a strict hierarchy whereby a sign may only contain itself or another sign which is lower on this hierarchy. We can visualize this as a lattice with directed edges from outer signs to the inner signs they are observed to contain, as in Figure 2 (excerpted from the full hierarchy available in the supplemental material). The thickness of an edge in this figure is proportional to the compositionality of the corresponding CG in lm.image+text.64.
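One way to test for such a hierarchy is to build a directed graph from outer signs to inner signs and check that it is acyclic once self-loops are set aside. The Python sketch below does this with Kahn's topological sort; it is an illustration rather than the procedure used to produce Figure 2. The function names are hypothetical, and the CG |M004+M327| in the second example is invented to show what a violation would look like.

```python
from collections import defaultdict, deque

def containment_edges(cgs):
    """Collect (outer, inner) edges from CG names, ignoring self-containment."""
    edges = set()
    for cg in cgs:
        parts = cg.strip("|").split("+")
        outer, inner = parts[0], parts[1]
        if outer != inner:  # a sign may contain itself, so skip self-loops
            edges.add((outer, inner))
    return edges

def hierarchy_order(edges):
    """Kahn's algorithm: a top-down ordering, or None if a cycle exists."""
    children = defaultdict(set)
    indegree = defaultdict(int)
    nodes = set()
    for outer, inner in edges:
        nodes.update((outer, inner))
        children[outer].add(inner)
        indegree[inner] += 1
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in sorted(children[node]):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Any node left unvisited must lie on a cycle.
    return order if len(order) == len(nodes) else None

attested = ["|M327+M348|", "|M348+M004|", "|M296+M296|"]
print(hierarchy_order(containment_edges(attested)))

# Hypothetical, unattested CG that would break the hierarchy.
with_violation = attested + ["|M004+M327|"]
print(hierarchy_order(containment_edges(with_violation)))
```

Taking the first component as the outer sign also covers framed CGs of the form |S1+S2+S1|, since S1 appears first in the transliteration.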
There appears to be some relation between a sign's compositionality and its position in this lattice. The signs on the left half of Figure 2 have low compositionality (seen as thinner edges in the figure) while the nodes to the right have higher compositionality (seen here as thicker edges). This suggests that there may be different kinds of CG, of which some are idiomatic and some are not, and that these categories have sufficiently little overlap to appear as separate modules in the lattice.
This "grammar" governing CG construction has not been noted in previous PE scholarship. The ordering of signs within this hierarchy deserves attention in future work, as it may reflect different levels of administrative units in PE society, degrees of specificity in qualifying commodities, or other information which can be exploited to understand the content of these texts.

Analysis
Little is known about the role of CGs in PE, although these signs make up a significant portion of the corpus. Some occur in "headers" appearing at the beginning of a text. In headers, outer signs (such as M157) are hypothesized to indicate the type of household or institution to which the entire account relates. The outer sign may be further specified by an inner sign, but many (including M157) can also appear without another sign inscribed within. Inner signs are hypothesized to specify a particular kind of item being recorded, a person, profession, or administrative department related to an account, and more.
Our results are consistent with these hypotheses. The PCS results point to inner signs playing a specializing role; this is corroborated by visual inspection of the embedding space, which reveals that CGs cluster according to their outer sign rather than their inner sign (cf. Figure 3 below).
According to Table 2, our text-only models detect additive composition in at most one of every 10 CGs; the image LM detects it in one of every 4 CGs. Likewise, the image LM suggests that a meaningful analogical relation obtains between slightly less than one-third of all pairs of CGs with signs in common. These values depend on the threshold k, but they suggest the presence of at least a small core of compositional CGs in PE. In several places, compositional and non-compositional CGs appear separated from one another in the CG containment hierarchy (cf. Figure 2), which may point to this being a legitimate distinction in the writing system and not a failure of our models to detect compositionality in some cases where it is really present.
We can make some inferences about the CGs which are compositional. They are not likely to represent either combinations of ideograms with an emergent lexical value (like the Sumerian cuneiform sign for nan "drink" combining the signs for human head and water) or ideograms with phonetic complements (signs indicating the proper reading of the CG), as both cases should be expected to produce non-compositional meanings. Our results may also counter-indicate "coat-of-arms"-like symbols (Farmer et al., 2004), since we show that the components of CGs can often be understood in relation to their use elsewhere in texts, and since CG elements on their own often seem to reference products (including foodstuffs and livestock) and their distribution. Future work may train embedding models on proto-cuneiform, a structurally similar writing system containing compound signs with occasionally known meanings that could act as useful points of comparison.
The two components of a CG can occur independently, within the same text or even side-by-side. A dramatic example comes from |M218+M288|, the components of which appear 37 times as the bigram M218 M288. M288 ("grain container") is the most frequent sign in PE, appearing in diverse contexts but often before numerical measures of capacity. M218 is among the signs speculated to function "syllabically" to write personal names (Dahl, 2019), though it may also have other uses. It is not clear yet whether |M218+M288| and M218 M288 operate identically, particularly since |M218+M288| is not strongly additively compositional in any of our embedding models. The possible polyvalence of M218 and broad distribution of M288 may impact models' ability to detect compositionality in |M218+M288|. Despite this difficulty, the image LM identifies analogies between |M218+M288|, |M175+M288|, and |M305+M288| (the analogy vector has >0.99 cosine similarity to the target in both cases), implying that we should at least consider M218, M175, and M305 as parallel categories each with relation to grain capacities.
Some signs rarely occur outside of CGs, such as the productive inner sign M342, about which practically nothing is known. Our data show that it has moderately high PCS (0.69 in lm.image.64) and that analogies hold between all but one of the CGs which contain M342 (|M157+M342|, |M304+M342|, |M305+M342|, |M325+M342|, |M327+M342|, and |M351+M342|, excluding |M153+M342|). These analogies hold strongly for the image LM but not the image recognition model, meaning they reflect primarily distributional properties. Many of these signs are also additively compositional. We believe that these signs may be suitable starting points for future analysis, as our results imply that they are probably not idiomatic and are likely to have related meanings.
Table 6 gives additional examples of analogies which hold in lm.image+text.64. We see that inner and outer signs both participate in analogical relations, as do both |S1+S2|-type CGs and |S1+S2+S1|-type CGs. Some analogies hold between a CG with a numeric inner sign and one with a non-numeric inner sign, as between |M036+1(N39C)| and |M036+M035|. Such cases may have implications for the meaning of the signs involved; if 1(N39C) and M035 truly have parallel functions in these two CGs, this may imply a kind of quantifying role for M035, or alternatively that 1(N39C) is used for its pronunciation or possible syllabic value rather than as a true numeral. The existence of other M036 compounds containing numerals (e.g. |M036+1(N30D)| and |M036+1(N14)|) would seem to favor the former interpretation.
The image-only LM found stronger signals for compositionality and analogical relations than the image+text LM, suggesting that sign names acted as distractors for those tasks. This has significant implications for the ongoing process of revising the PE sign list. Our work relies on the sign labels assigned through an exhaustive manual transliteration process; since it is easy to automate merging signs, this process assumed that most signs are unique until proven otherwise. However, we now believe this choice weakens signals in the text data by making most signs very rare. Moreover, some signs which appear graphically compositional are not currently labeled as CGs, usually when the inner part is never attested as a standalone sign. For these reasons, future work may benefit from relabeling signs based on a combination of context and sign shape.
At the same time, the current transliteration system may record meaningful (if fine-grained) information reflected in minor graphical details (consider M263 and M262), such as (hypothetically) "jug of red beer" versus "jug of dark beer". Such similarly functioning signs might obtain similar embeddings, but retaining their distinction in the published transliterations still improves our understanding of the texts. However, for both manual and machine-learning analysis, significant reductions in the sign list may open new avenues for decipherment: for instance, Born et al. (2019) note that frequency-based approaches to decipherment are currently difficult in PE owing to the very small number of repeated n-grams in the corpus.
Figure 3 shows details from the embedding spaces learned by GloVe, the image LM, and the image recognition model. GloVe produces small clusters of visually similar signs even though it does not have access to sign images: note the proximity of M353, M354, and 2(N30C), as well as the variants of M036. These clusters occur in sufficient number that we have confidence the model is detecting meaningful similarities in the usage of visually similar signs. The image recognition model produces much clearer groupings of visually related signs, as would be expected. The image LM replicates some clusters from the image recognition model: a cluster of lozenge-shaped signs is visible in both the image LM figure and the image recognition figure. However, contextual information causes the image LM to relocate other lozenge-shaped signs like M218 to a different part of the embedding space, implying a functional difference between it and the signs in the figure.
Overall, these observations confirm that our multimodal architecture is finding a balance between contextual and visual information as intended.

Related Work
Sun et al. (2019) introduce "character-enhanced" embeddings of Chinese words. Their architecture roughly parallels our own, but requires a deeper CNN due to the visual complexity of Chinese characters. We train with a full-context language modeling objective whereas they use a sampling scheme similar to word2vec. They use character-level information to improve word embeddings, whereas we exclusively learn character embeddings. Our application of this architecture to decipherment is novel.
Liu et al. (2017) explicitly learn compositional embeddings for Chinese characters. They use supervised data to help identify when two visually distinct signs use the same radical (as in 水 and 池). In our data, it is not known which signs are truly related to one another, thus we refrain from giving the model explicit information about compositionality.
Yin et al. (2019) segment and transcribe undeciphered scripts based on visual similarities between glyphs. Although their transcription error rate is high, they still achieve partial decipherments with no human intervention. Dencker et al. (2020) perform OCR-style sign detection on images of Sumerian cuneiform tablets, recognizing signs which may be written very differently across the corpus. Their task benefits from the existence of supervised Sumerian training data.
Born et al. (2019) train topic models on PE texts and cluster PE signs in a simple mutual information-based embedding model. The present work considers more sophisticated embedding models and performs a more detailed investigation of the embedding space.
Luo et al. (2019) perform automated decipherment of Ugaritic. Their technique finds alignments between orthographic representations of phonetic information, and thus is not easily applicable to ideographic scripts. It also requires multilingual data, and cannot extract information from a script with no known surviving relatives.
Our work exploits the embedding space learned by a neural language model, but the actual task of language modeling is otherwise irrelevant to our results. By contrast, Kambhatla et al. (2018) actually sample text from a neural language model to help estimate the quality of a proposed decipherment. Future work could similarly sample from a language model as a means of counteracting the small size of the PE corpus; this should be done with caution, however, given the difficulty of evaluating whether the sampled text is fluent.
Salehi et al. (2015) and Cordeiro et al. (2016) demonstrate that English word embeddings tend to be additively compositional and can capture human intuitions about semantic compositionality. Hartung et al. (2017) investigate other methods for decomposing word embeddings.
Sproat (2006) discusses a variety of writing systems and the degrees to which they employ phonetic versus semantic information. The discussion is largely taxonomic and addresses subtle nuances between scripts which are already well-understood. In this way it demonstrates the wide range of variation observed between scripts, and by extension the range of possibilities which should be considered when analyzing an undeciphered script such as PE.

Conclusion
Interpreting what a word embedding model has learned typically involves a comparison to native speaker intuitions. In contrast, in this work we have shown how exploiting graphical compositionality and carefully examining sequences of image embeddings can lead to new insights in proto-Elamite (PE), an undeciphered script with no living users and relatively little available data. Abstracting away from human annotations, we introduced a novel architecture for multimodal or image-based language modeling, which shares information between visually similar signs to better model contextual patterns. This provides a new toolkit for decipherment of an unknown language, distinct from translation-based approaches.
PE is one of the world's earliest experiments in writing, employing 774 signs and variants by current estimates, and reasonable concerns have existed over its level of standardisation and the impact this may have on decipherment (Dahl, 2019:71, 82). The corpus is small and filled with lacunae, and prior work has done little to understand how NLP techniques function on early writing systems which may reflect linguistic content differently from modern writing systems. Despite these challenges, this work has shown that embedding models can indeed identify meaningful patterns in proto-Elamite.
We have presented evidence that a subset of complex graphemes are semantically compositional rather than idiomatic, and we have discovered the existence of a simple grammar or partial ordering which appears to govern the construction of CGs. Our results should give domain experts confidence that the proto-Elamite script contains sufficient regularities to allow for describing its mechanics and potentially understanding the underlying content.