Verifying Annotation Agreement without Multiple Experts: A Case Study with Gujarati SNACS



Introduction
Most NLP research focuses on only a few languages: a small fraction of the world's roughly 7,000 languages have datasets or linguistic tools. For example, the Universal Dependencies project (Nivre et al., 2016), perhaps the most linguistically diverse community effort in recent years, covers only about 130 languages. Even widely spoken languages like Gujarati and Hausa, both with about 50 million native speakers (Eberhard et al., 2022), almost equaling the population of England, are considered low-resource in the NLP context. In a "rich-get-richer" effect, these trends can increase disparities across the world's languages.
Progress in NLP for high-resource languages like English has been powered not only by advances in modeling techniques, but also by high-quality linguistic datasets like the Penn Treebank (Marcus et al., 1994), PropBank (Palmer et al., 2005; Pradhan et al., 2022), OntoNotes (Hovy et al., 2006), and TimeBank (Pustejovsky et al., 2003). Making and tracking progress in low-resource languages requires building similarly large, high-quality datasets. But what defines a good dataset? The standard way to measure the quality of a manually annotated dataset involves computing an inter-annotator agreement metric such as the ubiquitous kappa score (Artstein and Poesio, 2008). The very notion of inter-annotator agreement hinges on the availability of at least two annotators. But multiple annotators might not be available, e.g., for a low-resource language. In this paper, we ask: Can we verify annotation quality when only a single expert annotator is available? To address this open question, we observe that evaluating the quality of annotated data involves measuring the compatibility between labels proposed by one annotator and a second source of information about the labels. For inter-annotator agreement, the second source is another human trained on the same task. We take the position that for low-resource languages where multiple annotators are unavailable, we can relax the requirement that the second source be a human expert in the same language. To illustrate this position, we propose four weak annotation verifiers and outline the scenarios in which they can be helpful.
We evaluate the efficacy of our verifiers via a semantic annotation effort in Gujarati, a low-resource Indic language. We consider the task of supersense disambiguation of adpositions, which are known to be extremely polysemous. We use an inventory of adpositional supersenses (SNACS; Schneider et al., 2015, 2018) to construct an annotated corpus of a Gujarati version of the book The Little Prince. Experiments with this data indicate that our verifiers can be successfully employed in scenarios where a second expert annotator is unavailable.

The Problem of the Single Annotator
Dataset creation in NLP typically involves multiple annotators (experts or crowd workers) labeling a corpus using the task definition and annotation guidelines, followed by a manual or heuristic adjudication step to construct the aggregated ground truth. Datasets form the backbone of computational linguistics and NLP research; ensuring their quality is of paramount importance. Their quality is commonly measured using annotator agreement, and metrics such as Cohen's kappa (Cohen, 1960) reflect consensus (Artstein and Poesio, 2008). A good inter-annotator agreement (IAA) score (over 0.6, per Landis and Koch, 1977) implies a better agreed-upon dataset, whereas a poor one may indicate gaps in the task definition or annotation guidelines. Interesting insights can be drawn by viewing IAA scores alongside model performances. For instance, a dataset with high human consensus and a poor model score suggests a task seemingly simple for humans, like common sense reasoning, but difficult for our models (Talmor et al., 2019).
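The chance-corrected agreement underlying the discussion above can be sketched in a few lines. This is a minimal illustration of Cohen's kappa for two annotators over nominal labels, not code from the study:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 1 indicates perfect agreement; 0 indicates agreement no better than chance.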
However, human agreement is undefined when we have only one annotator. This could happen when: (1) the task requires specialized expertise, like biomedical named entity tagging (Sazzed, 2022), or (2) the language does not have readily accessible NLP expertise (Hedderich et al., 2021), such as the Universal Dependencies annotations for the K'iche' (Tyers and Henderson, 2021) and Breton (Tyers and Ravishankar, 2018) languages. In this paper, we study the question of evaluating the annotation quality of such singly annotated datasets.
The Principle. Measuring agreement requires two separate sources of annotation, which we will call the primary and secondary sources. In a multiple-annotator setup, both are human. When annotation by multiple experts is not possible, we should consider other available resources. In this work, we suggest several resources that can serve as the secondary annotation source. These can be in the form of pre-trained contextualized embeddings, parallel corpora, human expertise in a cognate language, or native speakers of the target language who are not linguistically inclined. In §4, we propose verifiers that use these resources as secondary annotation sources, and we evaluate their effectiveness in §5.

Gujarati SNACS: A Case Study
This section introduces a new semantic annotation dataset that will serve as a testbed for our verifiers.

Background
Adpositions (pre-, post-, and inpositions) and case markers are ubiquitous grammatical components that bear diverse semantic relations and are extremely polysemous (Litkowski and Hargraves, 2006; Müller et al., 2010, and others). Schneider et al. (2015) categorized their semantic behavior into coarse-grained categories called supersenses. Hwang et al. (2017) argued that a single supersense label is insufficient to capture the semantic nuances of adpositional usage. They theorized the idea of construal, where adpositions are labeled for: a) their meaning in the broader scene, i.e., the scene role, and b) the meaning coded by the adposition alone, i.e., the function. Schneider et al. (2018) defined a hierarchy of fifty supersenses called SNACS (Semantic Network of Adposition and Case Supersenses) and annotated a corpus of English prepositions with construals. SNACS has since been extended to multiple languages; annotated corpora exist in Korean (Hwang et al., 2020) and Hindi (Arora et al., 2022), among other languages.
In this work, we extend the SNACS project to Gujarati, an Indic language spoken in western India with about 56 million L1 speakers (Eberhard et al., 2022). Yet it remains under-resourced in NLP research. Gujarati grammars (Tisdall, 1892; Doctor, 2004) discuss the syntactic usage and diversity of Gujarati adpositions and case markers, but their semantic versatility is hitherto unstudied. Gujarati is closely related to its somewhat higher-resource cousin Hindi, especially in adposition usage.

Dataset and Annotation
A bilingual speaker of Gujarati and Hindi annotated all adpositions in Nānakado Rājakumār, a Gujarati translation of the novella Le Petit Prince (The Little Prince; de Saint-Exupéry, 1943), following its use for SNACS annotation in other languages. Since Gujarati has adpositional usages similar to Hindi's, the annotator followed the Hindi-Urdu guidelines v1.0 (Arora et al., 2021), while referring to the English guidelines v2.5 (Schneider et al., 2017) for definitions.
We show some examples below with the annotations for the adposition highlighted in bold.
In example (1-a), the ergative marker conveys the agency of the action of giving to Sam in the phrase. Hence, the marker gets an AGENT scene role, which is the same as its function, since the ergative marker is prototypically used to describe agency. However, in example (1-b), the locative adposition par is used to convey an instrumentative relation between the phone and the action of talking, meriting different scene role and function annotations.

Weak Verifiers of Annotation Quality
This section presents four weak verifiers that assess annotation quality. We introduce each one by first stating the prerequisite resources needed to use it. We refer to the low-resource language of interest as the target language. We note that the verifiers are all weak: they are not meant to replace a second annotator, but can help gauge dataset quality in the absence of multiple annotators.

Using Contextualized Representations
Prerequisites. Pre-trained contextualized representations in the target language.
The research program of training contextualized embeddings (e.g., Devlin et al., 2019) using massive amounts of unlabeled text now extends to multiple languages (e.g., Conneau et al., 2020). We propose that, besides their use for model building, these embeddings can also help verify annotations.
For this purpose, we use DIRECTPROBE (Zhou and Srikumar, 2021), a heuristic that probes embeddings using their geometric properties. It clusters labeled instances in an embedding space such that each cluster contains examples with the same label. The number of clusters indicates the linear separability of labels in that space: if the number of clusters equals the number of labels, then the labeled points are linearly separable by a classifier.
Given a singly annotated dataset, we can project the annotation targets into an embedding space using a pre-trained representation. If the representation disagrees with the human, it would place an example within a cluster associated with the "wrong" label, breaking the cluster into sub-clusters. Consequently, the number of clusters would increase as the disagreement increases.
Merely verifying a one-to-one mapping between labels and clusters is insufficient. We need to compare against how random label assignments behave. If examples were randomly labeled, we should obtain a large number of clusters (in the worst case, almost as many as the number of examples). Two factors determine the affinity of an annotation with an embedding: (1) each label in the annotation should occupy a separate region (i.e., a distinct cluster) in the embedding space, and (2) if the labels are randomly shuffled across examples, the number of clusters should increase. The latter accounts for the possibility of labels being grouped in the embedding space by chance. We define a metric, CONTEXTUAL REPRESENTATION AFFINITY (CRA), that takes both factors into account to assess the chance-corrected affinity (as in the kappa score) between an annotation and an embedding:

CRA = 1 − C_org / C_rand

Here, C_org is the number of clusters produced by DIRECTPROBE with an annotated dataset, and C_rand is the number of clusters obtained when its labels are randomly shuffled while ensuring that the label distribution is conserved.
The design of the CRA metric is inspired by Cohen's kappa, and it can be interpreted as quantifying the regularity introduced into the embedded points beyond a random labeling. When labels are grouped into a small number of clusters (i.e., low C_org), but random labeling leads to a large number of clusters (i.e., high C_rand), the CRA will be high. This means that the representation agrees with the annotations, and the annotations bear information that goes beyond chance. However, a low CRA score does not guarantee disagreement. With a low CRA score, we need to look at the C_org and C_rand values. When C_org is high, labels occupy overlapping regions of the embedding space and the labeled data has low affinity with the embedding. However, when both counts are low and close to the number of labels, both label sets occupy distinct regions of the embedding space. In such a case, CRA is not conclusive. Table 2 summarizes these four scenarios.
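Given the two cluster counts, the CRA computation reduces to a ratio. The sketch below assumes the kappa-style form CRA = 1 − C_org/C_rand implied by the description above; the cluster counts themselves would come from DIRECTPROBE, which is not reproduced here:

```python
import random

def cra(c_org, c_rand):
    """Chance-corrected affinity from the two cluster counts: high when the
    annotation yields few clusters (low c_org) but a random relabeling
    fragments the embedding space (high c_rand)."""
    return 1.0 - c_org / c_rand

def shuffle_labels(labels, seed=0):
    """Random relabeling that conserves the label distribution, as required
    when estimating C_rand."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

When C_org equals C_rand, the score is 0, reflecting the inconclusive case discussed above.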

Using Cognate Language Annotation
Prerequisites. An annotated corpus in a cognate language, and a bilingual or multilingual expert annotator for the target language.
Some annotation projects (e.g., our case study) involve parallel annotated corpora. Existing annotation in a cognate language can be used by manually or automatically aligning sentences and comparing annotations (manual alignment requires a bilingual annotator). We can then measure agreement of the labels assigned to the aligned components. A similar approach was undertaken by Daza and Frank (2020) for semantic role labeling. They use mBERT (Devlin et al., 2019) embeddings to align predicates and arguments from English to various other target languages, and project gold English annotations to the target languages.
Two points are worth noting. First, the cognate language need not be a high-resource language. Second, this approach is inapplicable when the labels are not preserved across translations. For example, for the task of grammatical gender classification, we can use this verification strategy only if both languages follow the same gender classes and assign the same gender to translations of nouns.
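The core of this verifier, comparing labels across an alignment, can be sketched as follows. The token ids, label dictionaries, and alignment map are hypothetical stand-ins for the aligned corpora, and the raw agreement shown here would be replaced by a chance-corrected score such as kappa:

```python
def aligned_label_pairs(target_labels, cognate_labels, alignment):
    """Collect (target, cognate) label pairs for aligned annotation targets.

    target_labels / cognate_labels: token id -> supersense label.
    alignment: target token id -> cognate token id; targets without a
    labeled counterpart are skipped, mirroring unalignable adpositions."""
    pairs = []
    for t_id, c_id in alignment.items():
        if t_id in target_labels and c_id in cognate_labels:
            pairs.append((target_labels[t_id], cognate_labels[c_id]))
    return pairs

def observed_agreement(pairs):
    """Raw agreement on the aligned pairs (kappa would correct for chance)."""
    return sum(a == b for a, b in pairs) / len(pairs)
```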

Translate and Verify
Prerequisites. A bilingual or multilingual translator between the target and a cognate language, and an expert annotator in the cognate language.
This approach, like the previous one, requires that labels be preserved across the target-cognate translation. However, instead of relying on existing annotations and alignment tools, it requires an expert annotator in the cognate language. A bilingual speaker is required to translate the text in the target language into the cognate language while conserving the intricacies of the task. The annotator can then label the translated corpus, and the labels can be compared for agreement.

Verification Using Non-expert Annotators
Prerequisites. A pool of non-expert annotators in the target language.
Certain tasks, by design, are not amenable to crowd-sourcing due to their complexity. Much work has been dedicated to making annotation easier, by methods like enforcing an annotation curriculum (Lee et al., 2022) and iterative feedback (Nangia et al., 2021), to name a few. He et al. (2015) propose querying annotators for question-answer pairs for the semantic role labeling task, which might not be straightforward for a non-expert. Wein and Schneider (2022) propose a worker priming approach where a proxy task primes a crowd worker for a subsequent downstream task for which annotated data is required.
At a high level, verifying with non-expert annotators involves casting the target task into task(s) more favorable for annotation by non-experts (who may be anonymous crowd workers). Naturally, the simplification process will vary from task to task. In §5.4, we provide a concrete instantiation of this idea for our target task.

Evaluating the Verification Strategies
In this section, we instantiate the verification strategies from §4 for the Gujarati SNACS annotation task to empirically evaluate them. We show experiments on additional datasets wherever possible.
Since the CRA score is a novel contribution of this work, in addition to presenting its scores for our dataset, we also validate the metric itself on other datasets.
Number of DIRECTPROBE clusters. We applied Zhou and Srikumar (2021)'s implementation of DIRECTPROBE to our dataset for the six embeddings. In all cases, and for both scene role and function, the number of clusters C_org obtained from the singly annotated dataset is the minimum, namely the number of labels. In other words, for both tasks, across all embeddings, each label is allocated a separate region of the embedding space.
Next, to confirm that the embeddings can recognize bad annotations, we shuffled q% of the labels for q ∈ {5, 10, 25, 50, 75, 100}. (Note that the number of clusters for q = 100% is C_rand.) Recall from Table 2 that if randomized labels do not correspond to an increased number of clusters, we cannot draw any conclusions. Figure 1 shows the trend for scene roles. We average the results over five random runs for each value of q. We observe that the number of clusters increases with increased randomization of labels for all embeddings.
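The q% randomization used in this experiment can be sketched as follows (a hypothetical helper, not the authors' code): a random q% of positions are selected and their labels are permuted among themselves, so the overall label distribution is conserved.

```python
import random

def shuffle_fraction(labels, q, seed=0):
    """Permute the labels of a random q% of instances, leaving the rest intact.
    The label multiset is unchanged, so only local assignments are corrupted."""
    rng = random.Random(seed)
    out = list(labels)
    k = round(len(out) * q / 100)
    idx = rng.sample(range(len(out)), k)   # positions to corrupt
    picked = [out[i] for i in idx]
    rng.shuffle(picked)                    # permute only the selected labels
    for i, lab in zip(idx, picked):
        out[i] = lab
    return out
```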
Gujarati CRA scores. Table 3 shows the CRA scores for the scene role and function tasks with the various embeddings. We see that all representations are similar in how they handle random label assignments. Consequently, their CRA scores, which measure the affinity of the annotation beyond chance, are in similar numeric ranges for both tasks. Figure 2 shows the behavior of CRA scores with increasing randomization of labels.
We see that the CRA scores are negatively correlated with the amount of randomization in the labels. In other words, noisier annotations (via randomization) lead to lower CRA scores.
CRA behavior. To better understand the numeric ranges of the scores, and to illustrate a failure case, we apply the approach to several existing datasets: a 10k subsample of the SNLI dataset (Bowman et al., 2015), the English SNACS STREUSLE corpus (Schneider et al., 2018), and the Estonian EWT Universal Part of Speech (UPoS) dataset (Muischnek et al., 2014). We used XLM-R large in all cases. Table 4 shows their scores.
With SNLI, we see that C_org is more than the number of labels (namely, three), suggesting a minor disagreement between the representation and the annotation. However, we also see that random annotation fares much worse (thrice the number of labels). The CRA score suggests that its affinity to the embeddings beyond chance is slightly less than that of Gujarati SNACS.
On the Estonian UPoS data, C_org is equal to the number of labels while C_rand is about seven times larger. Hence, this yields a high CRA score.
We observe low CRA scores with the English SNACS datasets. We also see that the C_rand values are small and close to their respective C_org values, placing us in the last row of Table 2. The verifier is unsuitable for this embedding-dataset combination. We conjecture that this might be due to a wider spread of the data in the embedding space, which allows even a random labeling to show clustering behavior.
Kappa vs CRA Correlation Analysis. To show that the CRA score behaves like an agreement score, we conduct an experiment using the TweetNLP data (Gimpel et al., 2011), whose original labels a follow-up study (2014) supplemented by crowd-sourcing five annotations per instance. We use the majority crowd label for this experiment, and add noise to it by shuffling q ∈ {0, 5, 10, 25, 50, 75, 100} percent of the labels. For each case, we compute the kappa against the original gold labels and also the CRA score using XLM-R large. We compute these scores with five random shuffles for each q.
Figure 3 shows a scatter plot between the scores. We observe that as noise increases, both scores decrease, and we have a high Pearson correlation of 0.915 between the two scores. This gives additional validation to the CRA metric as a measure of agreement.

Using Cognate Language Annotation
To instantiate the verifier in §4.2, we compared our Gujarati annotation with the adjudicated Hindi annotation of The Little Prince of Arora et al. (2022) by aligning sentences between the translations. A target is aligned if, in the parallel sentence, the object of the adposition and its governor match. We used chapters 4, 5, 6, 17, and 21 for this task.
A bilingual Hindi-Gujarati speaker performed the manual alignment.
Of the 757 adpositions annotated in the selected Gujarati corpus, 526 tokens could not be aligned to the Hindi corpus, as semantically equivalent sentences can be written using a different adposition (i.e., one that is not a direct translation), or even without one. For the remaining 231 aligned tokens, we computed Cohen's kappa to obtain the agreement between the Hindi and Gujarati annotations. We observe high agreement scores of 0.781 and 0.886 for scene roles and functions, respectively.
To verify whether the alignment introduced any bias, we compare the kappa scores from the double-annotation study for the aligned and the unaligned tokens of chapters 4 and 5. The IAA for the aligned tokens was 0.932 for scene roles and 0.959 for functions, while for the unaligned tokens it was 0.875 and 0.930, respectively. This suggests negligible bias due to alignment, if any. On the aligned tokens of the same chapters (i.e., chapters 4 and 5), the kappa scores between the Hindi and Gujarati aligned annotations are 0.824 and 0.876 for scene roles and functions, respectively. We observe that the Hindi alignment recovers over 90% of the kappa scores from the relevant subset of the double-annotation study.
As a second evaluation of this verifier, we assessed the Portuguese part-of-speech annotations with respect to Spanish annotations using a 181-token subset of the PUD datasets (McDonald et al., 2013) and found a high kappa score of 0.881.

Translate and Verify
A bilingual Hindi-Gujarati annotator translated chapters 4 and 5 of the Gujarati version of The Little Prince into Hindi. The translations were generated such that adpositions were conserved and mapped to their respective Hindi counterparts. (This is unlike the setting of §5.2, which used an existing translation, due to which many adpositions were lost in translation.) The translated sentences were subsequently annotated by a Hindi SNACS expert annotator. We asked the Hindi expert to flag any ungrammatical translations, which were then excluded. We observe that the targets selected by the Gujarati and Hindi annotators matched 83.0% of the time. That is, the two annotators largely assigned labels to the same tokens (up to translation). For the tokens identified as targets in both the Gujarati and Hindi annotations, we observe kappa scores of 0.802 and 0.837 for scene roles and functions, respectively.

Verification Using Non-expert Annotators
Our experiments for this strategy (§4.4) focus on the scene role identification task. Our goal is to set up an annotation task that a native speaker of the target language (Gujarati) can perform without having to read the annotation guidelines. To do so, we provide the annotator with an instance of a sentence with a highlighted adposition, and ask two questions: (1) Given four sentence choices that use the same adposition, which sentence employs the adposition to convey the relation most similar to the one conveyed by the highlighted adposition? (Task 1). (2) Given four supersense definitions for the adposition (attested in the annotated corpus), which one most closely resembles the sense in which the highlighted adposition is used? (Task 2). Appendix B shows the task instructions and example questions. We consider the first question (Task 1) as the priming task for scene role selection (Task 2).
We emphasize that this setup is different from crowd-sourcing of labels: the singly annotated data dictates the choice of the four sentence options and definition choices. The task is closer to annotation verification than to a fresh round of annotation.
A non-expert native Gujarati speaker performed these annotations. As mentioned earlier, one annotation instance consists of a sentence with a highlighted adposition, which serves as the query sentence for the Task 1 and Task 2 questions. To construct the data for this task, we choose one representative sentence from every adposition-supersense pair to act as a query sentence. Only adpositions that have at least four different supersense annotations were considered. We measured agreement between the original annotations and those from the non-expert speaker. We found that the exact agreement for Task 1 and the kappa score for Task 2 were 51.4% and 0.547, respectively. We repeat this experiment with the original Gujarati expert to verify intra-annotator agreement. This shows the internal consistency of an annotator on the same task. To emulate non-expert conditions, no external resources were consulted. We observe an accuracy of 87.2% on Task 1 and a kappa score of 0.867 for Task 2, which shows reasonable consistency of the expert on a harder split and under stricter conditions.
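The selection of query sentences described above can be sketched as follows, assuming annotations arrive as (sentence, adposition, supersense) triples (a hypothetical input format):

```python
from collections import defaultdict

def build_query_sentences(annotations, min_senses=4):
    """Pick one representative sentence per adposition-supersense pair,
    keeping only adpositions attested with at least `min_senses` supersenses
    (needed to form four answer choices)."""
    reps = defaultdict(dict)  # adposition -> {supersense: representative sentence}
    for sentence, adposition, supersense in annotations:
        # setdefault keeps the first sentence seen for each pair.
        reps[adposition].setdefault(supersense, sentence)
    return {adp: senses for adp, senses in reps.items()
            if len(senses) >= min_senses}
```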

Commentary on Results
In this section, we compare our double annotation study with the results from the verifiers.
Using Contextual Representations. As discussed in §5.1, we find a favorable number of clusters in both the original and random settings for the Gujarati SNACS experiment. Subsequently, we get good CRA scores across all representations, indicating high agreement. We argue that obtaining high (and identical) affinity with one representation may be a statistical accident, but obtaining high affinity over multiple representations is unlikely to be a mere coincidence (subject, of course, to the caveat that many representations are pretrained on similar datasets). This lends credence to the hypothesis that the annotations are linguistically meaningful.
With CRA, we propose a new verification score. Note that this score is not intended to replace metrics like Cohen's kappa, nor is it a panacea for the difficulties of single-annotator settings. Instead, it provides a new dimension for annotation verification. To validate the metric itself, we show results on several datasets and present an additional study comparing kappa and CRA. Admittedly, it requires monolingual or multilingual embeddings. If a language is not yet represented in such embeddings, one can augment existing pre-trained embeddings via continued pre-training with a small amount of unlabeled text (Ebrahimi and Kann, 2021).
Using Cognate Language Annotation & Translate and Verify. Both methods rely on two sets of expert annotations, albeit across two cognate languages. We see high kappa scores in both experiments, comparable to the double-annotation scores. These methods can be useful for high-resource/low-resource language pairs.
Verification Using Non-expert Annotators. Our results suggest that this method underperforms with respect to double annotation. This is expected because we use non-experts. The method also relies on the ability to simplify a task for non-experts, which might not be straightforward. However, we note that our observed kappa scores still fall at the higher end of the 'moderate' agreement category according to Landis and Koch (1977).
Discussion and Related Work
Dataset quality. Datasets have been evaluated along various dimensions, e.g., annotation artifacts (e.g., Gururangan et al., 2018) and demographic biases (Bordia and Bowman, 2019; Barikeri et al., 2021). These efforts often use external resources to define evaluation techniques. We can draw parallels between such work and ours in that we use external resources to validate a dimension of dataset quality, i.e., human agreement. Prior work distinguishes two annotation paradigms: (i) prescriptive, where annotators adhere to a prescribed set of guidelines, and (ii) descriptive, which encourages annotators to follow their subjective opinions during annotation. Both IAA and our verifiers are applicable only in the prescriptive scenario.
Labels in a Low-Resource Scenario. The problem of data collection in low-resource settings is not new. Recently, Hedderich et al. (2021) presented a survey on low-resource data collection, discussing a range of methods from data augmentation (Feng et al., 2021) and distant supervision (Mintz et al., 2009; Ratner et al., 2017) to cross-lingual annotation projection (Plank and Agić, 2018). Active learning (Settles, 2009) can also be useful for efficient annotation. Such methods are meant to facilitate annotation. While these methods are important, we seek to answer a different question: How does one measure annotation quality with exactly one expert annotator?
Gujarati Adpositions. Only a small body of work exists on Gujarati postpositions. Tessitori (1913) traces the origins of the dative and genitive markers in Gujarati to Old Western Rajasthani, and also attempts to explain the use of prototypical dative markers in agentive roles. Turner (1914) argues against that theory. Tisdall (1892) and Doctor (2004) provide extensive lists of postpositions along with conditions for valid syntactic usage.
Our methods should not be seen as replacing multi-annotator efforts, although in such setups, our methods can act as supplementary verifiers.

Conclusions
An inter-annotator agreement study is an essential checklist item before the release of human-curated datasets. But inter-annotator agreement cannot be computed when we have only one annotator. We address the open question of verifying singly annotated datasets with a new paradigm that exploits ancillary resources as weak surrogates for other annotators. The intuition is that each such verifier provides hints about the dataset quality, and their cumulative success is more likely to point to a good dataset.
We presented four verification strategies that operate in this paradigm, and a new agreement metric (CONTEXTUAL REPRESENTATION AFFINITY). We also created the first semantically focused dataset of adpositions in Gujarati, a low-resource language, in the single-annotator setting. We showed that our verification strategies, when instantiated for the new dataset, are promising and concur with a traditional double-annotation study.

Limitations
We do not propose a solution for extremely low-resource languages, where neither unlabeled text for building language models nor native speakers are readily available. Examples of such languages include Muscogee, with about 4,500 native speakers and 325 articles in the Muscogee-language Wikipedia, and Arapaho, with about 1,000 speakers and no Wikipedia articles. In such cases, finding even a single expert annotator might be difficult. The development of resources in such languages, however, does not necessarily rest purely on technological factors.
On the technical side, DIRECTPROBE relies on the fact that a representation can be generated for the instance to be annotated. However, obtaining an effective representation for structured annotations (e.g., frames, dialogue states, tables, etc.) is nontrivial. While this is a problem, it is orthogonal to our contributions.

A Additional Verifier Experiments
A.1 CRA Scores across Randomization
In Figure 4, we show the change in the number of clusters with increasing randomization. Furthermore, we show that the CRA scores decrease with an increasing amount of noise in the Gujarati function annotations in Figure 5.

A.2 Complete CRA Results on Gujarati SNACS
The complete set of results on all models and their variants is given in Table 5.

B Native Speaker Verification Instructions and Examples
We show screenshots of the instructions in Figure 6 and an example question in Figure 7. We show the most frequent supersense assignments and the most frequently occurring construals in Table 6. Table 7 shows label entropies of frequently occurring adpositions.

C.2 Notable Examples and Target Assumptions
Here, we discuss certain examples which are present in Gujarati but are not seen in Hindi. We also point to some linguistic issues and their consequent assumptions. Hindi, in such cases, uses the se adposition, which is fairly polysemous and can be used in an Ablative, Comitative, or Instrumentative case (Arora et al., 2022). The corresponding Hindi translations would be "chābī se darvāzā kholnā" ((2-a) and (3-a)) and "Sam se ūṁcā Mark" ((2-b) and (3-b)).
Target Selection Assumptions.
1. Certain tokens like vīshe (about) and kāraṇe (due_to) are used to convey the TOPIC and EXPLANATION relations. However, etymologically, these can be broken down further. We choose to annotate the entire token, given that they appear on the list of postpositions mentioned in Tisdall (1892).
2. Gujarati is also notorious for compound adposition constructions. In certain cases, it is possible to separate out the semantic contributions of the constituent adpositions. Take the instance in example (4). The bolded adposition taraph-thī contains two adpositions: taraph and the ablative thī. Hence, a DIRECTION and SOURCE annotation would be appropriate for the respective adpositions. On the other hand, take the sentence in example (5). Here, the compound postposition -nī_bājumāṁthī contains three component adpositions: the locatives -nī_bāju and māṁ, and the ablative thī. However, making distinctions between the semantic contributions of each of these adpositions is not straightforward. Hence, we avoid the complexities of breaking compound postpositions and assign a single set of labels to the entire expression.

C.3 Double Annotation Disagreements
In this section, we highlight a few examples on which the experts disagreed. Of the 254 adpositions annotated for the double-annotation study, only 33 had disagreements on the scene role, the function, or both.
(6) vārtā parīkathānī māfaq sharū qarvānuṁ
story fairytale.GEN COMP start
"start the story like/as a fairytale"
In (6), the annotators disagreed on the scene role between COMPARISONREF and MANNER. One can argue that the former is appropriate, as the fairytale acts as a reference point. On the other hand, the phrase can be seen as describing a particular way in which the story is started, and hence MANNER. In (7) (sheep.ACC take.CONJ walk.PRF, "took the sheep and walked"), the annotators differ, with THEME and ANCILLARY annotated as scene roles. If the sheep is seen as an accompanier in the action of walking, it qualifies as an ANCILLARY. If not, the sheep can be seen as the undergoer of the action and hence a THEME.

C.4 Baseline Models
As done for Hindi (Arora et al., 2022), we frame the task as a sequence tagging problem, with the supersenses decorated by the BIO scheme. All adpositions are pre-tokenized (separated from their objects) using an existing list of adpositions curated from Gujarati grammar books. This is to ensure that object embeddings are not utilized during classification. The models are trained end-to-end without gold adpositions being provided; that is, they have to both identify and label adpositions. All models use token representations from pre-trained contextualized embeddings as inputs to a linear layer, followed by a CRF layer that predicts the BIO labels. Additionally, we trained a classifier to predict supersenses given gold adpositions. This linear classifier uses the mean-pooled target adposition embeddings to predict a probability distribution over the label set. Appendix C.5 gives additional details about the experimental setup.
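To make the BIO decoration concrete, a minimal sketch of turning labeled adposition spans into per-token tags might look like the following (the helper name, token strings, and span format are illustrative assumptions, not the authors' code):

```python
def bio_tags(tokens, spans):
    """Decorate labeled adposition spans with the BIO scheme.

    tokens: list of tokens in the sentence.
    spans:  list of (start, end, label) triples with end exclusive;
            a multi-token span models a compound postposition that
            receives a single set of labels.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags
```

For example, a two-token compound postposition labeled Direction at positions 1-2 of a four-token sentence yields `["O", "B-Direction", "I-Direction", "O"]`; the CRF layer then predicts tags of exactly this form.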
We conduct experiments using six multilingual models, three of which focus on Indic languages.

Figure 1: DIRECTPROBE clustering results for Scene Roles over varying randomization.

Figure 2: CRA scores for Scene Roles over varying randomization in the data. The C_org values are those shown in Figure 1 for the various randomizations.

Figure 3: Correlation plot between CRA and Cohen's kappa on the TweetNLP dataset. The Pearson correlation between CRA and kappa is 0.915.
Swayamdipta et al. (2020) offer a relevant viewpoint by using training dynamics to analyze the difficulty of learning individual examples in a dataset. Verifiers for Prescriptive Annotation. Rottger et al. (2022) point out two annotation paradigms:

Figure 5: CRA scores for Functions over varying randomization.

Figure 6: Native Speaker Verification Task instructions. Rough translations of the sentences: A: "The pen is in the box.", B: "I am interested in languages.", and C: "The duck is swimming in the lake."

Table 1: Dataset statistics. The numbers in parentheses denote the number of distinct targets/construals.

Table 3: CRA scores for Gujarati SNACS. We use the entire dataset (3765 targets) for this study. C_rand values are rounded to the nearest integer. Due to space constraints, results for the base variants are moved to the appendix.

Table 4: CONTEXTUAL REPRESENTATION AFFINITY scores for SNLI, English SNACS, and Estonian UPoS. C_rand values are rounded to the nearest integer. SR: Scene Role, Fx: Function, L: # labels.
how Cohen's kappa and CRA vary with different amounts of annotation noise. Hovy et al. (

Table 5: CRA scores for Gujarati SNACS. We use the entire dataset (3765 targets) for this study. C_rand values are rounded to the nearest integer.

Table 7: Entropy of labels by adposition. Only adpositions with a minimum count of 30 were considered.
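The per-adposition label entropy of Table 7 is the Shannon entropy of the label distribution observed for that adposition. A minimal sketch (the function name and the example label lists are illustrative assumptions, not the paper's code or data):

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (in bits) of the label distribution for one adposition.

    A perfectly unambiguous adposition (one label) has entropy 0; an
    adposition split evenly over two labels has entropy 1 bit.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Higher entropy indicates a more polysemous adposition, i.e., one whose annotated supersense labels are spread across many categories.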