Aryaman Arora


2022

Computational Historical Linguistics and Language Diversity in South Asia
Aryaman Arora | Adam Farris | Samopriya Basu | Suresh Kolichala
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

South Asia is home to a plethora of languages, many of which severely lack access to new language technologies. This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics, fields which necessitate the gathering of extensive data from many languages. We claim that data scatteredness (rather than scarcity) is the primary obstacle in the development of South Asian language technology, and suggest that the study of language history is uniquely aligned with surmounting this obstacle. We review recent developments in and at the intersection of South Asian NLP and historical-comparative linguistics, describing our and others’ current efforts in this area. We also offer new strategies towards breaking the data barrier.

Estimating the Entropy of Linguistic Distributions
Aryaman Arora | Clara Meister | Ryan Cotterell
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Shannon entropy is often a quantity of interest to linguists studying the communicative capacity of human language. However, entropy must typically be estimated from observed data because researchers do not have access to the underlying probability distribution. While entropy estimation is a well-studied problem in other fields, there is not yet a comprehensive exploration of the efficacy of entropy estimators for use with linguistic data. In this work, we fill this void, studying the empirical effectiveness of different entropy estimators for linguistic distributions. In a replication of two recent information-theoretic linguistic studies, we find evidence that the reported effect size is over-estimated due to over-reliance on poor entropy estimators. We end this paper with a concrete recommendation for the entropy estimators that should be used in future linguistic studies.

The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Khuyagbaatar Batsuren | Gábor Bella | Aryaman Arora | Viktor Martinovic | Kyle Gorman | Zdeněk Žabokrtský | Amarsanaa Ganbold | Šárka Dohnalová | Magda Ševčíková | Kateřina Pelegrinová | Fausto Giunchiglia | Ryan Cotterell | Ekaterina Vylomova
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams; the best system averaged a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams; the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support future studies, we released all system predictions, the evaluation script, and all gold standard datasets.

SIGMORPHON–UniMorph 2022 Shared Task 0: Generalization and Typologically Diverse Morphological Inflection
Jordan Kodner | Salam Khalifa | Khuyagbaatar Batsuren | Hossep Dolatian | Ryan Cotterell | Faruk Akkus | Antonios Anastasopoulos | Taras Andrushko | Aryaman Arora | Nona Atanalov | Gábor Bella | Elena Budianskaya | Yustinus Ghanggo Ate | Omer Goldman | David Guriel | Simon Guriel | Silvia Guriel-Agiashvili | Witold Kieraś | Andrew Krizhanovsky | Natalia Krizhanovsky | Igor Marchenko | Magdalena Markowska | Polina Mashkovtseva | Maria Nepomniashchaya | Daria Rodionova | Karina Scheifer | Alexandra Sorova | Anastasia Yemelina | Jeremiah Young | Ekaterina Vylomova
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

The 2022 SIGMORPHON–UniMorph shared task on large-scale morphological inflection generation included a wide range of typologically diverse languages, 33 in total from 11 top-level language families: Arabic (Modern Standard), Assamese, Braj, Chukchi, Eastern Armenian, Evenki, Georgian, Gothic, Gujarati, Hebrew, Hungarian, Itelmen, Karelian, Kazakh, Ket, Khalkha Mongolian, Kholosi, Korean, Lamahalot, Low German, Ludic, Magahi, Middle Low German, Old English, Old High German, Old Norse, Polish, Pomak, Slovak, Turkish, Upper Sorbian, Veps, and Xibe. We emphasize generalization along different dimensions this year by evaluating test items with unseen lemmas and unseen features separately under small and large training conditions. Across the five submitted systems and two baselines, the prediction of inflections with unseen features proved challenging, with average performance decreasing substantially from last year. This was true even for languages for which the forms were in principle predictable, which suggests that further work is needed in designing systems that capture the various types of generalization required for the world’s languages.

2021

SNACS Annotation of Case Markers and Adpositions in Hindi
Aryaman Arora | Nitin Venkateswaran | Nathan Schneider
Proceedings of the Society for Computation in Linguistics 2021

Bhāṣācitra: Visualising the dialect geography of South Asia
Aryaman Arora | Adam Farris | Gopalakrishnan R | Samopriya Basu
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

We present Bhāṣācitra, a dialect mapping system for South Asia built on a database of linguistic studies of languages of the region annotated for topic and location data. We analyse language coverage and look towards applications to typology by visualising example datasets. The application is not only meant to be useful for feature mapping, but also serves as a new kind of interactive bibliography for linguists of South Asian languages.

For the Purpose of Curry: A UD Treebank for Ashokan Prakrit
Adam Farris | Aryaman Arora
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

2020

PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English
Michael Kranzlein | Emma Manning | Siyao Peng | Shira Wein | Aryaman Arora | Nathan Schneider
Proceedings of the 14th Linguistic Annotation Workshop

We present the Prepositions Annotated with Supersense Tags in Reddit International English (“PASTRIE”) corpus, a new dataset containing manually annotated preposition supersenses of English data from presumed speakers of four L1s: English, French, German, and Spanish. The annotations are comprehensive, covering all preposition types and tokens in the sample. Along with the corpus, we provide analysis of distributional patterns across the included L1s and a discussion of the influence of L1s on L2 preposition choice.

Supervised Grapheme-to-Phoneme Conversion of Orthographic Schwas in Hindi and Punjabi
Aryaman Arora | Luke Gessler | Nathan Schneider
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Hindi grapheme-to-phoneme (G2P) conversion is mostly trivial, with one exception: whether a schwa represented in the orthography is pronounced or unpronounced (deleted). Previous work has attempted to predict schwa deletion in a rule-based fashion using prosodic or phonetic analysis. We present the first statistical schwa deletion classifier for Hindi, which relies solely on the orthography as the input and outperforms previous approaches. We trained our model on a newly compiled pronunciation lexicon extracted from various online dictionaries. Our best Hindi model achieves state-of-the-art performance, and also achieves good performance on a closely related language, Punjabi, without modification.