Shruti Rijhwani


2020

pdf bib
AlloVera: A Multilingual Allophone Database
David R. Mortensen | Xinjian Li | Patrick Littell | Alexis Michaud | Shruti Rijhwani | Antonios Anastasopoulos | Alan W Black | Florian Metze | Graham Neubig
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription. AlloVera allows the training of speech recognition models that output phonetic transcriptions in the International Phonetic Alphabet (IPA), regardless of the input language. We show that a “universal” allophone model, Allosaurus, built with AlloVera, outperforms “universal” phonemic models and language-specific models on a speech-transcription task. We explore the implications of this technology (and related technologies) for the documentation of endangered and minority languages. We further explore other applications for which AlloVera will be suitable as it grows, including phonological typology.

pdf bib
Improving Candidate Generation for Low-resource Cross-lingual Entity Linking
Shuyan Zhou | Shruti Rijhwani | John Wieting | Jaime Carbonell | Graham Neubig
Transactions of the Association for Computational Linguistics, Volume 8

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.1

pdf bib
Temporally-Informed Analysis of Named Entity Recognition
Shruti Rijhwani | Daniel Preotiuc-Pietro
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural language processing models often have to make predictions on text data that evolves over time as a result of changes in language use or the information described in the text. However, evaluation results on existing data sets are seldom reported by taking the timestamp of the document into account. We analyze and propose methods that make better use of temporally-diverse training data, with a focus on the task of named entity recognition. To support these experiments, we introduce a novel data set of English tweets annotated with named entities. We empirically demonstrate the effect of temporal drift on performance, and how the temporal information of documents can be used to obtain better models compared to those that disregard temporal information. Our analysis gives insights into why this information is useful, in the hope of informing potential avenues of improvement for named entity recognition as well as other NLP tasks under similar experimental setups.

pdf bib
Soft Gazetteers for Low-Resource Named Entity Recognition
Shruti Rijhwani | Shuyan Zhou | Graham Neubig | Jaime Carbonell
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Traditional named entity recognition models use gazetteers (lists of entities) as features to improve performance. Although modern neural network models do not require such hand-crafted features for strong performance, recent work has demonstrated their utility for named entity recognition on English data. However, designing such features for low-resource languages is challenging, because exhaustive entity gazetteers do not exist in these languages. To address this problem, we propose a method of “soft gazetteers” that incorporates ubiquitously available information from English knowledge bases, such as Wikipedia, into neural named entity recognition models through cross-lingual entity linking. Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.

pdf bib
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Shruti Rijhwani | Jiangming Liu | Yizhong Wang | Rotem Dror
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

pdf bib
OCR Post Correction for Endangered Language Texts
Shruti Rijhwani | Antonios Anastasopoulos | Graham Neubig
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.

pdf bib
A Summary of the First Workshop on Language Technology for Language Documentation and Revitalization
Graham Neubig | Shruti Rijhwani | Alexis Palmer | Jordan MacKenzie | Hilaria Cruz | Xinjian Li | Matthew Lee | Aditi Chaudhary | Luke Gessler | Steven Abney | Shirley Anugrah Hayati | Antonios Anastasopoulos | Olga Zamaraeva | Emily Prud’hommeaux | Jennette Child | Sara Child | Rebecca Knowles | Sarah Moeller | Jeffrey Micher | Yiyuan Li | Sydney Zink | Mengzhou Xia | Roshan S Sharma | Patrick Littell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

2019

pdf bib
Towards Zero-resource Cross-lingual Entity Linking
Shuyan Zhou | Shruti Rijhwani | Graham Neubig
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

Cross-lingual entity linking (XEL) grounds named entities in a source language to an English Knowledge Base (KB), such as Wikipedia. XEL is challenging for most languages because of limited availability of requisite resources. However, many works on XEL have been on simulated settings that actually use significant resources (e.g. source language Wikipedia, bilingual entity maps, multilingual embeddings) that are not available in truly low-resource languages. In this work, we first examine the effect of these resource assumptions and quantify how much the availability of these resource affects overall quality of existing XEL systems. We next propose three improvements to both entity candidate generation and disambiguation that make better use of the limited resources we do have in resource-scarce scenarios. With experiments on four extremely low-resource languages, we show that our model results in gains of 6-20% end-to-end linking accuracy.

pdf bib
Choosing Transfer Languages for Cross-Lingual Learning
Yu-Hsiang Lin | Chian-Yu Chen | Jean Lee | Zirui Li | Yuyan Zhang | Mengzhou Xia | Shruti Rijhwani | Junxian He | Zhisong Zhang | Xuezhe Ma | Antonios Anastasopoulos | Patrick Littell | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Cross-lingual transfer, where a high-resource transfer language is used to improve the accuracy of a low-resource task language, is now an invaluable tool for improving performance of natural language processing (NLP) on low-resource languages. However, given a particular task language, it is not clear which language to transfer from, and the standard strategy is to select languages based on ad hoc criteria, usually the intuition of the experimenter. Since a large number of features contribute to the success of cross-lingual transfer (including phylogenetic similarity, typological properties, lexical overlap, or size of available data), even the most enlightened experimenter rarely considers all these factors for the particular task at hand. In this paper, we consider this task of automatically selecting optimal transfer languages as a ranking problem, and build models that consider the aforementioned features to perform this prediction. In experiments on representative NLP tasks, we demonstrate that our model predicts good transfer languages much better than ad hoc baselines considering single features in isolation, and glean insights on what features are most informative for each different NLP tasks, which may inform future ad hoc selection even without use of our method.

2018

pdf bib
Parser combinators for Tigrinya and Oromo morphology
Patrick Littell | Tom McCoy | Na-Rae Han | Shruti Rijhwani | Zaid Sheikh | David Mortensen | Teruko Mitamura | Lori Levin
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology-Based Representations
Paul Michel | Abhilasha Ravichander | Shruti Rijhwani
Proceedings of the 2nd Workshop on Representation Learning for NLP

We investigate the pertinence of methods from algebraic topology for text data analysis. These methods enable the development of mathematically-principled isometric-invariant mappings from a set of vectors to a document embedding, which is stable with respect to the geometry of the document in the selected metric space. In this work, we evaluate the utility of these topology-based document representations in traditional NLP tasks, specifically document clustering and sentiment classification. We find that the embeddings do not benefit text analysis. In fact, performance is worse than simple techniques like tf-idf, indicating that the geometry of the document does not provide enough variability for classification on the basis of topic or sentiment in the chosen datasets.

pdf bib
Code-Switching as a Social Act: The Case of Arabic Wikipedia Talk Pages
Michael Yoder | Shruti Rijhwani | Carolyn Rosé | Lori Levin
Proceedings of the Second Workshop on NLP and Computational Social Science

Code-switching has been found to have social motivations in addition to syntactic constraints. In this work, we explore the social effect of code-switching in an online community. We present a task from the Arabic Wikipedia to capture language choice, in this case code-switching between Arabic and other languages, as a predictor of social influence in collaborative editing. We find that code-switching is positively associated with Wikipedia editor success, particularly borrowing technical language on pages with topics less directly related to Arabic-speaking regions.

pdf bib
Estimating Code-Switching on Twitter with a Novel Generalized Word-Level Language Detection Technique
Shruti Rijhwani | Royal Sequiera | Monojit Choudhury | Kalika Bali | Chandra Shekhar Maddila
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Word-level language detection is necessary for analyzing code-switched text, where multiple languages could be mixed within a sentence. Existing models are restricted to code-switching between two specific languages and fail in real-world scenarios as text input rarely has a priori information on the languages used. We present a novel unsupervised word-level language detection technique for code-switched text for an arbitrarily large number of languages, which does not require any manually annotated training data. Our experiments with tweets in seven languages show a 74% relative error reduction in word-level labeling with respect to competitive baselines. We then use this system to conduct a large-scale quantitative analysis of code-switching patterns on Twitter, both global as well as region-specific, with 58M tweets.

2016

pdf bib
Understanding Language Preference for Expression of Opinion and Sentiment: What do Hindi-English Speakers do on Twitter?
Koustav Rudra | Shruti Rijhwani | Rafiya Begum | Kalika Bali | Monojit Choudhury | Niloy Ganguly
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing