Siva Reddy


pdf bib
Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions
Arjun Akula | Spandana Gella | Keze Wang | Song-Chun Zhu | Siva Reddy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural module networks (NMN) are a popular approach for grounding visual referring expressions. Prior implementations of NMN use pre-defined and fixed textual inputs in their module instantiation. This necessitates a large number of modules, as they lack the ability to share weights and exploit associations between similar textual contexts (e.g. “dark cube on the left” vs. “black cube on the left”). In this work, we address these limitations and evaluate the impact of contextual clues in improving the performance of NMN models. First, we address the problem of fixed textual inputs by parameterizing the module arguments. This substantially reduces the number of modules in NMN, by up to 75%, without any loss in performance. Next, we propose a method to contextualize our parameterized model to enhance the module’s capacity to exploit visiolinguistic associations. Our model outperforms the state-of-the-art NMN model on the CLEVR-Ref+ dataset, with a +8.1% improvement in accuracy on the single-referent test set and +4.3% on the full test set. Additionally, we demonstrate that contextualization provides +11.2% and +1.7% improvements in accuracy over prior NMN models on CLOSURE and NLVR2, respectively. We further evaluate the impact of our contextualization by constructing a contrast set for CLEVR-Ref+, which we call CC-Ref+. We significantly outperform the baselines, by as much as +10.4% absolute accuracy, on CC-Ref+, illustrating the generalization skills of our approach.
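
To make the parameterization idea concrete, here is a minimal PyTorch sketch (ours, not the authors' code) of a single shared Filter-style module that takes a text-argument embedding as input instead of being instantiated separately per phrase; all class and variable names are invented for illustration.

```python
import torch
import torch.nn as nn

class ParameterizedFilterModule(nn.Module):
    """One shared Filter module whose behaviour is conditioned on a
    text-argument embedding, rather than one fixed module per phrase
    (so "dark cube" and "black cube" can share weights)."""

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        self.project = nn.Linear(txt_dim, vis_dim)
        self.score = nn.Linear(vis_dim, 1)

    def forward(self, vis_feats: torch.Tensor, txt_arg: torch.Tensor) -> torch.Tensor:
        # vis_feats: (num_regions, vis_dim); txt_arg: (txt_dim,)
        gate = self.project(txt_arg)                 # condition on the text argument
        fused = vis_feats * gate.unsqueeze(0)        # element-wise modulation per region
        return torch.sigmoid(self.score(fused)).squeeze(-1)  # attention over regions

# Toy usage: 5 image regions, 64-d visual features, 32-d text argument.
module = ParameterizedFilterModule(vis_dim=64, txt_dim=32)
attention = module(torch.randn(5, 64), torch.randn(32))
print(attention.shape)  # torch.Size([5])
```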

pdf bib
Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval
Devang Kulshreshtha | Robert Belfer | Iulian Vlad Serban | Siva Reddy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA). While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between the target-domain and synthetic data distributions, and reduces model overfitting to the source domain. We run UDA experiments on question generation and passage retrieval from the Natural Questions domain to the machine learning and biomedical domains. We find that back-training vastly outperforms self-training, with a mean improvement of 7.8 BLEU-4 points on generation and 17.6% top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset, MLQuestions, containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.
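
A toy sketch of the contrast between the two data-generation directions, under our own simplified reading of the abstract; the function names and the stand-in models are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

def self_training_pairs(natural_inputs: Iterable[str],
                        predict: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Self-training: natural target-domain inputs paired with noisy,
    model-predicted outputs."""
    return [(x, predict(x)) for x in natural_inputs]

def back_training_pairs(natural_outputs: Iterable[str],
                        generate_input: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Back-training: natural target-domain outputs paired with noisy,
    model-generated inputs, so the supervision side stays clean."""
    return [(generate_input(y), y) for y in natural_outputs]

# Toy usage with stand-in "models" (plain lambdas here).
print(self_training_pairs(["passage A"], lambda x: f"question about {x}"))
print(back_training_pairs(["question B"], lambda y: f"passage for {y}"))
```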

pdf bib
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu | Emanuele Bugliarello | Edoardo Maria Ponti | Siva Reddy | Nigel Collier | Desmond Elliott
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.
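
As a rough illustration of the task format only (field names are our own guess, not the dataset's actual schema), a MaRVL-style example pairs two images with a native-speaker statement and a binary label:

```python
from dataclasses import dataclass

@dataclass
class MarvlExample:
    """One MaRVL-style item as described above: a native-speaker statement
    grounded in a pair of images, labelled true or false."""
    left_image: str    # path or URL to the first image
    right_image: str   # path or URL to the second image
    statement: str     # statement in one of the five languages
    label: bool        # does the statement hold for the image pair?

# "Both images show a cat." (Indonesian)
example = MarvlExample("img/left.jpg", "img/right.jpg",
                       "Kedua gambar menunjukkan kucing.", True)
print(example.label)
```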

pdf bib
Minimax and Neyman–Pearson Meta-Learning for Outlier Languages
Edoardo Maria Ponti | Rahul Aralikatte | Disha Shrivastava | Siva Reddy | Anders Søgaard
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
Understanding by Understanding Not: Modeling Negation in Language Models
Arian Hosseini | Siva Reddy | Dzmitry Bahdanau | R Devon Hjelm | Alessandro Sordoni | Aaron Courville
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Negation is a core construction in natural language. Despite being very successful on many tasks, state-of-the-art pre-trained language models often handle negation incorrectly. To improve language models in this regard, we propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences from a raw text corpus. By training BERT with the resulting combined objective, we reduce the mean top-1 error rate to 4% on the negated LAMA dataset. We also see some improvements on the negated NLI benchmarks.
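
A hedged sketch of what such a combined objective could look like; this is a generic unlikelihood formulation, not necessarily the paper's exact loss, and the weighting factor alpha is our own addition.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits_pos, targets_pos, logits_neg, targets_neg, alpha=1.0):
    """Likelihood on ordinary sentences plus an unlikelihood term that
    pushes probability away from the original completion when the sentence
    is negated. Shapes: logits (batch, vocab), targets (batch,)."""
    ll = F.cross_entropy(logits_pos, targets_pos)
    p_neg = F.softmax(logits_neg, dim=-1).gather(1, targets_neg.unsqueeze(1)).squeeze(1)
    ul = -torch.log1p(-p_neg.clamp(max=1 - 1e-6)).mean()  # -log(1 - p)
    return ll + alpha * ul

# Toy usage: vocabulary of 10, batch of 2.
logits = torch.randn(2, 10)
targets = torch.tensor([3, 7])
print(combined_loss(logits, targets, logits.clone(), targets).item())
```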

pdf bib
Explicitly Modeling Syntax in Language Models with Incremental Parsing and a Dynamic Oracle
Yikang Shen | Shawn Tan | Alessandro Sordoni | Siva Reddy | Aaron Courville
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Syntax is fundamental to our thinking about language. Failing to capture the structure of input language could lead to generalization problems and over-parametrization. In the present work, we propose a new syntax-aware language model: Syntactic Ordered Memory (SOM). The model explicitly models the structure with an incremental parser and maintains the conditional probability setting of a standard language model (left-to-right). To train the incremental parser and avoid exposure bias, we also propose a novel dynamic oracle, so that SOM is more robust to wrong parsing decisions. Experiments show that SOM can achieve strong results in language modeling, incremental parsing, and syntactic generalization tests while using fewer parameters than other models.

pdf bib
StereoSet: Measuring stereotypical bias in pretrained language models
Moin Nadeem | Anna Bethke | Siva Reddy
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A stereotype is an over-generalized belief about a particular group of people, e.g., Asians are good at math or African Americans are athletic. Such beliefs (biases) are known to hurt target groups. Since pretrained language models are trained on large real-world data, they are known to capture stereotypical biases. It is important to quantify to what extent these biases are present in them. Although this is a rapidly growing area of research, the existing literature falls short in two important respects: 1) it mainly evaluates the bias of pretrained language models on a small set of artificial sentences, even though these models are trained on natural data; 2) current evaluations measure bias without considering the language modeling ability of a model, which could lead to misplaced trust in a model even if it is a poor language model. We address both these problems. We present StereoSet, a large-scale natural English dataset to measure stereotypical biases in four domains: gender, profession, race, and religion. We contrast both the stereotypical bias and the language modeling ability of popular models like BERT, GPT-2, RoBERTa, and XLNet. We show that these models exhibit strong stereotypical biases. Our data and code are available at
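
Loosely illustrating the idea of measuring bias and language modeling ability together (the paper defines its own metrics; this sketch only mirrors the general recipe), one can count how often a model prefers a stereotypical variant over an anti-stereotypical one, and how often it prefers either meaningful variant over an unrelated one:

```python
from typing import Callable, List, Tuple

def bias_and_lm_scores(triples: List[Tuple[str, str, str]],
                       score: Callable[[str], float]) -> Tuple[float, float]:
    """triples: (stereotypical, anti-stereotypical, unrelated) variants;
    `score` is any sentence scorer, e.g. model log-likelihood. Returns
    (stereotype preference %, language-modeling preference %)."""
    stereo_pref, lm_pref = 0, 0
    for s, a, u in triples:
        if score(s) > score(a):
            stereo_pref += 1
        if max(score(s), score(a)) > score(u):
            lm_pref += 1
    n = len(triples)
    return 100 * stereo_pref / n, 100 * lm_pref / n

# Toy usage with a stand-in scorer (sentence length as "log-likelihood").
triples = [("girls are bad at math", "girls are good at math", "the sky is tasty")]
print(bias_and_lm_scores(triples, score=len))
```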

pdf bib
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)
Song Feng | Siva Reddy | Malihe Alikhani | He He | Yangfeng Ji | Mohit Iyyer | Zhou Yu
Proceedings of the 1st Workshop on Document-grounded Dialogue and Conversational Question Answering (DialDoc 2021)


pdf bib
MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining
Zhi Wen | Xing Han Lu | Siva Reddy
Proceedings of the 3rd Clinical Natural Language Processing Workshop

One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the limited availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.

pdf bib
Words Aren’t Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions
Arjun Akula | Spandana Gella | Yaser Al-Onaizan | Song-Chun Zhu | Siva Reddy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words alone are enough to identify the target object and word order doesn’t matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and one which doesn’t. Additionally, we create an out-of-distribution dataset, Ref-Adv, by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure, performing 12% to 23% below the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at
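
As a sketch of the contrastive idea only (the paper's actual objective may differ), a margin-based loss can push the model's score for the original expression above its score for the perturbed counterpart whose target object has changed:

```python
import torch
import torch.nn.functional as F

def contrastive_grounding_loss(score_orig: torch.Tensor,
                               score_perturbed: torch.Tensor,
                               margin: float = 0.5) -> torch.Tensor:
    """Margin loss: the correct (expression, region) pairing should score
    at least `margin` above the perturbed expression. The margin value is
    an invented hyperparameter for this sketch."""
    return F.relu(margin - (score_orig - score_perturbed)).mean()

# Toy usage: one original/perturbed score pair.
print(contrastive_grounding_loss(torch.tensor([2.0]), torch.tensor([1.8])).item())
```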

pdf bib
Learning Improvised Chatbots from Adversarial Modifications of Natural Language Feedback
Makesh Narsimhan Sreedhar | Kun Ni | Siva Reddy
Findings of the Association for Computational Linguistics: EMNLP 2020

The ubiquitous nature of dialogue systems and their interaction with users generate an enormous amount of data. Can we improve chatbots using this data? A self-feeding chatbot improves itself by asking for natural language feedback when a user is dissatisfied with its response and uses this feedback as an additional training sample. However, user feedback in most cases contains extraneous sequences, hindering its usefulness as a training sample. In this work, we propose a generative adversarial model that converts noisy feedback into a plausible natural response in a conversation. The generator’s goal is to convert the feedback into a response that answers the user’s previous utterance and to fool the discriminator, which distinguishes feedback from natural responses. We show that augmenting the original training data with these modified feedback responses improves the original chatbot performance from 69.94% to 75.96% in ranking correct responses on the PERSONACHAT dataset, a large improvement given that the original model is already trained on 131k samples.
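
A minimal sketch of one adversarial training step under this setup, assuming utterances are already encoded as fixed-size vectors; the linear "models" and the alternating-update scheme are our simplifications, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def adversarial_step(generator, discriminator, feedback, natural, g_opt, d_opt):
    """One alternating update: the discriminator learns to separate natural
    responses (label 1) from rewritten feedback (label 0); the generator
    learns to rewrite feedback so the discriminator labels it natural."""
    # Discriminator update (generator outputs are detached).
    fake = generator(feedback).detach()
    d_loss = (
        F.binary_cross_entropy_with_logits(discriminator(natural),
                                           torch.ones(len(natural), 1))
        + F.binary_cross_entropy_with_logits(discriminator(fake),
                                             torch.zeros(len(fake), 1))
    )
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: try to fool the discriminator.
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(generator(feedback)), torch.ones(len(feedback), 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()

# Toy usage: utterances stand in as 16-d vectors, linear "models".
gen, disc = torch.nn.Linear(16, 16), torch.nn.Linear(16, 1)
g_opt = torch.optim.Adam(gen.parameters())
d_opt = torch.optim.Adam(disc.parameters())
print(adversarial_step(gen, disc, torch.randn(4, 16), torch.randn(4, 16), g_opt, d_opt))
```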


pdf bib
Learning an Executable Neural Semantic Parser
Jianpeng Cheng | Siva Reddy | Vijay Saraswat | Mirella Lapata
Computational Linguistics, Volume 45, Issue 1 - March 2019

This article describes a neural semantic parser that maps natural language utterances onto logical forms that can be executed against a task-specific environment, such as a knowledge base or a database, to produce a response. The parser generates tree-structured logical forms with a transition-based approach, combining a generic tree-generation algorithm with a domain-general grammar defined by the logical language. The generation process is modeled by structured recurrent neural networks, which provide a rich encoding of the sentential context and generation history for making predictions. To tackle mismatches between natural language and logical form tokens, various attention mechanisms are explored. Finally, we consider different training settings for the neural semantic parser, including fully supervised training where annotated logical forms are given, weakly supervised training where denotations are provided, and distant supervision where only unlabeled sentences and a knowledge base are available. Experiments across a wide range of data sets demonstrate the effectiveness of our parser.
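
To illustrate transition-based generation of tree-structured logical forms, here is a toy transition system (the action names NT/TER/RED are ours, loosely echoing common transition-based generators, not necessarily the paper's action set):

```python
def apply_transitions(actions):
    """Build a nested-list "logical form" from a transition sequence.
    NT(x) opens a subtree labelled x, TER(x) adds a leaf x, and RED
    closes the current subtree."""
    stack = [[]]  # the outer list is a sentinel root
    for kind, symbol in actions:
        if kind == "NT":
            node = [symbol]
            stack[-1].append(node)
            stack.append(node)
        elif kind == "TER":
            stack[-1].append(symbol)
        elif kind == "RED":
            stack.pop()
    return stack[0][0]

# "answer(capital(Germany))" as a toy derivation.
tree = apply_transitions([("NT", "answer"), ("NT", "capital"),
                          ("TER", "Germany"), ("RED", None), ("RED", None)])
print(tree)  # ['answer', ['capital', 'Germany']]
```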

pdf bib
CoQA: A Conversational Question Answering Challenge
Siva Reddy | Danqi Chen | Christopher D. Manning
Transactions of the Association for Computational Linguistics, Volume 7

Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning). We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating that there is ample room for improvement. We present CoQA as a challenge to the community at


pdf bib
Learning Typed Entailment Graphs with Global Soft Constraints
Mohammad Javad Hosseini | Nathanael Chambers | Siva Reddy | Xavier R. Holt | Shay B. Cohen | Mark Johnson | Mark Steedman
Transactions of the Association for Computational Linguistics, Volume 6

This paper presents a new method for learning typed entailment graphs from text. We extract predicate-argument structures from multiple-source news corpora, and compute local distributional similarity scores to learn entailments between predicates with typed arguments (e.g., person contracted disease). Previous work has used transitivity constraints to improve local decisions, but these constraints are intractable on large graphs. We instead propose a scalable method that learns globally consistent similarity scores based on new soft constraints that consider both the structures across typed entailment graphs and inside each graph. Learning takes only a few hours to run over 100K predicates and our results show large improvements over local similarity scores on two entailment data sets. We further show improvements over paraphrases and entailments from the Paraphrase Database, and prior state-of-the-art entailment graphs. We show that the entailment graphs improve performance in a downstream task.


pdf bib
Learning Structured Natural Language Representations for Semantic Parsing
Jianpeng Cheng | Siva Reddy | Vijay Saraswat | Mirella Lapata
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce a neural semantic parser which is interpretable and scalable. Our model converts natural language utterances to intermediate, domain-general natural language representations in the form of predicate-argument structures, which are induced with a transition system and subsequently mapped to target domains. The semantic parser is trained end-to-end using annotated logical forms or their denotations. We achieve the state of the art on SPADES and GRAPHQUESTIONS and obtain competitive results on GEOQUERY and WEBQUESTIONS. The induced predicate-argument structures shed light on the types of representations useful for semantic parsing and how these are different from linguistically motivated ones.

pdf bib
Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks
Rajarshi Das | Manzil Zaheer | Siva Reddy | Andrew McCallum
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Existing question answering methods infer answers either from a knowledge base or from raw text. While knowledge base (KB) methods are good at answering compositional questions, their performance is often affected by the incompleteness of the KB. In contrast, web text contains millions of facts that are absent from the KB, albeit in an unstructured form. Universal schema can support reasoning over the union of both structured KBs and unstructured text by aligning them in a common embedded space. In this paper we extend universal schema to natural language question answering, employing memory networks to attend to the large body of facts in the combination of text and KB. Our models can be trained in an end-to-end fashion on question-answer pairs. Evaluation results on the SPADES fill-in-the-blank question answering dataset show that exploiting universal schema for question answering is better than using either a KB or text alone. This model also outperforms the current state-of-the-art by 8.5 F1 points.
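
A small sketch of the core attention step, assuming KB facts and text facts have already been embedded into one shared space as universal schema prescribes; the surrounding memory-network machinery (multiple hops, answer scoring) is omitted.

```python
import torch
import torch.nn.functional as F

def attend_memory(question: torch.Tensor, kb_memory: torch.Tensor,
                  text_memory: torch.Tensor) -> torch.Tensor:
    """Attend over the union of KB and text facts embedded in one space,
    returning an attention-weighted memory readout.
    Shapes: question (d,), memories (n, d)."""
    memory = torch.cat([kb_memory, text_memory], dim=0)  # unified fact memory
    scores = memory @ question                           # dot-product attention
    weights = F.softmax(scores, dim=0)
    return weights @ memory                              # (d,) readout

# Toy usage: 3 KB facts and 2 text facts in an 8-d shared space.
readout = attend_memory(torch.randn(8), torch.randn(3, 8), torch.randn(2, 8))
print(readout.shape)  # torch.Size([8])
```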

pdf bib
Universal Semantic Parsing
Siva Reddy | Oscar Täckström | Slav Petrov | Mark Steedman | Mirella Lapata
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Universal Dependencies (UD) offer a uniform cross-lingual syntactic representation, with the aim of advancing multilingual applications. Recent work shows that semantic parsing can be accomplished by transforming syntactic dependencies to logical forms. However, this work is limited to English, and cannot process dependency graphs, which allow handling complex phenomena such as control. In this work, we introduce UDepLambda, a semantic interface for UD, which maps natural language to logical forms in an almost language-independent fashion and can process dependency graphs. We perform experiments on question answering against Freebase and provide German and Spanish translations of the WebQuestions and GraphQuestions datasets to facilitate multilingual evaluation. Results show that UDepLambda outperforms strong baselines across languages and datasets. For English, it achieves a 4.9 F1 point improvement over the state-of-the-art on GraphQuestions.

pdf bib
Learning to Paraphrase for Question Answering
Li Dong | Jonathan Mallinson | Siva Reddy | Mirella Lapata
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Question answering (QA) systems are sensitive to the many different ways natural language expresses the same information need. In this paper we turn to paraphrases as a means of capturing this knowledge and present a general framework which learns felicitous paraphrases for various QA tasks. Our method is trained end-to-end using question-answer pairs as a supervision signal. A question and its paraphrases serve as input to a neural scoring model which assigns higher weights to linguistic expressions most likely to yield correct answers. We evaluate our approach on QA over Freebase and answer sentence selection. Experimental results on three datasets show that our framework consistently improves performance, achieving competitive results despite the use of simple QA models.
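
Sketching the central scoring idea under our own simplification: a learned score per paraphrase is softmax-normalized and used to weight the per-paraphrase answer evidence (the paper's actual model and end-to-end training signal are richer than this):

```python
import torch
import torch.nn.functional as F

def weighted_qa_score(paraphrase_scores: torch.Tensor,
                      answer_scores: torch.Tensor) -> torch.Tensor:
    """Combine per-paraphrase answer scores with learned paraphrase weights,
    up-weighting expressions most likely to yield correct answers.
    Shapes: (n_paraphrases,) each."""
    weights = F.softmax(paraphrase_scores, dim=0)
    return (weights * answer_scores).sum()

# Toy usage: 3 paraphrases of one question.
print(weighted_qa_score(torch.tensor([0.2, 1.5, -0.3]),
                        torch.tensor([0.1, 0.9, 0.4])).item())
```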

pdf bib
CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

pdf bib
Universal Dependencies to Logical Form with Negation Scope
Federico Fancellu | Siva Reddy | Adam Lopez | Bonnie Webber
Proceedings of the Workshop Computational Semantics Beyond Events and Roles

Many language technology applications would benefit from the ability to represent negation and its scope on top of widely-used linguistic resources. In this paper, we investigate the possibility of obtaining a first-order logic representation with negation scope marked using Universal Dependencies. To do so, we enhance UDepLambda, a framework that converts dependency graphs to logical forms. The resulting UDepLambda¬ is able to handle phenomena related to scope by means of a higher-order type theory, relevant not only to negation but also to universal quantification and other complex semantic phenomena. Our initial conversion for English is promising, in that the scope of negation can be represented even in the presence of more complex phenomena such as universal quantifiers.

pdf bib
Predicting Target Language CCG Supertags Improves Neural Machine Translation
Maria Nădejde | Siva Reddy | Rico Sennrich | Tomasz Dwojak | Marcin Junczys-Dowmunt | Philipp Koehn | Alexandra Birch
Proceedings of the Second Conference on Machine Translation


pdf bib
Paraphrase Generation from Latent-Variable PCFGs for Semantic Parsing
Shashi Narayan | Siva Reddy | Shay B. Cohen
Proceedings of the 9th International Natural Language Generation conference

pdf bib
Evaluating Induced CCG Parsers on Grounded Semantic Parsing
Yonatan Bisk | Siva Reddy | John Blitzer | Julia Hockenmaier | Mark Steedman
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Assessing Relative Sentence Complexity using an Incremental CCG Parser
Bharat Ram Ambati | Siva Reddy | Mark Steedman
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Transforming Dependency Structures to Logical Forms for Semantic Parsing
Siva Reddy | Oscar Täckström | Michael Collins | Tom Kwiatkowski | Dipanjan Das | Mark Steedman | Mirella Lapata
Transactions of the Association for Computational Linguistics, Volume 4

The strongly typed syntax of grammar formalisms such as CCG, TAG, LFG and HPSG offers a synchronous framework for deriving syntactic structures and semantic logical forms. In contrast—partly due to the lack of a strong type system—dependency structures are easy to annotate and have become a widely used form of syntactic analysis for many languages. However, the lack of a type system makes a formal mechanism for deriving logical forms from dependency structures challenging. We address this by introducing a robust system based on the lambda calculus for deriving neo-Davidsonian logical forms from dependency trees. These logical forms are then used for semantic parsing of natural language to Freebase. Experiments on the Free917 and WebQuestions datasets show that our representation is superior to the original dependency trees and that it outperforms a CCG-based representation on this task. Compared to prior work, we obtain the strongest result to date on Free917 and competitive results on WebQuestions.
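
For concreteness, a neo-Davidsonian logical form of the kind derived here might look as follows for a simple sentence (the example is ours, not taken from the paper):

```latex
% "Disney acquired Pixar" as an event-centric (neo-Davidsonian) logical form:
\exists e.\; \mathrm{acquired}(e) \,\wedge\, \mathrm{arg}_1(e, \mathrm{Disney})
             \,\wedge\, \mathrm{arg}_2(e, \mathrm{Pixar})
```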

pdf bib
Question Answering on Freebase via Relation Extraction and Textual Evidence
Kun Xu | Siva Reddy | Yansong Feng | Songfang Huang | Dongyan Zhao
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


pdf bib
Hindi Word Sketches
Anil Krishna Eragani | Varun Kuchib Hotla | Dipti Misra Sharma | Siva Reddy | Adam Kilgarriff
Proceedings of the 11th International Conference on Natural Language Processing

pdf bib
Large-scale Semantic Parsing without Question-Answer Pairs
Siva Reddy | Mirella Lapata | Mark Steedman
Transactions of the Association for Computational Linguistics, Volume 2

In this paper we introduce a novel semantic parsing approach to query Freebase in natural language without requiring manual annotations or question-answer pairs. Our key insight is to represent natural language via semantic graphs whose topology shares many commonalities with Freebase. Given this representation, we conceptualize semantic parsing as a graph matching problem. Our model converts sentences to semantic graphs using CCG and subsequently grounds them to Freebase guided by denotations as a form of weak supervision. Evaluation experiments on a subset of the Free917 and WebQuestions benchmark datasets show our semantic parser improves over the state of the art.


pdf bib
DSS: Text Similarity Using Lexical Alignments of Form, Distributional Semantics and Grammatical Relations
Diana McCarthy | Spandana Gella | Siva Reddy
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Word Sketches for Turkish
Bharat Ram Ambati | Siva Reddy | Adam Kilgarriff
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Word sketches are one-page, automatic, corpus-based summaries of a word's grammatical and collocational behaviour. In this paper we present word sketches for Turkish. Until now, word sketches have been generated using purpose-built finite-state grammars. Here, we use an existing dependency parser. We describe the process of collecting a 42-million-word corpus, parsing it, and generating word sketches from it. We evaluate the word sketches in comparison with word sketches from a language-independent sketch grammar on an external evaluation task called topic coherence, using the Turkish WordNet to derive an evaluation set of coherent topics.


pdf bib
An Empirical Study on Compositionality in Compound Nouns
Siva Reddy | Diana McCarthy | Suresh Manandhar
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Dynamic and Static Prototype Vectors for Semantic Composition
Siva Reddy | Ioannis Klapaftis | Diana McCarthy | Suresh Manandhar
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Exemplar-Based Word-Space Model for Compositionality Detection: Shared Task System Description
Siva Reddy | Diana McCarthy | Suresh Manandhar | Spandana Gella
Proceedings of the Workshop on Distributional Semantics and Compositionality

pdf bib
Cross Language POS Taggers (and other Tools) for Indian Languages: An Experiment with Kannada using Telugu Resources
Siva Reddy | Serge Sharoff
Proceedings of the Fifth International Workshop On Cross Lingual Information Access


pdf bib
WSD as a Distributed Constraint Optimization Problem
Siva Reddy | Abhilash Inumella
Proceedings of the ACL 2010 Student Research Workshop

pdf bib
IIITH: Domain Specific Word Sense Disambiguation
Siva Reddy | Abhilash Inumella | Diana McCarthy | Mark Stevenson
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
A Corpus Factory for Many Languages
Adam Kilgarriff | Siva Reddy | Jan Pomikálek | Avinesh PVS
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

For many languages there are no large, general-language corpora available. Until the web, all but the largest institutions could do little but shake their heads in dismay, as corpus-building was long, slow and expensive. But with the advent of the web it can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build large corpora. In this paper we describe the method we use, how it has worked, and how various problems were solved, for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai and Vietnamese. We use the BootCaT method: we take a set of ‘seed words’ for the language from Wikipedia. Then, several hundred times over, we (1) randomly select three or four of the seed words, (2) send them as a query to Google, Yahoo or Bing, which returns a ‘search hits’ page, and (3) gather the pages the search engine points to and save the text. This forms the corpus, which we then clean (to remove navigation bars, advertisements, etc.), deduplicate, tokenise (and, if tools are available, lemmatise and part-of-speech tag), and load into our corpus query tool, the Sketch Engine. The corpora we have developed are available for use in the Sketch Engine corpus query tool.
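
A compact sketch of the collection loop described above, with dummy stand-ins for the search-engine and page-fetching APIs (no real Google/Yahoo/Bing client is implied); the cleaning and tagging stages are reduced here to a crude deduplication.

```python
import random
from typing import Callable, List

def bootcat_corpus(seed_words: List[str],
                   search: Callable[[str], List[str]],
                   fetch_text: Callable[[str], str],
                   n_queries: int = 300) -> List[str]:
    """BootCaT-style corpus collection: repeatedly sample a few seed words,
    query a search engine, and keep the page texts. `search` and
    `fetch_text` are hypothetical stand-ins for real engine/scraper APIs."""
    pages = []
    for _ in range(n_queries):
        query = " ".join(random.sample(seed_words, k=min(4, len(seed_words))))
        for url in search(query):
            pages.append(fetch_text(url))
    # Order-preserving exact deduplication stands in for the real pipeline's
    # boilerplate removal and near-duplicate filtering.
    return list(dict.fromkeys(pages))

# Toy usage with dummy search and fetch functions.
corpus = bootcat_corpus(["telugu", "cinema", "food", "river", "school"],
                        search=lambda q: [f"http://example.org/{q}"],
                        fetch_text=lambda url: f"text of {url}",
                        n_queries=5)
print(len(corpus))
```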


pdf bib
All Words Unsupervised Semantic Category Labeling for Hindi
Siva Reddy | Abhilash Inumella | Rajeev Sangal | Soma Paul
Proceedings of the International Conference RANLP-2009