2020
Abstract Syntax as Interlingua: Scaling Up the Grammatical Framework from Controlled Languages to Robust Pipelines
Aarne Ranta
|
Krasimir Angelov
|
Normunds Gruzitis
|
Prasanth Kolachina
Computational Linguistics, Volume 46, Issue 2 - June 2020
Abstract syntax is an interlingual representation used in compilers. Grammatical Framework (GF) applies the abstract syntax idea to natural languages. The development of GF started in 1998, first as a tool for controlled language implementations, where it has gained an established position in both academic and commercial projects. GF provides grammar resources for over 40 languages, enabling accurate generation and translation, as well as grammar engineering tools and components for mobile and Web applications. On the research side, the focus in the last ten years has been on scaling up GF to wide-coverage language processing. The concept of abstract syntax offers a unified view on many other approaches: Universal Dependencies, WordNets, FrameNets, Construction Grammars, and Abstract Meaning Representations. This makes it possible for GF to utilize data from the other approaches and to build robust pipelines. In return, GF can contribute to data-driven approaches by methods to transfer resources from one language to others, to augment data by rule-based generation, to check the consistency of hand-annotated corpora, and to pipe analyses into high-precision semantic back ends. This article gives an overview of the use of abstract syntax as interlingua through both established and emerging NLP applications involving GF.
2019
Bootstrapping UD treebanks for Delexicalized Parsing
Prasanth Kolachina
|
Aarne Ranta
Proceedings of the 22nd Nordic Conference on Computational Linguistics
Standard approaches to treebanking traditionally employ a waterfall model (Sommerville, 2010), where annotation guidelines guide the annotation process and insights from the annotation process in turn lead to subsequent changes in the annotation guidelines. This process remains a very expensive step in creating linguistic resources for a target language, necessitates both linguistic expertise and manual effort to develop the annotations and is subject to inconsistencies in the annotation due to human errors. In this paper, we propose an alternative approach to treebanking—one that requires writing grammars. This approach is motivated specifically in the context of Universal Dependencies, an effort to develop uniform and cross-lingually consistent treebanks across multiple languages. We show here that a bootstrapping approach to treebanking via interlingual grammars is plausible and useful in a process where grammar engineering and treebanking are jointly pursued when creating resources for the target language. We demonstrate the usefulness of synthetic treebanks in the task of delexicalized parsing. Our experiments reveal that simple models for treebank generation are cheaper than human annotated treebanks, especially in the lower ends of the learning curves for delexicalized parsing, which is relevant in particular in the context of low-resource languages.
2017
Replacing OOV Words For Dependency Parsing With Distributional Semantics
Prasanth Kolachina
|
Martin Riedl
|
Chris Biemann
Proceedings of the 21st Nordic Conference on Computational Linguistics
Cross-Lingual Syntax: Relating Grammatical Framework with Universal Dependencies
Aarne Ranta
|
Prasanth Kolachina
|
Thomas Hallgren
Proceedings of the 21st Nordic Conference on Computational Linguistics
From Universal Dependencies to Abstract Syntax
Aarne Ranta
|
Prasanth Kolachina
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)
2016
From Abstract Syntax to Universal Dependencies
Prasanth Kolachina
|
Aarne Ranta
Linguistic Issues in Language Technology, Volume 13, 2016
Abstract syntax is a semantic tree representation that lies between parse trees and logical forms. It abstracts away from word order and lexical items, but contains enough information to generate both surface strings and logical forms. Abstract syntax is commonly used in compilers as an intermediate between source and target languages. Grammatical Framework (GF) is a grammar formalism that generalizes the idea to natural languages, to capture cross-lingual generalizations and perform interlingual translation. As one of the main results, the GF Resource Grammar Library (GF-RGL) has implemented a shared abstract syntax for over 30 languages. Each language has its own set of concrete syntax rules (morphology and syntax), by which it can be generated from the abstract syntax and parsed into it. This paper presents a conversion method from abstract syntax trees to dependency trees. The method is applied for converting GF-RGL trees to Universal Dependencies (UD), which uses a common set of labels for different languages. The correspondence between GF-RGL and UD turns out to be good, and the relatively few discrepancies give rise to interesting questions about universality. The conversion also has potential for practical applications: (1) it makes the GF parser usable as a rule-based dependency parser; (2) it enables bootstrapping UD treebanks from GF treebanks; (3) it defines formal criteria to assess the informal annotation schemes of UD; (4) it gives a method to check the consistency of manually annotated UD trees with respect to the annotation schemes; (5) it makes information from UD treebanks available.
International translation in the Grammatical Framework (GF)
Aarne Ranta
|
Krasimir Angelov
|
Thomas Hallgren
|
Prasanth Kolachina
|
Inari Listenmaa
Proceedings of the 19th Annual Conference of the European Association for Machine Translation: Projects/Products
2015
GF Wide-coverage English-Finnish MT system for WMT 2015
Prasanth Kolachina
|
Aarne Ranta
Proceedings of the Tenth Workshop on Statistical Machine Translation
2014
Benchmarking of English-Hindi parallel corpora
Jayendra Rakesh Yeka
|
Prasanth Kolachina
|
Dipti Misra Sharma
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper we present several parallel corpora for English–Hindi and talk about their natures and domains. We also discuss briefly a few previous attempts in MT for translation from English to Hindi. The lack of uniformly annotated data makes it difficult to compare these attempts and precisely analyze their strengths and shortcomings. With this in mind, we propose a standard pipeline to provide uniform linguistic annotations to these resources using state-of-art NLP technologies. We conclude the paper by presenting evaluation scores of different statistical MT systems on the corpora detailed in this paper for English–Hindi and present the proposed plans for future work. We hope that both these annotated parallel corpora resources and MT systems will serve as benchmarks for future approaches to MT in English–Hindi. This was and remains the main motivation for the attempts detailed in this paper.
2012
Prediction of Learning Curves in Machine Translation
Prasanth Kolachina
|
Nicola Cancedda
|
Marc Dymetman
|
Sriram Venkatapathy
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
How Good are Typological Distances for Determining Genealogical Relationships among Languages?
Taraka Rama
|
Prasanth Kolachina
Proceedings of COLING 2012: Posters
Parsing Any Domain English text to CoNLL dependencies
Sudheer Kolachina
|
Prasanth Kolachina
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
It is well known that accuracies of statistical parsers trained over Penn Treebank on test sets drawn from the same corpus tend to be overestimates of their actual parsing performance. This gives rise to the need for evaluation of parsing performance on corpora from different domains. Evaluating multiple parsers on test sets from different domains can give a detailed picture about the relative strengths/weaknesses of different parsing approaches. Such information is also necessary to guide choice of parser in applications such as machine translation where text from multiple domains needs to be handled. In this paper, we report a benchmarking study of different state-of-art parsers for English, both constituency and dependency. The constituency parser output is converted into CoNLL-style dependency trees so that parsing performance can be compared across formalisms. Specifically, we train rerankers for Berkeley and Stanford parsers to study the usefulness of reranking for handling texts from different domains. The results of our experiments lead to interesting insights about the out-of-domain performance of different English parsers.
2010
Coupling Statistical Machine Translation with Rule-based Transfer and Generation
Arafat Ahsan
|
Prasanth Kolachina
|
Sudheer Kolachina
|
Dipti Misra
|
Rajeev Sangal
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Research Papers
In this paper, we present the insights gained from a detailed study of coupling a highly modular English-Hindi RBMT system with a standard phrase-based SMT system. Coupling the RBMT and SMT systems at various stages in the RBMT pipeline, we observe the effects of the source transformations at each stage on the performance of the coupled MT system. We propose an architecture that systematically exploits the structural transfer and robust generation capabilities of the RBMT system. Working with the English-Hindi language pair, we show that the coupling configurations explored in our experiments help address different aspects of the typological divergence between these languages. In spite of working with very small datasets, we report significant improvements both in terms of BLEU (7.14 and 0.87 over the RBMT and the SMT baselines respectively) and subjective evaluation (relative decrease of 17% in SSER).
Phrase Based Decoding using a Discriminative Model
Prasanth Kolachina
|
Sriram Venkatapathy
|
Srinivas Bangalore
|
Sudheer Kolachina
|
Avinesh PVS
Proceedings of the 4th Workshop on Syntax and Structure in Statistical Translation
Grammar Extraction from Treebanks for Hindi and Telugu
Prasanth Kolachina
|
Sudheer Kolachina
|
Anil Kumar Singh
|
Samar Husain
|
Viswanath Naidu
|
Rajeev Sangal
|
Akshar Bharati
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Grammars play an important role in many Natural Language Processing (NLP) applications. The traditional approach to creating grammars manually, besides being labor-intensive, has several limitations. With the availability of large-scale syntactically annotated treebanks, it is now possible to automatically extract an approximate grammar of a language in any of the existing formalisms from a corresponding treebank. In this paper, we present a basic approach to extract grammars from dependency treebanks of two Indian languages, Hindi and Telugu. The process of grammar extraction requires a generalization mechanism. Towards this end, we explore an approach which relies on generalization of argument structure over the verbs based on their syntactic similarity. Such a generalization counters the effect of data sparseness in the treebanks. A grammar extracted using this system can not only expand already existing knowledge bases for NLP tasks such as parsing, but also aid in the creation of grammars for languages where none exist. Further, we show that the grammar extraction process can help in identifying annotation errors and thus aid in the task of treebank validation.