2012
An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora
K Saravanan | Monojit Choudhury | Raghavendra Udupa | A Kumaran
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Named Entities (NEs) that occur in natural language text are important, especially due to the advent of social media, and they play a critical role in the development of many natural language technologies. In this paper, we systematically analyze the patterns of occurrence and co-occurrence of NEs in standard large English news corpora, providing valuable insight for the understanding of the corpus and subsequently paving the way for the development of technologies that rely critically on handling NEs. We use two distinct approaches: a conventional statistical analysis that measures and reports the occurrence patterns of NEs in terms of frequency, growth, etc., and a complex-networks-based analysis that measures the co-occurrence pattern in terms of connectivity, degree distribution, small-world phenomenon, etc. Our analysis indicates that: (i) NEs form an open set in corpora and grow linearly, (ii) NEs divide into a kernel and a periphery, with the large periphery occurring rarely, and (iii) there is strong evidence of a small-world phenomenon. Our findings may suggest effective ways to construct NE lexicons to aid efficient development of several natural language technologies.
2010
WikiBABEL: A System for Multilingual Wikipedia Content
A. Kumaran | Naren Datha | B. Ashok | K. Saravanan | Anil Ande | Ashwani Sharma | Sridhar Vedantham | Vidya Natampally | Vikram Dendi | Sandor Maurice
Proceedings of the Workshop on Collaborative Translation: technology, crowdsourcing, and the translator perspective
This position paper outlines our project, WikiBABEL, which will be released as an open source project for the creation of multilingual Wikipedia content, and has the potential to produce parallel data as a by-product for Machine Translation systems research. We discuss its architecture, functionality and user-experience components, and briefly present an analysis that emphasizes the resonance that the WikiBABEL design and the planned involvement with Wikipedia have with the open source communities in general and Wikipedians in particular.
2009
MINT: A Method for Effective and Scalable Mining of Named Entity Transliterations from Large Comparable Corpora
Raghavendra Udupa | K Saravanan | A Kumaran | Jagadeesh Jagarlamudi
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)
WikiBABEL: A Wiki-style Platform for Creation of Parallel Data
A Kumaran | K Saravanan | Naren Datha | B Ashok | Vikram Dendi
Proceedings of the ACL-IJCNLP 2009 Software Demonstrations
2008
Some Experiments in Mining Named Entity Transliteration Pairs from Comparable Corpora
K Saravanan | A Kumaran
Proceedings of the 2nd workshop on Cross Lingual Information Access (CLIA) Addressing the Information Need of Multilingual Societies
Designing a Common POS-Tagset Framework for Indian Languages
Sankaran Baskaran | Kalika Bali | Tanmoy Bhattacharya | Pushpak Bhattacharyya | Girish Nath Jha | Rajendran S | Saravanan K | Sobha L | Subbarao K V.
Proceedings of the 6th Workshop on Asian Language Resources
A Common Parts-of-Speech Tagset Framework for Indian Languages
Baskaran Sankaran | Kalika Bali | Monojit Choudhury | Tanmoy Bhattacharya | Pushpak Bhattacharyya | Girish Nath Jha | S. Rajendran | K. Saravanan | L. Sobha | K.V. Subbarao
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We present a universal Parts-of-Speech (POS) tagset framework that covers most of the Indian languages (ILs) and follows a hierarchical and decomposable tagset schema. In spite of a significant number of speakers, there is no workable POS tagset and tagger for most ILs, which serve as fundamental building blocks for NLP research. Existing IL POS tagsets are often designed for a specific language; the few that have been designed for multiple languages cover only shallow linguistic features, ignoring the linguistic richness and idiosyncrasies of the languages. The new framework proposed here addresses these deficiencies in an efficient and principled manner. We follow a hierarchical schema similar to that of EAGLES, and this enables the framework to be flexible enough to capture rich features of a language or language family, even while capturing the shared linguistic structures in a methodical way. The proposed common framework further facilitates the sharing and reusability of scarce resources in these languages and ensures cross-linguistic compatibility.