Shyam Sundar Agrawal
Also published as: Shyam Agrawal
2014
Statistical Analysis of Multilingual Text Corpus and Development of Language Models
Shyam Sundar Agrawal | Abhimanue | Shweta Bansal | Minakshi Mahajan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper presents two studies: a statistical analysis of three languages (Hindi, Punjabi and Nepali) and the development of language models for three Indian languages (Indian English, Punjabi and Nepali). The main objective is to find distinctions among these languages and to develop language models for their identification. Detailed statistical analysis has been carried out to compute entropy, perplexity, vocabulary growth rate and related measures. Based on these statistical features, a comparative analysis has been done to find the similarities and differences among the languages. Subsequently, an effort has been made to develop a trigram model of Indian English, Punjabi and Nepali. A corpus of 500,000 words for each language has been collected and used to develop unigram, bigram and trigram models. The models have been tried on two different databases: parallel corpora of French and English, and non-parallel corpora of Indian English, Punjabi and Nepali. In the second case, the performance of the model is comparable. Implementation on the Java platform made it possible to handle a very large database at high computational speed. Furthermore, enhancements such as smoothing, discounting, back-off and interpolation have been included to design an effective model. The results obtained from this experiment are described. The information can be useful for the development of an Automatic Speech Language Identification System.
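The modelling idea in this abstract can be illustrated with a minimal sketch: an interpolated trigram model is trained per language, and a text is assigned to the language whose model gives it the lowest perplexity. This is not the authors' implementation (the abstract states that theirs was written in Java). The function names, the interpolation weights, the add-one unigram smoothing and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def train(tokens):
    """Count unigrams, bigrams and trigrams in a list of tokens."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri, len(tokens)

def prob(model, w1, w2, w3, lambdas=(0.1, 0.3, 0.6)):
    """Linearly interpolated P(w3 | w1, w2); the weights are assumed, not tuned."""
    uni, bi, tri, n = model
    vocab = len(uni) + 1  # one extra slot for unseen words
    p1 = (uni[w3] + 1) / (n + vocab)  # add-one smoothed unigram, never zero
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    l1, l2, l3 = lambdas
    return l1 * p1 + l2 * p2 + l3 * p3

def perplexity(model, tokens):
    """Perplexity = 2 ** cross-entropy, with entropy in bits per predicted word."""
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    if not trigrams:
        raise ValueError("need at least three tokens")
    log_sum = sum(math.log2(prob(model, *t)) for t in trigrams)
    return 2 ** (-log_sum / len(trigrams))

def identify(models, tokens):
    """Return the language whose model assigns the lowest perplexity."""
    return min(models, key=lambda lang: perplexity(models[lang], tokens))

# Toy corpora for illustration only; the paper used 500,000 words per language.
models = {
    "en": train("the cat sat on the mat and the dog sat on the rug".split()),
    "fr": train("le chat est sur le tapis et le chien est sur le lit".split()),
}
guess = identify(models, "the dog sat on the mat".split())
print(guess)
```

The add-one unigram term keeps every probability above zero, so the logarithm is always defined even for words the model never saw. A back-off scheme would instead use the lower-order estimate only when the higher-order count is zero.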
2012
Development of Text and Speech database for Hindi and Indian English specific to Mobile Communication environment
Shyam Agrawal | Shweta Sinha | Pooja Singh | Jesper Olson
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes the method and experiences of text and speech data collection for mobile communication in Indian English and Hindi. The primary data collection takes the form of a large number of messages exchanged as part of personal communication among native speakers of Hindi and Indian speakers of English. To capture the versatility of mobile communication in Hindi and English, 12 domains were identified for collecting the text corpus from speakers belonging to different age groups, sexes and dialects. The raw text, which contained slang and unconventional grammar, was cleaned using language grammar rules and then tagged and expanded to explain the context-specific meaning of the words. Texts from 1163 participants from Hindi-speaking regions and 1405 English users were used to create 13 prompt sheets containing 630 phonetically rich sentences created using special software. Each prompt sheet was recorded by at least 7 users simultaneously in three channels, with a total of 100 speakers recorded and annotated. The work is a step forward in the development of standards for mobile text and speech data collection for Indian languages. Keywords: speech database, text analysis, mobile communication, Hindi and Indian English speech, multilingual speech processing.
Co-authors
- Abhimanue 1
- Shweta Bansal 1
- Minakshi Mahajan 1
- Shweta Sinha 1
- Pooja Singh 1
Venues
- LREC 2