The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese

Miki Nishioka, Shiro Akasegawa


Abstract
In this paper, we discuss our creation of a web corpus (COSH) of spoken Hindi, one of the Indo-Aryan languages spoken mainly in the Indian subcontinent. We also point out notable problems we encountered with the web corpus and its special concordancer. After describing these technical problems, especially those regarding annotation produced by Shiva Reddy’s tagger, we discuss how they can be solved when using COSH for linguistic studies. Finally, we mention the kinds of linguistic research that we, as non-native speakers of Hindi, can do using the corpus, especially in pragmatics and semantics, and from a comparative viewpoint with Japanese.
Anthology ID:
W16-3712
Volume:
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Dekai Wu, Pushpak Bhattacharyya
Venue:
WSSANLP
Publisher:
The COLING 2016 Organizing Committee
Pages:
114–123
URL:
https://aclanthology.org/W16-3712
Cite (ACL):
Miki Nishioka and Shiro Akasegawa. 2016. The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese. In Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016), pages 114–123, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
The development of a web corpus of Hindi language and corpus-based comparative studies to Japanese (Nishioka & Akasegawa, WSSANLP 2016)
PDF:
https://aclanthology.org/W16-3712.pdf