Abstract
Dravidian languages, such as Kannada and Tamil, are notoriously difficult for state-of-the-art neural models to translate. This difficulty stems from the fact that these languages are both morphologically very rich and low-resourced. In this paper, we focus on subword segmentation and evaluate Linguistically Motivated Vocabulary Reduction (LMVR) against the more commonly used SentencePiece (SP) for the task of translating from English into four different Dravidian languages. Additionally, we investigate the optimal subword vocabulary size for each language. We find that SP is the overall best choice for segmentation, and that larger dictionary sizes lead to higher translation quality.
- Anthology ID:
- 2021.wat-1.21
- Volume:
- Proceedings of the 8th Workshop on Asian Translation (WAT2021)
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Toshiaki Nakazawa, Hideki Nakayama, Isao Goto, Hideya Mino, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Shohei Higashiyama, Hiroshi Manabe, Win Pa Pa, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Katsuhito Sudoh, Sadao Kurohashi, Pushpak Bhattacharyya
- Venue:
- WAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 181–190
- URL:
- https://aclanthology.org/2021.wat-1.21
- DOI:
- 10.18653/v1/2021.wat-1.21
- Bibkey:
- dhar-etal-2021-optimal
- Cite (ACL):
- Prajit Dhar, Arianna Bisazza, and Gertjan van Noord. 2021. Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), pages 181–190, Online. Association for Computational Linguistics.
- Cite (Informal):
- Optimal Word Segmentation for Neural Machine Translation into Dravidian Languages (Dhar et al., WAT 2021)
- PDF:
- https://aclanthology.org/2021.wat-1.21.pdf
- Data
- PMIndia