WebCrawl African : A Multilingual Parallel Corpora for African Languages
Pavanpankaj Vegi, Sivabhavani J, Biswajit Paul, Abhinav Mishra, Prashant Banjare, Prasanna K R, Chitra Viswanathan
Abstract
WebCrawl African is a mixed domain multilingual parallel corpora for a pool of African languages compiled by ANVITA machine translation team of Centre for Artificial Intelligence and Robotics Lab, primarily for accelerating research on low-resource and extremely low-resource machine translation and is part of the submission to WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the data track. The corpora is compiled through web data mining and comprises 695K parallel sentences spanning 74 different language pairs from English and 15 African languages, many of which fall under low and extremely low resource categories. As a measure of corpora usefulness, a MNMT model for 24 African languages to English is trained by combining WebCrawl African corpora with existing corpus and evaluation on FLORES200 shows that inclusion of WebCrawl African corpora could improve BLEU score by 0.01-1.66 for 12 out of 15 African to English translation directions and even by 0.18-0.68 for the 4 out of 9 African to English translation directions which are not part of WebCrawl African corpora. WebCrawl African corpora includes more parallel sentences for many language pairs in comparison to OPUS public repository. This data description paper captures creation of corpora and results obtained along with datasheets. The WebCrawl African corpora is hosted on github repository.- Anthology ID:
- 2022.wmt-1.105
- Volume:
- Proceedings of the Seventh Conference on Machine Translation (WMT)
- Month:
- December
- Year:
- 2022
- Address:
- Abu Dhabi, United Arab Emirates (Hybrid)
- Editors:
- Philipp Koehn, Loïc Barrault, Ondřej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Tom Kocmi, André Martins, Makoto Morishita, Christof Monz, Masaaki Nagata, Toshiaki Nakazawa, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Marco Turchi, Marcos Zampieri
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 1076–1089
- Language:
- URL:
- https://aclanthology.org/2022.wmt-1.105
- DOI:
- Bibkey:
- Cite (ACL):
- Pavanpankaj Vegi, Sivabhavani J, Biswajit Paul, Abhinav Mishra, Prashant Banjare, Prasanna K R, and Chitra Viswanathan. 2022. WebCrawl African : A Multilingual Parallel Corpora for African Languages. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1076–1089, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
- Cite (Informal):
- WebCrawl African : A Multilingual Parallel Corpora for African Languages (Vegi et al., WMT 2022)
- Copy Citation:
- PDF:
- https://aclanthology.org/2022.wmt-1.105.pdf
Export citation
@inproceedings{vegi-etal-2022-webcrawl, title = "{W}eb{C}rawl {A}frican : A Multilingual Parallel Corpora for {A}frican Languages", author = "Vegi, Pavanpankaj and J, Sivabhavani and Paul, Biswajit and Mishra, Abhinav and Banjare, Prashant and K R, Prasanna and Viswanathan, Chitra", editor = {Koehn, Philipp and Barrault, Lo{\"\i}c and Bojar, Ond{\v{r}}ej and Bougares, Fethi and Chatterjee, Rajen and Costa-juss{\`a}, Marta R. and Federmann, Christian and Fishel, Mark and Fraser, Alexander and Freitag, Markus and Graham, Yvette and Grundkiewicz, Roman and Guzman, Paco and Haddow, Barry and Huck, Matthias and Jimeno Yepes, Antonio and Kocmi, Tom and Martins, Andr{\'e} and Morishita, Makoto and Monz, Christof and Nagata, Masaaki and Nakazawa, Toshiaki and Negri, Matteo and N{\'e}v{\'e}ol, Aur{\'e}lie and Neves, Mariana and Popel, Martin and Turchi, Marco and Zampieri, Marcos}, booktitle = "Proceedings of the Seventh Conference on Machine Translation (WMT)", month = dec, year = "2022", address = "Abu Dhabi, United Arab Emirates (Hybrid)", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.wmt-1.105", pages = "1076--1089", abstract = "WebCrawl African is a mixed domain multilingual parallel corpora for a pool of African languages compiled by ANVITA machine translation team of Centre for Artificial Intelligence and Robotics Lab, primarily for accelerating research on low-resource and extremely low-resource machine translation and is part of the submission to WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the data track. The corpora is compiled through web data mining and comprises 695K parallel sentences spanning 74 different language pairs from English and 15 African languages, many of which fall under low and extremely low resource categories. As a measure of corpora usefulness, a MNMT model for 24 African languages to English is trained by combining WebCrawl African corpora with existing corpus and evaluation on FLORES200 shows that inclusion of WebCrawl African corpora could improve BLEU score by 0.01-1.66 for 12 out of 15 African to English translation directions and even by 0.18-0.68 for the 4 out of 9 African to English translation directions which are not part of WebCrawl African corpora. WebCrawl African corpora includes more parallel sentences for many language pairs in comparison to OPUS public repository. This data description paper captures creation of corpora and results obtained along with datasheets. The WebCrawl African corpora is hosted on github repository.", }
<?xml version="1.0" encoding="UTF-8"?> <modsCollection xmlns="http://www.loc.gov/mods/v3"> <mods ID="vegi-etal-2022-webcrawl"> <titleInfo> <title>WebCrawl African : A Multilingual Parallel Corpora for African Languages</title> </titleInfo> <name type="personal"> <namePart type="given">Pavanpankaj</namePart> <namePart type="family">Vegi</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Sivabhavani</namePart> <namePart type="family">J</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Biswajit</namePart> <namePart type="family">Paul</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Abhinav</namePart> <namePart type="family">Mishra</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Prashant</namePart> <namePart type="family">Banjare</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Prasanna</namePart> <namePart type="family">K R</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Chitra</namePart> <namePart type="family">Viswanathan</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2022-12</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Proceedings of the Seventh Conference on Machine Translation (WMT)</title> </titleInfo> <name type="personal"> <namePart type="given">Philipp</namePart> <namePart type="family">Koehn</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Loïc</namePart> <namePart type="family">Barrault</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Ondřej</namePart> <namePart type="family">Bojar</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Fethi</namePart> <namePart type="family">Bougares</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Rajen</namePart> <namePart type="family">Chatterjee</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Marta</namePart> <namePart type="given">R</namePart> <namePart type="family">Costa-jussà</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christian</namePart> <namePart type="family">Federmann</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mark</namePart> <namePart type="family">Fishel</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Alexander</namePart> <namePart type="family">Fraser</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Markus</namePart> <namePart type="family">Freitag</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Yvette</namePart> <namePart type="family">Graham</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Roman</namePart> <namePart type="family">Grundkiewicz</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Paco</namePart> <namePart type="family">Guzman</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Barry</namePart> <namePart type="family">Haddow</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matthias</namePart> <namePart type="family">Huck</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Antonio</namePart> <namePart type="family">Jimeno Yepes</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Tom</namePart> <namePart type="family">Kocmi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">André</namePart> <namePart type="family">Martins</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Makoto</namePart> <namePart type="family">Morishita</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christof</namePart> <namePart type="family">Monz</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Masaaki</namePart> <namePart type="family">Nagata</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Toshiaki</namePart> <namePart type="family">Nakazawa</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matteo</namePart> <namePart type="family">Negri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Aurélie</namePart> <namePart type="family">Névéol</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mariana</namePart> <namePart type="family">Neves</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Martin</namePart> <namePart type="family">Popel</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Marco</namePart> <namePart type="family">Turchi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Marcos</namePart> <namePart type="family">Zampieri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>Association for Computational Linguistics</publisher> <place> <placeTerm type="text">Abu Dhabi, United Arab Emirates (Hybrid)</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>WebCrawl African is a mixed domain multilingual parallel corpora for a pool of African languages compiled by ANVITA machine translation team of Centre for Artificial Intelligence and Robotics Lab, primarily for accelerating research on low-resource and extremely low-resource machine translation and is part of the submission to WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the data track. The corpora is compiled through web data mining and comprises 695K parallel sentences spanning 74 different language pairs from English and 15 African languages, many of which fall under low and extremely low resource categories. As a measure of corpora usefulness, a MNMT model for 24 African languages to English is trained by combining WebCrawl African corpora with existing corpus and evaluation on FLORES200 shows that inclusion of WebCrawl African corpora could improve BLEU score by 0.01-1.66 for 12 out of 15 African to English translation directions and even by 0.18-0.68 for the 4 out of 9 African to English translation directions which are not part of WebCrawl African corpora. WebCrawl African corpora includes more parallel sentences for many language pairs in comparison to OPUS public repository. This data description paper captures creation of corpora and results obtained along with datasheets. The WebCrawl African corpora is hosted on github repository.</abstract> <identifier type="citekey">vegi-etal-2022-webcrawl</identifier> <location> <url>https://aclanthology.org/2022.wmt-1.105</url> </location> <part> <date>2022-12</date> <extent unit="page"> <start>1076</start> <end>1089</end> </extent> </part> </mods> </modsCollection>
%0 Conference Proceedings %T WebCrawl African : A Multilingual Parallel Corpora for African Languages %A Vegi, Pavanpankaj %A J, Sivabhavani %A Paul, Biswajit %A Mishra, Abhinav %A Banjare, Prashant %A K R, Prasanna %A Viswanathan, Chitra %Y Koehn, Philipp %Y Barrault, Loïc %Y Bojar, Ondřej %Y Bougares, Fethi %Y Chatterjee, Rajen %Y Costa-jussà, Marta R. %Y Federmann, Christian %Y Fishel, Mark %Y Fraser, Alexander %Y Freitag, Markus %Y Graham, Yvette %Y Grundkiewicz, Roman %Y Guzman, Paco %Y Haddow, Barry %Y Huck, Matthias %Y Jimeno Yepes, Antonio %Y Kocmi, Tom %Y Martins, André %Y Morishita, Makoto %Y Monz, Christof %Y Nagata, Masaaki %Y Nakazawa, Toshiaki %Y Negri, Matteo %Y Névéol, Aurélie %Y Neves, Mariana %Y Popel, Martin %Y Turchi, Marco %Y Zampieri, Marcos %S Proceedings of the Seventh Conference on Machine Translation (WMT) %D 2022 %8 December %I Association for Computational Linguistics %C Abu Dhabi, United Arab Emirates (Hybrid) %F vegi-etal-2022-webcrawl %X WebCrawl African is a mixed domain multilingual parallel corpora for a pool of African languages compiled by ANVITA machine translation team of Centre for Artificial Intelligence and Robotics Lab, primarily for accelerating research on low-resource and extremely low-resource machine translation and is part of the submission to WMT 2022 shared task on Large-Scale Machine Translation Evaluation for African Languages under the data track. The corpora is compiled through web data mining and comprises 695K parallel sentences spanning 74 different language pairs from English and 15 African languages, many of which fall under low and extremely low resource categories. As a measure of corpora usefulness, a MNMT model for 24 African languages to English is trained by combining WebCrawl African corpora with existing corpus and evaluation on FLORES200 shows that inclusion of WebCrawl African corpora could improve BLEU score by 0.01-1.66 for 12 out of 15 African to English translation directions and even by 0.18-0.68 for the 4 out of 9 African to English translation directions which are not part of WebCrawl African corpora. WebCrawl African corpora includes more parallel sentences for many language pairs in comparison to OPUS public repository. This data description paper captures creation of corpora and results obtained along with datasheets. The WebCrawl African corpora is hosted on github repository. %U https://aclanthology.org/2022.wmt-1.105 %P 1076-1089
Markdown (Informal)
[WebCrawl African : A Multilingual Parallel Corpora for African Languages](https://aclanthology.org/2022.wmt-1.105) (Vegi et al., WMT 2022)
- WebCrawl African : A Multilingual Parallel Corpora for African Languages (Vegi et al., WMT 2022)
ACL
- Pavanpankaj Vegi, Sivabhavani J, Biswajit Paul, Abhinav Mishra, Prashant Banjare, Prasanna K R, and Chitra Viswanathan. 2022. WebCrawl African : A Multilingual Parallel Corpora for African Languages. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 1076–1089, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.