Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering
Philipp Koehn, Huda Khayrallah, Kenneth Heafield, Mikel L. Forcada
Abstract
We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities participated in this task.- Anthology ID:
- W18-6453
- Volume:
- Proceedings of the Third Conference on Machine Translation: Shared Task Papers
- Month:
- October
- Year:
- 2018
- Address:
- Belgium, Brussels
- Editors:
- Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Matt Post, Lucia Specia, Marco Turchi, Karin Verspoor
- Venue:
- WMT
- SIG:
- SIGMT
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 726–739
- Language:
- URL:
- https://aclanthology.org/W18-6453
- DOI:
- 10.18653/v1/W18-6453
- Bibkey:
- Cite (ACL):
- Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739, Belgium, Brussels. Association for Computational Linguistics.
- Cite (Informal):
- Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering (Koehn et al., WMT 2018)
- Copy Citation:
- PDF:
- https://aclanthology.org/W18-6453.pdf
Export citation
@inproceedings{koehn-etal-2018-findings, title = "Findings of the {WMT} 2018 Shared Task on Parallel Corpus Filtering", author = "Koehn, Philipp and Khayrallah, Huda and Heafield, Kenneth and Forcada, Mikel L.", editor = "Bojar, Ond{\v{r}}ej and Chatterjee, Rajen and Federmann, Christian and Fishel, Mark and Graham, Yvette and Haddow, Barry and Huck, Matthias and Yepes, Antonio Jimeno and Koehn, Philipp and Monz, Christof and Negri, Matteo and N{\'e}v{\'e}ol, Aur{\'e}lie and Neves, Mariana and Post, Matt and Specia, Lucia and Turchi, Marco and Verspoor, Karin", booktitle = "Proceedings of the Third Conference on Machine Translation: Shared Task Papers", month = oct, year = "2018", address = "Belgium, Brussels", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/W18-6453", doi = "10.18653/v1/W18-6453", pages = "726--739", abstract = "We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1{\%} and 10{\%} of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities participated in this task.", }
<?xml version="1.0" encoding="UTF-8"?> <modsCollection xmlns="http://www.loc.gov/mods/v3"> <mods ID="koehn-etal-2018-findings"> <titleInfo> <title>Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering</title> </titleInfo> <name type="personal"> <namePart type="given">Philipp</namePart> <namePart type="family">Koehn</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Huda</namePart> <namePart type="family">Khayrallah</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Kenneth</namePart> <namePart type="family">Heafield</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mikel</namePart> <namePart type="given">L</namePart> <namePart type="family">Forcada</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2018-10</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Proceedings of the Third Conference on Machine Translation: Shared Task Papers</title> </titleInfo> <name type="personal"> <namePart type="given">Ondřej</namePart> <namePart type="family">Bojar</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Rajen</namePart> <namePart type="family">Chatterjee</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christian</namePart> <namePart type="family">Federmann</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mark</namePart> <namePart type="family">Fishel</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Yvette</namePart> <namePart type="family">Graham</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Barry</namePart> <namePart type="family">Haddow</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matthias</namePart> <namePart type="family">Huck</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Antonio</namePart> <namePart type="given">Jimeno</namePart> <namePart type="family">Yepes</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Philipp</namePart> <namePart type="family">Koehn</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christof</namePart> <namePart type="family">Monz</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matteo</namePart> <namePart type="family">Negri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Aurélie</namePart> <namePart type="family">Névéol</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Mariana</namePart> <namePart type="family">Neves</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Matt</namePart> <namePart type="family">Post</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Lucia</namePart> <namePart type="family">Specia</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Marco</namePart> <namePart type="family">Turchi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Karin</namePart> <namePart type="family">Verspoor</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>Association for Computational Linguistics</publisher> <place> <placeTerm type="text">Belgium, Brussels</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities participated in this task.</abstract> <identifier type="citekey">koehn-etal-2018-findings</identifier> <identifier type="doi">10.18653/v1/W18-6453</identifier> <location> <url>https://aclanthology.org/W18-6453</url> </location> <part> <date>2018-10</date> <extent unit="page"> <start>726</start> <end>739</end> </extent> </part> </mods> </modsCollection>
%0 Conference Proceedings %T Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering %A Koehn, Philipp %A Khayrallah, Huda %A Heafield, Kenneth %A Forcada, Mikel L. %Y Bojar, Ondřej %Y Chatterjee, Rajen %Y Federmann, Christian %Y Fishel, Mark %Y Graham, Yvette %Y Haddow, Barry %Y Huck, Matthias %Y Yepes, Antonio Jimeno %Y Koehn, Philipp %Y Monz, Christof %Y Negri, Matteo %Y Névéol, Aurélie %Y Neves, Mariana %Y Post, Matt %Y Specia, Lucia %Y Turchi, Marco %Y Verspoor, Karin %S Proceedings of the Third Conference on Machine Translation: Shared Task Papers %D 2018 %8 October %I Association for Computational Linguistics %C Belgium, Brussels %F koehn-etal-2018-findings %X We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities participated in this task. %R 10.18653/v1/W18-6453 %U https://aclanthology.org/W18-6453 %U https://doi.org/10.18653/v1/W18-6453 %P 726-739
Markdown (Informal)
[Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering](https://aclanthology.org/W18-6453) (Koehn et al., WMT 2018)
- Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering (Koehn et al., WMT 2018)
ACL
- Philipp Koehn, Huda Khayrallah, Kenneth Heafield, and Mikel L. Forcada. 2018. Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 726–739, Belgium, Brussels. Association for Computational Linguistics.