Findings of the WMT 2023 Shared Task on Parallel Data Curation

Steve Sloto; Brian Thompson; Huda Khayrallah; Tobias Domhan; Thamme Gowda; Philipp Koehn

doi:10.18653/v1/2023.wmt-1.5

Abstract

Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic.

Anthology ID: 2023.wmt-1.5
Volume: Proceedings of the Eighth Conference on Machine Translation
Month: December
Year: 2023
Address: Singapore
Editors: Philipp Koehn, Barry Haddow, Tom Kocmi, Christof Monz
Venue: WMT
SIG: SIGMT
Publisher: Association for Computational Linguistics
Pages: 95–102
URL: https://aclanthology.org/2023.wmt-1.5/
DOI: 10.18653/v1/2023.wmt-1.5
Cite (ACL): Steve Sloto, Brian Thompson, Huda Khayrallah, Tobias Domhan, Thamme Gowda, and Philipp Koehn. 2023. Findings of the WMT 2023 Shared Task on Parallel Data Curation. In Proceedings of the Eighth Conference on Machine Translation, pages 95–102, Singapore. Association for Computational Linguistics.
Cite (Informal): Findings of the WMT 2023 Shared Task on Parallel Data Curation (Sloto et al., WMT 2023)
PDF: https://aclanthology.org/2023.wmt-1.5.pdf

@inproceedings{sloto-etal-2023-findings,
    title = "Findings of the {WMT} 2023 Shared Task on Parallel Data Curation",
    author = "Sloto, Steve  and
      Thompson, Brian  and
      Khayrallah, Huda  and
      Domhan, Tobias  and
      Gowda, Thamme  and
      Koehn, Philipp",
    editor = "Koehn, Philipp  and
      Haddow, Barry  and
      Kocmi, Tom  and
      Monz, Christof",
    booktitle = "Proceedings of the Eighth Conference on Machine Translation",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.wmt-1.5/",
    doi = "10.18653/v1/2023.wmt-1.5",
    pages = "95--102",
    abstract = "Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="sloto-etal-2023-findings">
    <titleInfo>
        <title>Findings of the WMT 2023 Shared Task on Parallel Data Curation</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Steve</namePart>
        <namePart type="family">Sloto</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Brian</namePart>
        <namePart type="family">Thompson</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Huda</namePart>
        <namePart type="family">Khayrallah</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Tobias</namePart>
        <namePart type="family">Domhan</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Thamme</namePart>
        <namePart type="family">Gowda</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Philipp</namePart>
        <namePart type="family">Koehn</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023-12</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Eighth Conference on Machine Translation</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Philipp</namePart>
            <namePart type="family">Koehn</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Barry</namePart>
            <namePart type="family">Haddow</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Tom</namePart>
            <namePart type="family">Kocmi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Christof</namePart>
            <namePart type="family">Monz</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Singapore</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic.</abstract>
    <identifier type="citekey">sloto-etal-2023-findings</identifier>
    <identifier type="doi">10.18653/v1/2023.wmt-1.5</identifier>
    <location>
        <url>https://aclanthology.org/2023.wmt-1.5/</url>
    </location>
    <part>
        <date>2023-12</date>
        <extent unit="page">
            <start>95</start>
            <end>102</end>
        </extent>
    </part>
</mods>
</modsCollection>

%0 Conference Proceedings
%T Findings of the WMT 2023 Shared Task on Parallel Data Curation
%A Sloto, Steve
%A Thompson, Brian
%A Khayrallah, Huda
%A Domhan, Tobias
%A Gowda, Thamme
%A Koehn, Philipp
%Y Koehn, Philipp
%Y Haddow, Barry
%Y Kocmi, Tom
%Y Monz, Christof
%S Proceedings of the Eighth Conference on Machine Translation
%D 2023
%8 December
%I Association for Computational Linguistics
%C Singapore
%F sloto-etal-2023-findings
%X Building upon prior WMT shared tasks in document alignment and sentence filtering, we posed the open-ended shared task of finding the best subset of possible training data from a collection of Estonian-Lithuanian web data. Participants could focus on any portion of the end-to-end data curation pipeline, including alignment and filtering. We evaluated results based on downstream machine translation quality. We release processed Common Crawl data, along with various intermediate states from a strong baseline system, which we believe will enable future research on this topic.
%R 10.18653/v1/2023.wmt-1.5
%U https://aclanthology.org/2023.wmt-1.5/
%U https://doi.org/10.18653/v1/2023.wmt-1.5
%P 95-102

Markdown (Informal)

[Findings of the WMT 2023 Shared Task on Parallel Data Curation](https://aclanthology.org/2023.wmt-1.5/) (Sloto et al., WMT 2023)