Data Contamination Report from the 2024 CONDA Shared Task
Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D’Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, Emily Silcock, Kateryna Solonko, David Stap, Mihai Surdeanu, Yu-Min Tseng, Vishaal Udandarao, Zengzhi Wang, Ruijie Xu, Jinglin Yang
Abstract
The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large-scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in currently available datasets and models. The goal of the shared task and associated database is to assist the community in understanding the extent of the problem and to assist researchers in avoiding reporting evaluation results on known contaminated resources. The shared task provides a structured, centralized public database for the collection of contamination evidence, open to contributions from the community via GitHub pull requests. This first compilation paper is based on 566 reported entries over 91 contaminated sources from a total of 23 contributors. The details of the individual contamination events are available on the platform. The platform continues to be online, open to contributions from the community.
- Anthology ID: 2024.conda-1.4
- Volume: Proceedings of the 1st Workshop on Data Contamination (CONDA)
- Month: August
- Year: 2024
- Address: Bangkok, Thailand
- Editors: Oscar Sainz, Iker García Ferrero, Eneko Agirre, Jon Ander Campos, Alon Jacovi, Yanai Elazar, Yoav Goldberg
- Venues: CONDA | WS
- Publisher: Association for Computational Linguistics
- Pages: 41–56
- URL: https://aclanthology.org/2024.conda-1.4
- DOI: 10.18653/v1/2024.conda-1.4
- Bibkey: sainz-etal-2024-data
- Cite (ACL): Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D’Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, et al. 2024. Data Contamination Report from the 2024 CONDA Shared Task. In Proceedings of the 1st Workshop on Data Contamination (CONDA), pages 41–56, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal): Data Contamination Report from the 2024 CONDA Shared Task (Sainz et al., CONDA-WS 2024)
- PDF: https://aclanthology.org/2024.conda-1.4.pdf