Addressing Leakage in Self-Supervised Contextualized Code Retrieval
Johannes Villmow, Viola Campos, Adrian Ulges, Ulrich Schwanecke
Abstract
We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that the proposed approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection.- Anthology ID:
- 2022.coling-1.84
- Volume:
- Proceedings of the 29th International Conference on Computational Linguistics
- Month:
- October
- Year:
- 2022
- Address:
- Gyeongju, Republic of Korea
- Editors:
- Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
- Venue:
- COLING
- SIG:
- Publisher:
- International Committee on Computational Linguistics
- Note:
- Pages:
- 1006–1013
- Language:
- URL:
- https://aclanthology.org/2022.coling-1.84
- DOI:
- Bibkey:
- Cite (ACL):
- Johannes Villmow, Viola Campos, Adrian Ulges, and Ulrich Schwanecke. 2022. Addressing Leakage in Self-Supervised Contextualized Code Retrieval. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1006–1013, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Cite (Informal):
- Addressing Leakage in Self-Supervised Contextualized Code Retrieval (Villmow et al., COLING 2022)
- Copy Citation:
- PDF:
- https://aclanthology.org/2022.coling-1.84.pdf
- Data
- CodeXGLUE
Export citation
@inproceedings{villmow-etal-2022-addressing, title = "Addressing Leakage in Self-Supervised Contextualized Code Retrieval", author = "Villmow, Johannes and Campos, Viola and Ulges, Adrian and Schwanecke, Ulrich", editor = "Calzolari, Nicoletta and Huang, Chu-Ren and Kim, Hansaem and Pustejovsky, James and Wanner, Leo and Choi, Key-Sun and Ryu, Pum-Mo and Chen, Hsin-Hsi and Donatelli, Lucia and Ji, Heng and Kurohashi, Sadao and Paggio, Patrizia and Xue, Nianwen and Kim, Seokhwan and Hahm, Younggyun and He, Zhong and Lee, Tony Kyungil and Santus, Enrico and Bond, Francis and Na, Seung-Hoon", booktitle = "Proceedings of the 29th International Conference on Computational Linguistics", month = oct, year = "2022", address = "Gyeongju, Republic of Korea", publisher = "International Committee on Computational Linguistics", url = "https://aclanthology.org/2022.coling-1.84", pages = "1006--1013", abstract = "We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that the proposed approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection.", }
<?xml version="1.0" encoding="UTF-8"?> <modsCollection xmlns="http://www.loc.gov/mods/v3"> <mods ID="villmow-etal-2022-addressing"> <titleInfo> <title>Addressing Leakage in Self-Supervised Contextualized Code Retrieval</title> </titleInfo> <name type="personal"> <namePart type="given">Johannes</namePart> <namePart type="family">Villmow</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Viola</namePart> <namePart type="family">Campos</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Adrian</namePart> <namePart type="family">Ulges</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Ulrich</namePart> <namePart type="family">Schwanecke</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2022-10</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Proceedings of the 29th International Conference on Computational Linguistics</title> </titleInfo> <name type="personal"> <namePart type="given">Nicoletta</namePart> <namePart type="family">Calzolari</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Chu-Ren</namePart> <namePart type="family">Huang</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Hansaem</namePart> <namePart type="family">Kim</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">James</namePart> <namePart type="family">Pustejovsky</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Leo</namePart> <namePart type="family">Wanner</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Key-Sun</namePart> <namePart type="family">Choi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Pum-Mo</namePart> <namePart type="family">Ryu</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Hsin-Hsi</namePart> <namePart type="family">Chen</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Lucia</namePart> <namePart type="family">Donatelli</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Heng</namePart> <namePart type="family">Ji</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Sadao</namePart> <namePart type="family">Kurohashi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Patrizia</namePart> <namePart type="family">Paggio</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Nianwen</namePart> <namePart type="family">Xue</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Seokhwan</namePart> <namePart type="family">Kim</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Younggyun</namePart> <namePart type="family">Hahm</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Zhong</namePart> <namePart type="family">He</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Tony</namePart> <namePart type="given">Kyungil</namePart> <namePart type="family">Lee</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Enrico</namePart> <namePart type="family">Santus</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Francis</namePart> <namePart type="family">Bond</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Seung-Hoon</namePart> <namePart type="family">Na</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>International Committee on Computational Linguistics</publisher> <place> <placeTerm type="text">Gyeongju, Republic of Korea</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that the proposed approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection.</abstract> <identifier type="citekey">villmow-etal-2022-addressing</identifier> <location> <url>https://aclanthology.org/2022.coling-1.84</url> </location> <part> <date>2022-10</date> <extent unit="page"> <start>1006</start> <end>1013</end> </extent> </part> </mods> </modsCollection>
%0 Conference Proceedings %T Addressing Leakage in Self-Supervised Contextualized Code Retrieval %A Villmow, Johannes %A Campos, Viola %A Ulges, Adrian %A Schwanecke, Ulrich %Y Calzolari, Nicoletta %Y Huang, Chu-Ren %Y Kim, Hansaem %Y Pustejovsky, James %Y Wanner, Leo %Y Choi, Key-Sun %Y Ryu, Pum-Mo %Y Chen, Hsin-Hsi %Y Donatelli, Lucia %Y Ji, Heng %Y Kurohashi, Sadao %Y Paggio, Patrizia %Y Xue, Nianwen %Y Kim, Seokhwan %Y Hahm, Younggyun %Y He, Zhong %Y Lee, Tony Kyungil %Y Santus, Enrico %Y Bond, Francis %Y Na, Seung-Hoon %S Proceedings of the 29th International Conference on Computational Linguistics %D 2022 %8 October %I International Committee on Computational Linguistics %C Gyeongju, Republic of Korea %F villmow-etal-2022-addressing %X We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that the proposed approach improves retrieval substantially, and yields new state-of-the-art results for code clone and defect detection. %U https://aclanthology.org/2022.coling-1.84 %P 1006-1013
Markdown (Informal)
[Addressing Leakage in Self-Supervised Contextualized Code Retrieval](https://aclanthology.org/2022.coling-1.84) (Villmow et al., COLING 2022)
- Addressing Leakage in Self-Supervised Contextualized Code Retrieval (Villmow et al., COLING 2022)
ACL
- Johannes Villmow, Viola Campos, Adrian Ulges, and Ulrich Schwanecke. 2022. Addressing Leakage in Self-Supervised Contextualized Code Retrieval. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1006–1013, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.