Johannes Villmow
2022
Addressing Leakage in Self-Supervised Contextualized Code Retrieval
Johannes Villmow | Viola Campos | Adrian Ulges | Ulrich Schwanecke
Proceedings of the 29th International Conference on Computational Linguistics
We address contextualized code retrieval, the search for code snippets that help fill gaps in a partial input program. Our approach enables large-scale self-supervised contrastive training by splitting source code randomly into contexts and targets. To combat leakage between the two, we suggest a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for direct evaluation of contextualized code retrieval, based on a dataset of manually aligned subpassages of code clones. Our experiments demonstrate that the proposed approach improves retrieval substantially and yields new state-of-the-art results for code clone and defect detection.
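The context/target split described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `<?>` gap marker, the `<VARn>` mask tokens, and the regex-based identifier detection are all assumptions made for illustration, and the selection of syntax-aligned targets is left out.

```python
import keyword
import re
import textwrap

# Hypothetical helpers; names and token formats are illustrative assumptions.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
GAP_MARKER = "<?>"  # contains no identifier characters, so masking never touches it


def split_context_target(source, start, end):
    """Cut lines [start, end) out as the target; the rest becomes the context."""
    lines = source.splitlines()
    target = "\n".join(lines[start:end])
    context = "\n".join(lines[:start] + [GAP_MARKER] + lines[end:])
    # Dedent the target so its indentation depth does not reveal where it came from.
    return context, textwrap.dedent(target)


def shared_identifiers(context, target):
    """Identifiers (excluding Python keywords) that occur on both sides."""
    def idents(text):
        return {t for t in IDENT.findall(text) if not keyword.iskeyword(t)}
    return idents(context) & idents(target)


def mask_identifiers(text, names):
    """Replace every shared identifier with a positional placeholder token."""
    mapping = {name: f"<VAR{i}>" for i, name in enumerate(sorted(names))}
    return IDENT.sub(lambda m: mapping.get(m.group(0), m.group(0)), text)
```

Calling `split_context_target` on a small function, then `shared_identifiers` on the two halves, and finally `mask_identifiers` on either half removes the surface overlap that a retrieval model could otherwise exploit as a shortcut.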
2021
ConTest: A Unit Test Completion Benchmark featuring Context
Johannes Villmow | Jonas Depoix | Adrian Ulges
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)
We introduce ConTest, a benchmark for NLP-based unit test completion, the task of predicting a test’s assert statements given its setup and focal method, i.e., the method to be tested. ConTest is large-scale (with 365k datapoints). Besides the test code and tested code, it also features context code called by either. We found context to be crucial for accurately predicting assertions. We also introduce baselines based on transformer encoder-decoders, and study the effects of including syntactic information and context. Overall, our models achieve a BLEU score of 38.2, while generating unparsable code in only 1.92% of cases.
2020
Relation Specific Transformations for Open World Knowledge Graph Completion
Haseeb Shah | Johannes Villmow | Adrian Ulges
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
We propose an open-world knowledge graph completion model that can be combined with common closed-world approaches (such as ComplEx) and enhance them to exploit text-based representations for entities unseen in training. Our model learns relation-specific transformation functions from text-based to graph-based embedding space, where the closed-world link prediction model can be applied. We demonstrate state-of-the-art results on common open-world benchmarks and show that our approach benefits from relation-specific transformation functions (RST), giving substantial improvements over a relation-agnostic approach.
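The idea of mapping a text-based entity embedding into the graph embedding space with a relation-specific function, then scoring it with a closed-world model, can be sketched as follows. This is a hedged illustration in pure Python, not the paper's model: the affine form of the transformation, the function names, and the `transforms` dictionary layout are assumptions. Only the ComplEx scoring function itself, Re(<h, r, conj(t)>), is the standard definition.

```python
def affine(matrix, bias, vector):
    """Apply y = A x + b, with A given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(matrix, bias)
    ]


def complex_score(head, relation, tail):
    """ComplEx triple score: real part of sum_i h_i * r_i * conj(t_i)."""
    return sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail)).real


def score_unseen_head(text_embedding, relation_name, relation_emb, tail_emb, transforms):
    """Score a triple whose head entity is known only through its text embedding.

    `transforms` maps a relation name to its own (matrix, bias) pair; this
    per-relation lookup is what makes the transformation relation-specific.
    """
    matrix, bias = transforms[relation_name]
    head = affine(matrix, bias, text_embedding)
    return complex_score(head, relation_emb, tail_emb)
```

A relation-agnostic variant would use a single `(matrix, bias)` pair for all relations instead of the per-relation lookup.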