Handling drafty partial code remains a notable challenge in real-time code suggestion applications. Previous work has demonstrated shortcomings of large language models of code (CodeLLMs) in completing partial code with potential bugs. In this study, we view partial code as implementation hints and fine-tune CodeLLMs to jointly rewrite and complete partial code into functional full programs. We explore two strategies: one-pass generation and multi-pass iterative refinement. We construct new training and testing datasets using semantic-altering code transformations and iterative self-generations.We conduct comprehensive experiments over three representative open-sourced CodeLLMs – InCoder, CodeGen, and StarCoder.Results show that CodeLLMs fine-tuned using our approach achieve superior pass rates compared to the previous baselines across existing and newly-created benchmarks, effectively handle both potentially buggy and clean code, and largely preserve the integrity of the original partial implementations. We further present findings on the properties of the potential bugs we tested and on the design choices of our methods.
Supertagging is an essential task in Categorical grammar parsing and is crucial for dissecting sentence structures. Our research explores the capacity of Large Language Models (LLMs) in supertagging for both Combinatory Categorial Grammar (CCG) and Lambek Categorial Grammar (LCG). We also present a simple method that significantly boosts LLMs, enabling them to outperform LSTM and encoder-based models and achieve state-of-the-art performance. This advancement highlights LLMs’ potential in classification tasks, showcasing their adaptability beyond generative capabilities. Our findings demonstrate the evolving utility of LLMs in natural language processing, particularly in complex tasks like supertagging.
In this work, we introduce a generative model, PLC+, for generating Lambek Categorial Grammar(LCG) sequents. We also introduce a simple method to numerically estimate the model’s parameters from an annotated corpus. Then we compare our model with probabilistic context-free grammars (PCFGs) and show that PLC+ simultaneously assigns a higher probability to a common corpus, and has greater coverage.
In this paper, we define an abstract task called structural realization that generates words given a prefix of words and a partial representation of a parse tree. We also present a method for solving instances of this task using a Gated Graph Neural Network (GGNN). We evaluate it with standard accuracy measures, as well as with respect to perplexity, in which its comparison to previous work on language modelling serves to quantify the information added to a lexical selection task by the presence of syntactic knowledge. That the addition of parse-tree-internal nodes to this neural model should improve the model, with respect both to accuracy and to more conventional measures such as perplexity, may seem unsurprising, but previous attempts have not met with nearly as much success. We have also learned that transverse links through the parse tree compromise the model’s accuracy at generating adjectival and nominal parts of speech.
We approach the problem of generalizing pre-trained word embeddings beyond fixed-size vocabularies without using additional contextual information. We propose a subword-level word vector generation model that views words as bags of character n-grams. The model is simple, fast to train and provides good vectors for rare or unseen words. Experiments show that our model achieves state-of-the-art performances in English word similarity task and in joint prediction of part-of-speech tag and morphosyntactic attributes in 23 languages, suggesting our model’s ability in capturing the relationship between words’ textual representations and their embeddings.