2023
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen | Le Nam | Anh Dau | Anh Nguyen | Khanh Nghiem | Jin Guo | Nghi Bui
Findings of the Association for Computational Linguistics: EMNLP 2023
We present The Vault, an open-source dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We propose extraction methods that use both rules and deep learning to ensure that the samples contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. We thoroughly evaluated this dataset and found that when used to train common code language models (such as CodeT5, CodeBERT, and CodeGen), it outperforms the same models trained on other datasets such as CodeSearchNet. These evaluations covered common coding tasks such as code generation, code summarization, and code search. Researchers and practitioners can use The Vault to train a wide range of large language models that understand code, or apply our data-cleaning methods and scripts to improve their own datasets. We anticipate that training large language models on The Vault will improve their ability to understand and generate code, propelling AI research and software development forward. We are releasing our source code and a framework to make it easier for others to replicate our results.
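The abstract describes a two-stage extraction pipeline that combines rule-based filtering with a learned quality check. The sketch below illustrates that general pattern in Python; it is not The Vault's actual pipeline, and every name in it (Sample, rule_filters, quality_score, clean, the lexical-overlap scorer, the 0.2 threshold) is a hypothetical stand-in for the paper's rules and deep-learning scorer.

# A minimal sketch (not The Vault's actual pipeline) of combining
# rule-based filters with a learned quality scorer to keep only
# high-quality code-text pairs. All names here are illustrative.
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    code: str
    docstring: str

def rule_filters() -> List[Callable[[Sample], bool]]:
    """Cheap heuristic checks applied before any model-based scoring."""
    return [
        # Docstring must be long enough to be informative.
        lambda s: len(s.docstring.split()) >= 3,
        # Drop auto-generated boilerplate comments.
        lambda s: "auto-generated" not in s.docstring.lower(),
        # Drop docstrings that are nothing but a URL.
        lambda s: not re.fullmatch(r"\s*https?://\S+\s*", s.docstring),
        # Code body should be non-trivial.
        lambda s: len(s.code.splitlines()) >= 2,
    ]

def quality_score(sample: Sample) -> float:
    """Placeholder for a learned code-text alignment model; here a
    toy lexical-overlap proxy stands in for the deep-learning scorer."""
    code_tokens = set(re.findall(r"[a-zA-Z_]+", sample.code.lower()))
    doc_tokens = set(re.findall(r"[a-zA-Z_]+", sample.docstring.lower()))
    return len(code_tokens & doc_tokens) / max(len(doc_tokens), 1)

def clean(samples: List[Sample], threshold: float = 0.2) -> List[Sample]:
    """Apply rule filters first, then keep pairs the scorer accepts."""
    rules = rule_filters()
    kept = [s for s in samples if all(rule(s) for rule in rules)]
    return [s for s in kept if quality_score(s) >= threshold]

if __name__ == "__main__":
    raw = [
        Sample("def add(a, b):\n    return a + b", "Return the sum of a and b."),
        Sample("def f(x):\n    pass", "auto-generated stub"),
    ]
    print(len(clean(raw)))  # -> 1: only the well-documented pair survives

Running the filters before the scorer is the usual design choice for corpora of this scale: the cheap rules discard most noise so the expensive model only scores plausible candidates.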
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen Manh | Nam Le Hai | Anh T. V. Dau | Anh Minh Nguyen | Khanh Nghiem | Jin Guo | Nghi D. Q. Bui
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
2022
Continuous Temporal Graph Networks for Event-Based Graph Data
Jin Guo | Zhen Han | Su Zhou | Jiliang Li | Volker Tresp | Yuyi Wang
Proceedings of the 2nd Workshop on Deep Learning on Graphs for Natural Language Processing (DLG4NLP 2022)
There has been increasing interest in modeling the continuous-time dynamics of temporal graph data. Previous methods encode time-evolving relational information into a low-dimensional representation by specifying discrete layers of neural networks, while real-world dynamic graphs often vary continuously over time. Hence, we propose Continuous Temporal Graph Networks (CTGNs) to capture the continuous dynamics of temporal graph data. We use both link starting timestamps and link durations as evolving information to model the continuous dynamics of nodes. The key idea is to use neural ordinary differential equations (ODEs) to characterize the continuous dynamics of node representations over dynamic graphs. We parameterize the ordinary differential equations using a novel graph neural network; existing dynamic graph networks can be viewed as specific discretizations of CTGNs. Experimental results on both transductive and inductive tasks demonstrate the effectiveness of our proposed approach over competitive baselines.
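The abstract's key idea is to let node embeddings evolve between events according to a learned ODE. The sketch below shows that pattern with an autonomous system dh/dt = f_theta(h), using a plain PyTorch MLP over a row-normalized adjacency as a stand-in for the paper's GNN parameterization and a fixed-step Euler loop as a stand-in for a proper ODE solver; all names and choices here are illustrative assumptions, not the CTGN implementation.

# A minimal sketch of the neural-ODE-over-node-embeddings idea
# (our assumptions, not the paper's architecture or solver).
import torch
import torch.nn as nn

class ODEFunc(nn.Module):
    """Parameterizes dh/dt for all node embeddings jointly. CTGN uses
    a graph neural network here; this sketch mixes neighbor states
    with a row-normalized adjacency followed by an MLP."""

    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))
        # Row-normalized adjacency: a simple stand-in for message passing.
        self.adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1.0)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # dh/dt depends on each node's neighborhood-mixed state.
        return self.net(self.adj @ h)

def integrate(func: ODEFunc, h0: torch.Tensor, t0: float, t1: float, steps: int = 20):
    """Fixed-step Euler integration of node states from time t0 to t1."""
    h, dt = h0, (t1 - t0) / steps
    for _ in range(steps):
        h = h + dt * func(h)
    return h

if __name__ == "__main__":
    num_nodes, dim = 5, 8
    adj = (torch.rand(num_nodes, num_nodes) > 0.5).float()
    func = ODEFunc(dim, adj)
    h0 = torch.randn(num_nodes, dim)
    # Evolve embeddings continuously between two event timestamps.
    h1 = integrate(func, h0, t0=0.0, t1=1.5)
    print(h1.shape)  # torch.Size([5, 8])

Shrinking the Euler step recovers ever-finer discretizations, which is the sense in which the abstract describes existing dynamic graph networks as specific discretizations of CTGNs.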
1998
Proceedings of the 12th Pacific Asia Conference on Language, Information and Computation
Jin Guo | Kim Teng Lua | Jie Xu
Proceedings of the 12th Pacific Asia Conference on Language, Information and Computation
One Tokenization per Source
Jin Guo
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1
One Tokenization per Source
Jin Guo
COLING 1998 Volume 1: The 17th International Conference on Computational Linguistics
1997
Longest Tokenization
Jin Guo
International Journal of Computational Linguistics & Chinese Language Processing, Volume 2, Number 2, August 1997
Critical Tokenization and its Properties
Jin Guo
Computational Linguistics, Volume 23, Number 4, December 1997
1996
Context-Centered Template Matching for Chinese Lexicon Construction
Jin Guo
Proceedings of Rocling IX Computational Linguistics Conference IX