Yang Dong


2022

SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising
Kuan Xu | Yongbo Wang | Yongliang Wang | Zihao Wang | Zujie Wen | Yang Dong
Findings of the Association for Computational Linguistics: NAACL 2022

On the WikiSQL benchmark, most methods tackle the challenge of text-to-SQL with predefined sketch slots and build sophisticated sub-tasks to fill these slots. Though they achieve promising results, these methods suffer from over-complex model structures. In this paper, we present a simple yet effective approach that enables an auto-regressive sequence-to-sequence model to perform robust text-to-SQL generation. Instead of formulating the task of text-to-SQL as slot-filling, we propose to train a sequence-to-sequence model with Schema-aware Denoising (SeaD), which consists of two denoising objectives that train the model to either recover the input or predict the output from two novel noises, erosion and shuffle. These model-agnostic denoising objectives act as auxiliary tasks for structural data modeling during sequence-to-sequence generation. In addition, we propose a clause-sensitive execution-guided (EG) decoding strategy to overcome the limitations of EG decoding for generative models. The experiments show that the proposed method improves the performance of the sequence-to-sequence model in both schema linking and grammar correctness and establishes a new state of the art on the WikiSQL benchmark. Our work indicates that the capacity of sequence-to-sequence models for text-to-SQL may have been underestimated and could be enhanced by specialized denoising tasks.

2012

Spontaneous Speech Corpora for language learners of Spanish, Chinese and Japanese
Antonio Moreno-Sandoval | Leonardo Campillos Llanos | Yang Dong | Emi Takamori | José M. Guirao | Paula Gozalo | Chieko Kimura | Kengo Matsui | Marta Garrote-Salazar
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents a method for designing, compiling and annotating corpora intended for language learners. In particular, we focus on spoken corpora to be used as complementary material in the classroom as well as in examinations. We describe the three corpora (Spanish, Chinese and Japanese) compiled by the Laboratorio de Lingüística Informática at the Autonomous University of Madrid (LLI-UAM). A web-based concordance tool has been used to search for examples in the corpus; it provides the text along with the corresponding audio. Teaching materials from the corpus, consisting of the texts, the audio files and exercises on them, are currently under development.