2024
Sailor: Open Language Models for South-East Asia
Longxu Dou | Qian Liu | Guangtao Zeng | Jia Guo | Jiahui Zhou | Xin Mao | Ziqi Jin | Wei Lu | Min Lin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present Sailor, a family of open language models ranging from 0.5B to 14B parameters, tailored for South-East Asian (SEA) languages. Built on Qwen1.5, Sailor models are continually pre-trained on 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout to improve model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results on four typical tasks indicate that Sailor models perform strongly across benchmarks for commonsense reasoning, question answering, reading comprehension, and examination. We share our insights to spark wider interest in developing large language models for multilingual use cases.
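The BPE-dropout technique the abstract mentions is easy to illustrate. Below is a minimal sketch of the general method (Provilkov et al.'s BPE-dropout), not Sailor's actual tokenizer code: at each merge step, every applicable merge is skipped with probability p, so the model sees the same word under varying subword segmentations. The toy merge table is invented purely for illustration.

```python
import random

# Toy BPE merge table (pair -> merge rank); invented for illustration only.
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

def bpe_dropout_tokenize(word, p=0.1, rng=random):
    """Apply BPE merges, skipping each applicable merge with probability p.

    Randomly dropping merges exposes the model to alternative subword
    segmentations of the same word, which is the robustness effect the
    Sailor abstract refers to."""
    tokens = list(word)
    while True:
        # Collect applicable merges, dropping each with probability p.
        candidates = [
            (MERGES[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in MERGES and rng.random() >= p
        ]
        if not candidates:
            return tokens
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

print(bpe_dropout_tokenize("lower", p=0.0))  # deterministic: ['lower']
print(bpe_dropout_tokenize("lower", p=0.5))  # varies, e.g. ['lo', 'w', 'er']
```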
Improving Demonstration Diversity by Human-Free Fusing for Text-to-SQL
Dingzirui Wang | Longxu Dou | Xuanliang Zhang | Qingfu Zhu | Wanxiang Che
Findings of the Association for Computational Linguistics: EMNLP 2024
In-context learning with large language models (LLMs) is the current mainstream method for text-to-SQL. Previous studies have explored selecting relevant demonstrations from a human-labeled demonstration pool, but such pools lack diversity and incur high labeling costs. In this work, we address measuring and enhancing the diversity of the text-to-SQL demonstration pool. First, we introduce a diversity metric and show that the diversity of existing labeled data can be further enhanced. Motivated by these findings, we propose Fused, which iteratively fuses demonstrations to create a diverse demonstration pool, either on top of human labeling or entirely from scratch with LLMs, reducing labeling costs. Fused achieves an average improvement of 2.1% based on existing labeling and 5.5% from scratch on several mainstream datasets, demonstrating its effectiveness.
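The iterative fusing loop can be sketched in a few lines. This is a hypothetical outline under assumptions, not the paper's implementation: `llm_fuse` is a placeholder stand-in for the actual LLM prompting step, and the seed demonstrations are invented examples.

```python
import random

def llm_fuse(demo_a, demo_b):
    """Placeholder for the LLM prompting step that fuses two
    (question, SQL) demonstrations into a new one. A real
    implementation would prompt an LLM; here we only mark provenance."""
    return {"question": f"<fusion of: {demo_a['question']} / {demo_b['question']}>",
            "sql": "<LLM-generated SQL>"}

def fuse_pool(seed_demos, n_rounds=3, n_new_per_round=8, rng=random):
    """Iteratively grow a demonstration pool by fusing random pairs.

    Starting from human-labeled seeds (or LLM-generated ones when
    starting from scratch), each round samples pairs from the current
    pool and adds their fusions back, increasing diversity without
    extra human labeling."""
    pool = list(seed_demos)
    for _ in range(n_rounds):
        for _ in range(n_new_per_round):
            a, b = rng.sample(pool, 2)
            pool.append(llm_fuse(a, b))
    return pool

seeds = [{"question": "List all singers.", "sql": "SELECT name FROM singer"},
         {"question": "How many concerts were held in 2014?",
          "sql": "SELECT COUNT(*) FROM concert WHERE year = 2014"}]
pool = fuse_pool(seeds, n_rounds=2, n_new_per_round=4)
print(len(pool))  # 2 seeds + 8 fused demonstrations
```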
Enhancing Numerical Reasoning with the Guidance of Reliable Reasoning Processes
Dingzirui Wang | Longxu Dou | Xuanliang Zhang | Qingfu Zhu | Wanxiang Che
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Numerical reasoning is an essential ability for NLP systems to handle numeric information. Recent research indicates that fine-tuning a small-scale model to generate reasoning processes alongside answers can significantly enhance performance. However, most current methods generate the reasoning processes with large language models (LLMs), which makes them “unreliable”: the generated processes can contain information unrelated to the answer. To address this limitation, we introduce enhancing numerical reasoning with reliable processes (Encore), which derives a reliable reasoning process by decomposing the answer formula, ensuring that every step fully supports the answer. Nevertheless, models may lack sufficient data to learn reasoning-process generation adequately, since our method produces only a single reasoning process per formula. To overcome this difficulty, we present a series of pre-training tasks that help models learn reasoning-process generation from synthesized data. Experiments show that Encore yields an average improvement of 1.8% across all five experimental datasets, proving the effectiveness of our method.
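Deriving a reasoning process by decomposing an answer formula can be sketched with Python's ast module. This is a minimal illustration of the idea, not Encore's exact procedure; the step-naming format is invented. Because each step is read directly off the formula, every step supports the final answer by construction, which is the "reliable" property the abstract describes.

```python
import ast

def decompose_formula(formula):
    """Decompose an arithmetic answer formula into ordered reasoning
    steps, one per intermediate operation (requires Python 3.9+ for
    ast.unparse)."""
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
    steps, counter = [], 0

    def visit(node):
        nonlocal counter
        if isinstance(node, ast.BinOp):
            left, right = visit(node.left), visit(node.right)
            counter += 1
            name = f"step_{counter}"
            steps.append(f"{name} = {left} {ops[type(node.op)]} {right}")
            return name
        return ast.unparse(node)  # leaf: number or named quantity

    answer = visit(ast.parse(formula, mode="eval").body)
    steps.append(f"answer = {answer}")
    return steps

# e.g. a percentage-change question over a table: (120 - 80) / 80
for line in decompose_formula("(120 - 80) / 80"):
    print(line)
# step_1 = 120 - 80
# step_2 = step_1 / 80
# answer = step_2
```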
2022
Towards Knowledge-Intensive Text-to-SQL Semantic Parsing with Formulaic Knowledge
Longxu Dou | Yan Gao | Xuqi Liu | Mingyang Pan | Dingzirui Wang | Wanxiang Che | Dechen Zhan | Min-Yen Kan | Jian-Guang Lou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
In this paper, we study the problem of knowledge-intensive text-to-SQL, in which domain knowledge is necessary to parse expert questions into SQL queries over domain-specific tables. We formalize this scenario by building a new benchmark, KnowSQL, consisting of expert questions covering various domains. We then address this problem by representing formulaic knowledge rather than by annotating additional data examples. More concretely, we construct a formulaic knowledge bank as a domain knowledge base and propose a framework (ReGrouP) to leverage this formulaic knowledge during parsing. Experiments using ReGrouP demonstrate a significant 28.2% improvement overall on KnowSQL.
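A minimal sketch of how a formulaic knowledge bank might be consulted before parsing follows. The bank entries and the string-matching retrieval are invented for illustration and far simpler than ReGrouP's actual retrieval-and-grounding components.

```python
# Toy formulaic knowledge bank: domain term -> formula over table columns.
# Entries are invented for illustration; KnowSQL's bank is much larger.
FORMULA_BANK = {
    "gross profit": "revenue - cost_of_goods_sold",
    "profit margin": "(revenue - cost_of_goods_sold) / revenue",
}

def ground_question(question, bank=FORMULA_BANK):
    """Annotate domain terms in a question with their formulas so a
    downstream text-to-SQL parser sees column-level arithmetic
    instead of expert vocabulary."""
    grounded = question
    for term, formula in bank.items():
        if term in question.lower():
            grounded += f"  [knowledge: {term} = {formula}]"
    return grounded

q = "What was the gross profit of each product in 2021?"
print(ground_question(q))
# -> question annotated with: gross profit = revenue - cost_of_goods_sold
```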
2020
HIT-SCIR at MRP 2020: Transition-based Parser and Iterative Inference Parser
Longxu Dou | Yunlong Feng | Yuqiu Ji | Wanxiang Che | Ting Liu
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing
This paper describes our submission system (HIT-SCIR) for the CoNLL 2020 shared task: Cross-Framework and Cross-Lingual Meaning Representation Parsing. The task includes five frameworks for graph-based meaning representations, i.e., UCCA, EDS, PTG, AMR, and DRG. Our solution consists of two sub-systems: a transition-based parser for Flavor (1) frameworks (UCCA, EDS, PTG) and an iterative inference parser for Flavor (2) frameworks (DRG, AMR). In the final evaluation, our system ranked 3rd among the seven teams in both the Cross-Framework and Cross-Lingual Tracks, with macro-averaged MRP F1 scores of 0.81/0.69.
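For readers unfamiliar with transition-based parsing, the skeleton below shows the generic stack/buffer loop such parsers share. It is deliberately the simplest dependency-style (arc-standard) variant, not the paper's graph transition systems; the learned classifier is injected as a callback, and the usage example scripts it with a hand-written oracle.

```python
def parse_transitions(tokens, predict_action):
    """Generic transition-based parsing loop: a classifier (injected
    here as `predict_action`) repeatedly chooses actions over a stack
    and buffer until both are exhausted, accumulating (head, dependent)
    edges along the way."""
    stack, buffer, edges = [], list(tokens), []
    while buffer or len(stack) > 1:
        action = predict_action(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":       # second item gets top as head
            dep = stack.pop(-2)
            edges.append((stack[-1], dep))
        elif action == "RIGHT-ARC":      # top gets second item as head
            dep = stack.pop()
            edges.append((stack[-1], dep))
    return edges

# Scripted oracle for "cats sleep": SHIFT, SHIFT, LEFT-ARC.
actions = iter(["SHIFT", "SHIFT", "LEFT-ARC"])
print(parse_transitions(["cats", "sleep"], lambda s, b: next(actions)))
# -> [('sleep', 'cats')]
```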
2019
HIT-SCIR at MRP 2019: A Unified Pipeline for Meaning Representation Parsing via Efficient Training and Effective Encoding
Wanxiang Che | Longxu Dou | Yang Xu | Yuxuan Wang | Yijia Liu | Ting Liu
Proceedings of the Shared Task on Cross-Framework Meaning Representation Parsing at the 2019 Conference on Natural Language Learning
This paper describes our system (HIT-SCIR) for the CoNLL 2019 shared task: Cross-Framework Meaning Representation Parsing. We extended the basic transition-based parser with two improvements: a) efficient training, by parallelizing Stack LSTM training; and b) effective encoding, by adopting deep contextualized word embeddings from BERT. Overall, we propose a unified pipeline for meaning representation parsing, including framework-specific transition-based parsers, BERT-enhanced word representations, and post-processing. In the final evaluation, our system ranked first according to ALL-F1 (86.2%) and, in particular, ranked first on the UCCA framework (81.67%).
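The "effective encoding" side can be sketched with today's Hugging Face transformers API (the original 2019 system used a different software stack): encode a pre-tokenized sentence with BERT and pool subword vectors back to one vector per word, the form a transition-based parser consumes as input features. The subword-averaging choice is one common convention, assumed here for illustration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def contextual_word_embeddings(words):
    """Encode a pre-tokenized sentence with BERT and average the
    subword vectors belonging to each word, yielding one
    contextualized vector per input word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]    # (n_subwords, 768)
    vectors = []
    for i in range(len(words)):
        idx = [j for j, w in enumerate(enc.word_ids()) if w == i]
        vectors.append(hidden[idx].mean(dim=0))       # pool subword pieces
    return torch.stack(vectors)                       # (n_words, 768)

print(contextual_word_embeddings(["Parsing", "meaning", "graphs"]).shape)
# -> torch.Size([3, 768])
```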
2018
Data2Text Studio: Automated Text Generation from Structured Data
Longxu Dou | Guanghui Qin | Jinpeng Wang | Jin-Ge Yao | Chin-Yew Lin
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Data2Text Studio is a platform for automated text generation from structured data. It is equipped with a Semi-HMMs model that automatically extracts high-quality templates and corresponding trigger conditions from parallel data, which improves the interactivity and interpretability of the generated text. In addition, several easy-to-use tools are provided for developers to edit the templates of pre-trained models, and APIs are released so that developers can call the pre-trained models to generate text in third-party applications. We conduct experiments on the RotoWire dataset for template extraction and text generation, and the results show that our model achieves improvements on both tasks.
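What a template paired with a trigger condition looks like at generation time can be sketched as follows. The template strings and conditions are invented RotoWire-style examples, not ones extracted by the Semi-HMMs model; in the actual system both are learned from parallel data.

```python
# Each template carries a trigger condition deciding when it applies.
# Templates and conditions here are invented for illustration; the
# paper's Semi-HMMs model learns both from parallel data.
TEMPLATES = [
    (lambda r: r["pts"] >= 30,
     "{name} erupted for {pts} points in the win."),
    (lambda r: True,  # fallback template
     "{name} scored {pts} points."),
]

def realize(record, templates=TEMPLATES):
    """Pick the first template whose trigger condition fires and
    fill its slots with values from the structured record."""
    for condition, template in templates:
        if condition(record):
            return template.format(**record)

print(realize({"name": "LeBron James", "pts": 32}))
# -> LeBron James erupted for 32 points in the win.
print(realize({"name": "Kyle Korver", "pts": 11}))
# -> Kyle Korver scored 11 points.
```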