2024
Target-Aware Language Modeling via Granular Data Sampling
Ernie Chang | Pin-Jie Lin | Yang Li | Changsheng Zhao | Daeil Kim | Rastislav Rabatin | Zechun Liu | Yangyang Shi | Vikas Chandra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective and straightforward approach is sampling with low-dimensional data features, which allows selecting large-scale pretraining data for domain-specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the target downstream task performance *while preserving its effectiveness on other tasks*. This leads to the proposed data sampling paradigm where language models can be pretrained more efficiently on selected documents. On eight benchmarks, we demonstrate that with ~1% of the data, pretrained models perform on par with models trained on the full RefinedWeb data and outperform those trained on randomly selected samples for model sizes ranging from 125M to 1.5B.
Scaling Parameter-Constrained Language Models with Quality Data
Ernie Chang | Matteo Paltenghi | Yang Li | Pin-Jie Lin | Changsheng Zhao | Patrick Huber | Zechun Liu | Rastislav Rabatin | Yangyang Shi | Vikas Chandra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional understanding of scaling laws by offering a microscopic view of data quality within the original formulation – effective training tokens – which we posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term of effective training tokens to be a combination of two readily-computed indicators of text: (i) text diversity and (ii) syntheticity as measured by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated the constants that relate text quality, model size, training tokens, and eight reasoning task accuracy scores. We demonstrated that the estimated constants yield +0.83 Pearson correlation with true accuracies, and analyzed them in scenarios involving widely-used data techniques such as data sampling and synthesis, which aim to improve data quality.
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Zechun Liu | Barlas Oguz | Changsheng Zhao | Ernie Chang | Pierre Stock | Yashar Mehdad | Yangyang Shi | Raghuraman Krishnamoorthi | Vikas Chandra
Findings of the Association for Computational Linguistics: ACL 2024
Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8 bits. We find that these methods break down at lower bit precision, and investigate quantization-aware training for LLMs (LLM-QAT) to push quantization levels even further. We propose a data-free distillation method that leverages generations produced by the pre-trained model, which better preserves the original output distribution and allows quantizing any generative model independent of its training data, similar to post-training quantization methods. In addition to quantizing weights and activations, we also quantize the KV cache, which is critical for increasing throughput and supporting long sequence dependencies at current model sizes. We experiment with LLaMA models of sizes 7B, 13B, and 30B, at quantization levels down to 4 bits. We observe large improvements over training-free methods, especially in the low-bit settings.
2023
Binary and Ternary Natural Language Generation
Zechun Liu | Barlas Oguz | Aasish Pappu | Yangyang Shi | Raghuraman Krishnamoorthi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ternary and binary neural networks enable multiplication-free computation and promise multiple orders of magnitude efficiency gains over full-precision networks if implemented on specialized hardware. However, since both the parameter and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for the class of transformer text generation models due to the sensitivity of the attention operation to quantization and the noise-compounding effects of autoregressive decoding in the high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization of the activations and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while being 16x more efficient. Our binary model, while less accurate, achieves a highly non-trivial score of 35.6. For machine translation, we achieved BLEU scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full precision mBART model score of 26.8. We also compare our approach in the 8-bit activation setting, where our ternary and even binary weight models can match or outperform the best existing 8-bit weight models in the literature. Our code and models are available at: https://github.com/facebookresearch/Ternary_Binary_Transformer.
Towards Zero-Shot Multilingual Transfer for Code-Switched Responses
Ting-Wei Wu | Changsheng Zhao | Ernie Chang | Yangyang Shi | Pierce Chuang | Vikas Chandra | Biing Juang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent task-oriented dialog systems have had great success in building English-based personal assistants, but extending these systems to a global audience is challenging due to the need for annotated data in the target language. An alternative approach is to leverage existing data in a high-resource language to enable cross-lingual transfer in low-resource language models. However, this type of transfer has not been widely explored in natural language response generation. In this research, we investigate the use of state-of-the-art multilingual models such as mBART and T5 to facilitate zero-shot and few-shot transfer of code-switched responses. We propose a new adapter-based framework that allows for efficient transfer by learning task-specific representations and encapsulating source and target language representations. Our framework is able to successfully transfer language knowledge even when the target language corpus is limited. We present both quantitative and qualitative analyses to evaluate the effectiveness of our approach.
Revisiting Sample Size Determination in Natural Language Understanding
Ernie Chang | Muhammad Hassan Rashid | Pin-Jie Lin | Changsheng Zhao | Vera Demberg | Yangyang Shi | Vikas Chandra
Findings of the Association for Computational Linguistics: ACL 2023
Knowing exactly how many data points need to be labeled to achieve a certain model performance is a hugely beneficial step towards reducing the overall budgets for annotation. It pertains to both active learning and traditional data annotation, and is particularly beneficial for low-resource scenarios. Nevertheless, it remains a largely under-explored area of research in NLP. We therefore explored various techniques for estimating the training sample size necessary to achieve a targeted performance value. We derived a simple yet effective approach to predict the maximum achievable model performance based on a small amount of training samples, which serves as an early indicator during data annotation for data quality and sample size determination. We performed ablation studies on four language understanding tasks, and showed that the proposed approach allows us to forecast model performance within a small margin of mean absolute error (~0.9%) with only 10% of the data.
2016
Recurrent Support Vector Machines For Slot Tagging In Spoken Language Understanding
Yangyang Shi | Kaisheng Yao | Hu Chen | Dong Yu | Yi-Cheng Pan | Mei-Yuh Hwang
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Deep LSTM based Feature Mapping for Query Classification
Yangyang Shi | Kaisheng Yao | Le Tian | Daxin Jiang
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies