2024
Re-Evaluating Evaluation for Multilingual Summarization
Jessica Zosa Forde | Ruochen Zhang | Lintang Sutawika | Alham Fikri Aji | Samuel Cahyawijaya | Genta Indra Winata | Minghao Wu | Carsten Eickhoff | Stella Biderman | Ellie Pavlick
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Automatic evaluation approaches (ROUGE, BERTScore, LLM-based evaluators) have been widely used to evaluate summarization tasks. Despite the complexities of script differences and tokenization, these approaches have been indiscriminately applied to summarization across multiple languages. While previous works have argued that these approaches correlate strongly with human ratings in English, it remains unclear whether the conclusion holds for other languages. To answer this question, we construct a small-scale pilot dataset containing article-summary pairs and human ratings in English, Chinese and Indonesian. To measure the strength of summaries, our ratings are measured as head-to-head comparisons with resulting Elo scores across four dimensions. Our analysis reveals that standard metrics are unreliable measures of quality, and that these problems are exacerbated in Chinese and Indonesian. We advocate for more nuanced and careful considerations in designing a robust evaluation framework for multiple languages.
2023
BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting
Zheng Xin Yong | Hailey Schoelkopf | Niklas Muennighoff | Alham Fikri Aji | David Ifeoluwa Adelani | Khalid Almubarak | M Saiful Bari | Lintang Sutawika | Jungo Kasai | Ahmed Baruwa | Genta Winata | Stella Biderman | Edward Raff | Dragomir Radev | Vassilina Nikoulina
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The BLOOM model is a large publicly available multilingual language model, but its pretraining was limited to 46 languages. To extend the benefits of BLOOM to other languages without incurring prohibitively large costs, it is desirable to adapt BLOOM to new languages not seen during pretraining. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages in a resource-constrained setting. We find language adaptation to be effective at improving zero-shot performance in new languages. Surprisingly, we find that adapter-based finetuning is more effective than continued pretraining for large models. In addition, we discover that prompting performance is not significantly affected by language specifics, such as the writing system. It is primarily determined by the size of the language adaptation data. We also add new languages to BLOOMZ, which is a multitask finetuned version of BLOOM capable of following task instructions zero-shot. We find including a new language in the multitask fine-tuning mixture to be the most effective method to teach BLOOMZ a new language. We conclude that with sufficient training data language adaptation can generalize well to diverse languages. Our code is available at https://github.com/bigscience-workshop/multilingual-modeling.
Crosslingual Generalization through Multitask Finetuning
Niklas Muennighoff | Thomas Wang | Lintang Sutawika | Adam Roberts | Stella Biderman | Teven Le Scao | M Saiful Bari | Sheng Shen | Zheng Xin Yong | Hailey Schoelkopf | Xiangru Tang | Dragomir Radev | Alham Fikri Aji | Khalid Almubarak | Samuel Albanie | Zaid Alyafeai | Albert Webson | Edward Raff | Colin Raffel
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are freely available at https://github.com/bigscience-workshop/xmtf.
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Zheng Xin Yong | Ruochen Zhang | Jessica Forde | Skyler Wang | Arjun Subramonian | Holy Lovenia | Samuel Cahyawijaya | Genta Winata | Lintang Sutawika | Jan Christian Blaise Cruz | Yin Lin Tan | Long Phan | Rowena Garcia | Thamar Solorio | Alham Fikri Aji
Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching
While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for the English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.
Current Status of NLP in South East Asia with Insights from Multilingualism and Language Diversity
Alham Fikri Aji | Jessica Zosa Forde | Alyssa Marie Loo | Lintang Sutawika | Skyler Wang | Genta Indra Winata | Zheng-Xin Yong | Ruochen Zhang | A. Seza Doğruöz | Yin Lin Tan | Jan Christian Blaise Cruz
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract
Utilizing Weak Supervision to Generate Indonesian Conservation Datasets
Mega Fransiska | Diah Pitaloka | Saripudin Saripudin | Satrio Putra | Lintang Sutawika
Proceedings of the First Workshop in South East Asian Language Processing
2022
What Language Model to Train if You Have One Million GPU Hours?
Teven Le Scao | Thomas Wang | Daniel Hesslow | Stas Bekman | M Saiful Bari | Stella Biderman | Hady Elsahar | Niklas Muennighoff | Jason Phang | Ofir Press | Colin Raffel | Victor Sanh | Sheng Shen | Lintang Sutawika | Jaesung Tae | Zheng Xin Yong | Julien Launay | Iz Beltagy
Findings of the Association for Computational Linguistics: EMNLP 2022
The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scale, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale alone. In the process of building BLOOM (the Big Science Large Open-science Open-access Multilingual language model), our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hours budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience.
Samsung Research Philippines - Datasaur AI’s Submission for the WMT22 Large Scale Multilingual Translation Task
Jan Christian Blaise Cruz | Lintang Sutawika
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper describes the submission of the joint Samsung Research Philippines - Datasaur AI team for the WMT22 Large Scale Multilingual African Translation shared task. We approach the contest as a way to explore task composition as a solution for low-resource multilingual translation, using adapter fusion to combine multiple task adapters that learn subsets of the total translation pairs. Our final model shows performance improvements in 32 out of the 44 translation directions that we participate in when compared to a single model system trained on multiple directions at once.
Towards better structured and less noisy Web data: Oscar with Register annotations
Veronika Laippala | Anna Salmela | Samuel Rönnqvist | Alham Fikri Aji | Li-Hsin Chang | Asma Dhifallah | Larissa Goulart | Henna Kortelainen | Marc Pàmies | Deise Prina Dutra | Valtteri Skantsi | Lintang Sutawika | Sampo Pyysalo
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, that is, determining whether the texts are, e.g., forum discussions, lyrical, or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files originating from Web crawls from remnants of boilerplate and other elements not belonging to the main text of the Web page. The register-labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.
2021
Data Processing Matters: SRPH-Konvergen AI’s Machine Translation System for WMT’21
Lintang Sutawika | Jan Christian Blaise Cruz
Proceedings of the Sixth Conference on Machine Translation
In this paper, we describe the submission of the joint Samsung Research Philippines-Konvergen AI team for the WMT’21 Large Scale Multilingual Translation Task - Small Track 2. We submit a standard Seq2Seq Transformer model to the shared task without any training or architecture tricks, relying mainly on the strength of our data preprocessing techniques to boost performance. Our final submission model scored 22.92 average BLEU on the FLORES-101 devtest set, and scored 22.97 average BLEU on the contest’s hidden test set, ranking us sixth overall. Despite using only a standard Transformer, our model ranked first in Indonesian to Javanese, showing that data preprocessing matters as much as, if not more than, cutting-edge model architectures and training techniques.