Guojun Wu
2024
Evaluating Automatic Metrics with Incremental Machine Translation Systems
Guojun Wu | Shay B Cohen | Rico Sennrich
Findings of the Association for Computational Linguistics: EMNLP 2024
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions. Since human A/B testing is commonly used, we assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations. Our study not only confirms several prior findings, such as the advantage of neural metrics over non-neural ones, but also explores the debated issue of how MT quality affects metric reliability—an investigation that smaller datasets in previous research could not sufficiently explore. Overall, our research demonstrates the dataset’s value as a testbed for metric evaluation. We release our code.
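The evaluation idea in this abstract (a metric is judged by how often it prefers the more recent translation) can be sketched as a simple pairwise accuracy. This is a minimal illustration only: the abstract does not specify the paper's exact protocol, and the function name `recency_preference_accuracy` and the scores below are hypothetical.

```python
from typing import Sequence


def recency_preference_accuracy(
    older_scores: Sequence[float], newer_scores: Sequence[float]
) -> float:
    """Fraction of pairs in which the metric scores the newer translation strictly higher.

    Assumption: each index pairs an older and a newer translation of the same
    source segment, and a higher metric score means better quality.
    """
    if len(older_scores) != len(newer_scores):
        raise ValueError("score lists must have the same length")
    if not older_scores:
        raise ValueError("at least one pair is required")
    wins = sum(1 for old, new in zip(older_scores, newer_scores) if new > old)
    return wins / len(older_scores)


# Made-up metric scores for four segments (illustrative, not from the dataset).
older = [0.61, 0.70, 0.55, 0.80]
newer = [0.66, 0.68, 0.59, 0.85]
accuracy = recency_preference_accuracy(older, newer)
print(accuracy)  # 0.75: the metric prefers the newer translation in 3 of 4 pairs
```

Under the stated assumption that newer systems are better, a metric with higher accuracy of this kind agrees more often with the presumed quality ordering.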
Investigating Ableism in LLMs through Multi-turn Conversation
Guojun Wu | Sarah Ebling
Proceedings of the Third Workshop on NLP for Positive Impact
To reveal ableism (i.e., bias against persons with disabilities) in large language models (LLMs), we introduce a novel approach involving multi-turn conversations, enabling a comparative assessment. Initially, we prompt the LLM to elaborate short biographies, followed by a request to incorporate information about a disability. Finally, we employ several methods to identify the top words that distinguish the disability-integrated biographies from those without. This comparative setting helps us uncover how LLMs handle disability-related information and reveal underlying biases. We observe that LLMs tend to highlight disabilities in a manner that can be perceived as patronizing or as implying that overcoming challenges is unexpected due to the disability.
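The comparative step in this abstract (finding words that distinguish the disability-integrated biographies from the originals) can be sketched with a smoothed log-odds ratio. The abstract only says "several methods", so this particular scoring is an assumption, and the function name `top_distinguishing_words` and the toy biographies are invented for illustration.

```python
import math
from collections import Counter
from typing import List, Sequence


def top_distinguishing_words(
    with_texts: Sequence[str],
    without_texts: Sequence[str],
    k: int = 3,
    alpha: float = 1.0,
) -> List[str]:
    """Return the k words most over-represented in with_texts.

    Words are scored by the difference of add-alpha smoothed log-probabilities
    between the two collections. Tokenization is plain whitespace splitting.
    """
    counts_with = Counter(w for t in with_texts for w in t.lower().split())
    counts_without = Counter(w for t in without_texts for w in t.lower().split())
    vocab = set(counts_with) | set(counts_without)
    total_with = sum(counts_with.values()) + alpha * len(vocab)
    total_without = sum(counts_without.values()) + alpha * len(vocab)
    scores = {
        w: math.log((counts_with[w] + alpha) / total_with)
        - math.log((counts_without[w] + alpha) / total_without)
        for w in vocab
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(vocab, key=lambda w: (-scores[w], w))[:k]


# Toy biographies (hypothetical, not model outputs from the study).
without_bio = ["she is a teacher who loves music"]
with_bio = ["she is a teacher who loves music despite her blindness"]
top_words = top_distinguishing_words(with_bio, without_bio, k=3)
print(top_words)
```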
An Eye Opener Regarding Task-Based Text Gradient Saliency
Guojun Wu | Lena Bolliger | David Reich | Lena Jäger
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Eye movements in reading reveal humans’ cognitive processes involved in language understanding. The duration for which a reader’s eyes fixate on a word has been used as a measure of the visual attention given to that word or its significance to the reader. This study investigates the correlation between the importance attributed to input tokens by language models (LMs) on the one hand and by humans, in the form of fixation durations, on the other. While previous research on the internal processes of LMs has employed the models’ attention weights, recent studies have argued in favor of gradient-based methods. Moreover, previous approaches to interpreting LMs’ internals with human gaze have neglected the tasks readers performed during reading, even though psycholinguistic research underlines that reading patterns are task-dependent. We therefore employ a gradient-based saliency method to measure the importance of input tokens when LMs are targeted on specific tasks, and we find that task specificity plays a crucial role in the correlation between human- and model-assigned importance. Our implementation is available at https://github.com/gjwubyron/Scan.
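The comparison described in this abstract, between model-assigned token importance and human fixation durations, can be illustrated with a rank correlation. The abstract does not name the correlation measure used, so Spearman's coefficient is an assumption here, and the saliency and fixation values are made up. This sketch is not taken from the linked repository.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-token values for a five-token sentence.
saliency = [0.10, 0.40, 0.05, 0.30, 0.15]   # model-assigned importance
fixation_ms = [180, 250, 0, 320, 210]       # human fixation duration
rho = spearman(saliency, fixation_ms)
print(round(rho, 3))
```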
2023
ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding
Guojun Wu
Findings of the Association for Computational Linguistics: EMNLP 2023
Most multilingual vision-and-language (V&L) research aims to achieve multilingual and multimodal capabilities within one model. However, the scarcity of multilingual captions for images has hindered this development. To overcome this obstacle, we propose ICU, Image Caption Understanding, which divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM), in turn, takes the caption as the alt text and performs cross-lingual language understanding. The burden of multilingual processing is lifted off the V&L model and placed on the mLM. Since multilingual text data is comparatively more abundant and of higher quality, ICU can help V&L models overcome language barriers. In experiments on two tasks across 9 languages in the IGLUE benchmark, we show that ICU can achieve new state-of-the-art results for five languages, and comparable results for the rest.