Anaelia Ovalle


2024

pdf bib
Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies
Anaelia Ovalle | Ninareh Mehrabi | Palash Goyal | Jwala Dhamala | Kai-Wei Chang | Richard Zemel | Aram Galstyan | Yuval Pinter | Rahul Gupta
Findings of the Association for Computational Linguistics: NAACL 2024

Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLM), such as the inability to correctly use gender-diverse English neopronouns (e.g., xe, zir, fae). While data scarcity is a known culprit, the precise mechanisms through which scarcity affects this behavior remain underexplored. We discover LLM misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization, the tokenizer powering many popular LLMs. Unlike binary pronouns, BPE overfragments neopronouns, a direct consequence of data scarcity during tokenizer training. This disparate tokenization mirrors tokenizer limitations observed in multilingual and low-resource NLP, unlocking new misgendering mitigation strategies. We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency. Our proposed methods outperform finetuning with standard BPE, improving neopronoun accuracy from 14.1% to 58.4%. Our paper is the first to link LLM misgendering to tokenization and deficient neopronoun grammar, indicating that LLMs unable to correctly treat neopronouns as pronouns are more prone to misgender.

pdf bib
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Yang (Trista) Cao | Isabel Papadimitriou | Anaelia Ovalle | Marcos Zampieri | Francis Ferraro | Swabha Swayamdipta
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

pdf bib
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Anaelia Ovalle | Kai-Wei Chang | Yang Trista Cao | Ninareh Mehrabi | Jieyu Zhao | Aram Galstyan | Jwala Dhamala | Anoop Kumar | Rahul Gupta
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)

2023

pdf bib
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)
Anaelia Ovalle | Kai-Wei Chang | Ninareh Mehrabi | Yada Pruksachatkun | Aram Galystan | Jwala Dhamala | Apurv Verma | Trista Cao | Anoop Kumar | Rahul Gupta
Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)

2021

pdf bib
Harms of Gender Exclusivity and Challenges in Non-Binary Representation in Language Technologies
Sunipa Dev | Masoud Monajatipoor | Anaelia Ovalle | Arjun Subramonian | Jeff Phillips | Kai-Wei Chang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Gender is widely discussed in the context of language tasks and when examining the stereotypes propagated by language models. However, current discussions primarily treat gender as binary, which can perpetuate harms such as the cyclical erasure of non-binary gender identities. These harms are driven by model and dataset biases, which are consequences of the non-recognition and lack of understanding of non-binary genders in society. In this paper, we explain the complexity of gender and language around it, and survey non-binary persons to understand harms associated with the treatment of gender as binary in English language technologies. We also detail how current language representations (e.g., GloVe, BERT) capture and perpetuate these harms and related challenges that need to be acknowledged and addressed for representations to equitably encode gender information.