2024
Language Bias in Multilingual Information Retrieval: The Nature of the Beast and Mitigation Methods
Jinrui Yang, Fan Jiang, Timothy Baldwin
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Language fairness in multilingual information retrieval (MLIR) systems is crucial for ensuring equitable access to information across diverse languages. This paper sheds light on the issue, based on the assumption that queries in different languages, but with identical semantics, should yield equivalent ranking lists when retrieving from the same multilingual document collection. We evaluate the degree of fairness using both traditional retrieval methods and a DPR neural ranker based on mBERT and XLM-R. Additionally, we introduce ‘LaKDA’, a novel loss designed to mitigate language biases in neural MLIR approaches. Our analysis exposes intrinsic language biases in current MLIR technologies, with notable disparities across the retrieval methods, and demonstrates the effectiveness of LaKDA in enhancing language fairness.
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Michael Hahn, Alexey Sorokin, Ritesh Kumar, Andreas Shcherbakov, Yulia Otmakhova, Jinrui Yang, Oleg Serikov, Priya Rani, Edoardo M. Ponti, Saliha Muradoğlu, Rena Gao, Ryan Cotterell, Ekaterina Vylomova (Eds.)
2023
Multi-EuP: The Multilingual European Parliament Dataset for Analysis of Bias in Information Retrieval
Jinrui Yang, Timothy Baldwin, Trevor Cohn
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)
2022
Towards Open-Domain Topic Classification
Hantian Ding, Jinrui Yang, Yuqian Deng, Hongming Zhang, Dan Roth
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations
We introduce an open-domain topic classification system that accepts a user-defined taxonomy in real time. Users are able to classify a text snippet with respect to any candidate labels they want, and receive an instant response from our web interface. To obtain such flexibility, we build the backend model in a zero-shot way. By training on a new dataset constructed from Wikipedia, our label-aware text classifier can effectively utilize implicit knowledge in the pretrained language model to handle labels it has never seen before. We evaluate our model across four datasets from various domains with different label sets. Experiments show that the model significantly improves over existing zero-shot baselines in open-domain scenarios, and performs competitively with weakly-supervised models trained on in-domain data.
Professional Presentation and Projected Power: A Case Study of Implicit Gender Information in English CVs
Jinrui Yang, Sheilla Njoto, Marc Cheong, Leah Ruppanner, Lea Frermann
Proceedings of the Fifth Workshop on Natural Language Processing and Computational Social Science (NLP+CSS)
Gender discrimination in hiring is a pertinent and persistent bias in society, and a common motivating example for exploring bias in NLP. However, the manifestation of gendered language in application materials has received limited attention. This paper investigates the framing of skills and background in CVs of self-identified men and women. We introduce a data set of 1.8K authentic, English-language CVs from the US, covering 16 occupations, allowing us to partially control for the confound of occupation-specific gender base rates. We find that (1) women use more verbs evoking impressions of low power; and (2) classifiers capture gender signal even after data balancing and removal of pronouns and named entities, and this holds for both transformer-based and linear classifiers.