Bigger Data or Fairer Data? Augmenting BERT via Active Sampling for Educational Text Classification

Lele Sha; Yuheng Li; Dragan Gasevic; Guanliang Chen

Abstract

Pretrained Language Models (PLMs), though popular, have been found to encode bias against protected groups in the representations they learn, which may harm the prediction fairness of downstream models. Given that such bias is believed to be related to the amount of demographic information carried in the learned representations, this study aimed to quantify the awareness that a PLM (i.e., BERT) has of people’s protected attributes and to augment BERT so that, by inhibiting this awareness, the prediction fairness of downstream models improves. Specifically, we developed a method that dynamically samples data to continue the pretraining of BERT, enabling it to generate representations that carry minimal demographic information and can be used directly as input to downstream models for fairer predictions. By experimenting on the task of classifying educational forum posts and measuring fairness between students of different gender or first-language backgrounds, we showed that, compared to a baseline without any additional pretraining, our method improved not only fairness (with a maximum improvement of 52.33%) but also accuracy (with a maximum improvement of 2.53%). Our method can be generalized to any PLM and any demographic attribute. All code used in this study is available at https://github.com/lsha49/FairBERT_deploy.

Anthology ID: 2022.coling-1.109
Volume: Proceedings of the 29th International Conference on Computational Linguistics
Month: October
Year: 2022
Address: Gyeongju, Republic of Korea
Editors: Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, Seung-Hoon Na
Venue: COLING
Publisher: International Committee on Computational Linguistics
Pages: 1275–1285
URL: https://aclanthology.org/2022.coling-1.109/
Cite (ACL): Lele Sha, Yuheng Li, Dragan Gasevic, and Guanliang Chen. 2022. Bigger Data or Fairer Data? Augmenting BERT via Active Sampling for Educational Text Classification. In Proceedings of the 29th International Conference on Computational Linguistics, pages 1275–1285, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal): Bigger Data or Fairer Data? Augmenting BERT via Active Sampling for Educational Text Classification (Sha et al., COLING 2022)
PDF: https://aclanthology.org/2022.coling-1.109.pdf
