Multilingual sentence-level bias detection in Wikipedia

Desislava Aleksandrova, François Lareau, Pierre André Ménard


Abstract
We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia’s neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.
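The extraction step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (that lives in the linked repository); it assumes that each revision is available as a wikitext string in chronological order, that a revision counts as "tagged" when it contains a `{{POV}}` or `{{NPOV}}` template, and that a naive punctuation-based sentence splitter is good enough for illustration. The function names `find_snapshot_pair` and `removed_or_rewritten` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: a revision is "tagged" if it contains a {{POV}} or {{NPOV}}
# template. Real Wikipedias use many language-specific tag names.
TAG_RE = re.compile(r"\{\{\s*N?POV\b[^}]*\}\}", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: strip tags, then break after ., ! or ? plus whitespace."""
    cleaned = TAG_RE.sub("", text).strip()
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", cleaned) if s.strip()]


def find_snapshot_pair(revisions):
    """Return (last tagged, first untagged) revision texts, or None.

    `revisions` is a chronological list of wikitext strings. Only the first
    tagged-to-untagged transition is returned in this sketch.
    """
    for before, after in zip(revisions, revisions[1:]):
        if TAG_RE.search(before) and not TAG_RE.search(after):
            return before, after
    return None


def removed_or_rewritten(before, after):
    """Sentences of `before` that were deleted or replaced in `after`."""
    old, new = split_sentences(before), split_sentences(after)
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            changed.extend(old[i1:i2])
    return changed


if __name__ == "__main__":
    # Toy revision history, invented for illustration only.
    history = [
        "The city was founded in 1850. It has a museum.",
        "{{POV}} The city was founded in 1850. "
        "It is clearly the best city in the world. It has a museum.",
        "The city was founded in 1850. "
        "It is a popular tourist destination. It has a museum.",
    ]
    pair = find_snapshot_pair(history)
    if pair is not None:
        for sentence in removed_or_rewritten(*pair):
            print(sentence)
```

Diffing at the sentence level rather than the character level keeps the output aligned with the unit the paper classifies. A sentence that was only moved, or changed for reasons unrelated to neutrality, would still be reported by this sketch, which is one plausible source of the noise the abstract mentions.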
Anthology ID:
R19-1006
Volume:
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)
Month:
September
Year:
2019
Address:
Varna, Bulgaria
Editors:
Ruslan Mitkov, Galia Angelova
Venue:
RANLP
Publisher:
INCOMA Ltd.
Pages:
42–51
URL:
https://aclanthology.org/R19-1006
DOI:
10.26615/978-954-452-056-4_006
Cite (ACL):
Desislava Aleksandrova, François Lareau, and Pierre André Ménard. 2019. Multilingual sentence-level bias detection in Wikipedia. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 42–51, Varna, Bulgaria. INCOMA Ltd.
Cite (Informal):
Multilingual sentence-level bias detection in Wikipedia (Aleksandrova et al., RANLP 2019)
PDF:
https://aclanthology.org/R19-1006.pdf
Code:
crim-ca/wiki-bias