Never guess what I heard... Rumor Detection in Finnish News: a Dataset and a Baseline

Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, Jack Rueter


Abstract
This study presents a new dataset for rumor detection in Finnish-language news headlines. We evaluate two LSTM-based models and two BERT models and find substantial differences in their results. A fine-tuned FinBERT reaches the best overall accuracy of 94.3% and a rumor label accuracy of 96.0%. However, a model fine-tuned on Multilingual BERT reaches the best factual label accuracy of 97.2%. Our results suggest that the performance difference is due to a difference in the original training data. Furthermore, we find that a regular LSTM model works better than one trained with a pretrained word2vec model. These findings suggest that more work needs to be done on pretrained models for the Finnish language, as they have been trained on small and biased corpora.
Anthology ID:
2021.nlp4if-1.6
Volume:
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
Month:
June
Year:
2021
Address:
Online
Venue:
NLP4IF
Publisher:
Association for Computational Linguistics
Pages:
39–44
URL:
https://aclanthology.org/2021.nlp4if-1.6
DOI:
10.18653/v1/2021.nlp4if-1.6
Cite (ACL):
Mika Hämäläinen, Khalid Alnajjar, Niko Partanen, and Jack Rueter. 2021. Never guess what I heard... Rumor Detection in Finnish News: a Dataset and a Baseline. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, pages 39–44, Online. Association for Computational Linguistics.
Cite (Informal):
Never guess what I heard… Rumor Detection in Finnish News: a Dataset and a Baseline (Hämäläinen et al., NLP4IF 2021)
PDF:
https://aclanthology.org/2021.nlp4if-1.6.pdf