Matyas Bohacek
2024
When XGBoost Outperforms GPT-4 on Text Classification: A Case Study
Matyas Bohacek
|
Michal Bravansky
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Large language models (LLMs) are increasingly used for applications beyond text generation, ranging from text summarization to instruction following. One popular example of exploiting LLMs’ zero- and few-shot capabilities is the task of text classification. This short paper compares two popular LLM-based classification pipelines (GPT-4 and LLAMA 2) to a popular pre-LLM-era classification pipeline on the task of news trustworthiness classification, focusing on performance, training, and deployment requirements. We find that, in this case, the pre-LLM-era ensemble pipeline outperforms the two popular LLM pipelines while being orders of magnitude smaller in parameter size.
2023
Czech-ing the News: Article Trustworthiness Dataset for Czech
Matyas Bohacek
|
Michal Bravansky
|
Filip Trhlík
|
Vaclav Moravec
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
We present the Verifee dataset: a multimodal dataset of news articles with fine-grained trustworthiness annotations. We bring a diverse set of researchers from social, media, and computer sciences aboard to study this interdisciplinary problem holistically and develop a detailed methodology that assesses the texts through the lens of editorial transparency, journalist conventions, and objective reporting while penalizing manipulative techniques. We collect over 10,000 annotated articles from 60 Czech online news sources. Each item is categorized into one of the 4 proposed classes on the credibility spectrum – ranging from entirely trustworthy articles to deceptive ones – and annotated of its manipulative attributes. We fine-tune prominent sequence-to-sequence language models for the trustworthiness classification task on our dataset and report the best F-1 score of 0.53. We open-source the dataset, annotation methodology, and annotators’ instructions in full length at https://www.verifee.ai/research/ to enable easy build-up work.
2022
Misinformation Detection in the Wild: News Source Classification as a Proxy for Non-article Texts
Matyas Bohacek
Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)
Creating classifiers of disinformation is time-consuming, expensive, and requires vast effort from experts spanning different fields. Even when these efforts succeed, their roll-out to publicly available applications stagnates. While these models struggle to find their consumer-accessible use, disinformation behavior online evolves at a pressing speed. The hoaxes get shared in various abbreviations on social networks, often in user-restricted areas, making external monitoring and intervention virtually impossible. To re-purpose existing NLP methods for the new paradigm of sharing misinformation, we propose leveraging information about given texts’ originating news sources to proxy the respective text’s trustworthiness. We first present a methodology for determining the sources’ overall credibility. We demonstrate our pipeline construction in a specific language and introduce CNSC: a novel dataset for Czech articles’ news source and source credibility classification. We constitute initial benchmarks on multiple architectures. Lastly, we create in-the-wild wrapper applications of the trained models: a chatbot, a browser extension, and a standalone web application.