How newspapers cover the news significantly affects how facts are understood, perceived, and processed by the public. This is especially crucial when serious crimes are reported, e.g., in the case of femicides, where the way the perpetrator and the victim are described shapes strong, often polarized opinions on this severe societal issue. This paper presents FMNews, a new dataset of articles reporting femicides extracted from Italian newspapers. Our core contribution is an original resource, openly accessible to the research community, intended to foster a deeper framing and awareness of the phenomenon and to facilitate further analyses on the topic. The paper also provides a preliminary study of the resulting collection through several example use cases and scenarios.
The varied backgrounds and experiences of human annotators inject different opinions and potential biases into the data, inevitably leading to disagreements. Yet, traditional aggregation methods fail to capture individual judgments because they rely on the notion of a single ground truth. Our aim is to review prior contributions and pinpoint the shortcomings that might cause stereotypical content generation. As a preliminary study, we investigate state-of-the-art approaches along two main research directions. First, we examine how incorporating subjectivity into LLMs might ensure diversity. We then look into the alignment between humans and LLMs and discuss how to measure it. In light of the existing gaps, our review explores possible methods to mitigate the perpetuation of biases targeting specific communities. However, we recognize the potential risk of disseminating sensitive information that arises from using socio-demographic data in the training process. These considerations underscore the importance of including diverse perspectives while implementing robust safeguards to protect individuals' privacy and prevent the inadvertent propagation of sensitive information.
#Turki$hTweets is a benchmark dataset for the task of correcting user misspellings, introduced as the first public Turkish dataset in this area. #Turki$hTweets provides correct/incorrect word annotations with a detailed misspelling category formulation based on real user data. We evaluated four state-of-the-art approaches on our dataset to present a preliminary analysis and to support reproducibility.