In this work, we address the challenge of identifying the inference relation between a plain language statement and Clinical Trial Reports (CTRs) by using a T5-large model embedding. The task, hosted at SemEval-2024, involves the use of the NLI4CT dataset. Each instance in the dataset has one or two CTRs, along with an annotation from domain experts, a section marker, a statement, and an entailment/contradiction label. The goal is to determine if a statement entails or contradicts the given information within a trial description. Our submission consists of a T5-large model pre-trained on the medical domain. Then the pre-trained model embedding output provides the embedding representation of the text. Eventually, after a fine-tuning phase, the provided embeddings are used to determine the CTRs’ and the statements’ cosine similarity to perform the classification. On the official test set, our submitted approach is able to reach an F1 score of 0.63, and a faithfulness and consistency score of 0.30 and 0.50 respectively.
One of the most widely used content types in internet misinformation campaigns is memes. Since they can readily reach a big number of users on social media sites, they are most successful there. Memes used in a disinformation campaign include a variety of rhetorical and psychological strategies, including smearing, name-calling, and causal oversimplification, to achieve their goal of influencing the users. The shared task’s objective is to develop models for recognizing these strategies solely in a meme’s textual content (Subtask 1) and in a multimodal context where both the textual and visual material must be analysed simultaneously (Subtasks two and three). In this paper, we discuss the application of a Mistral 7B model to address the Subtask one in English. Find the persuasive strategy that a meme employs from a hierarchy of twenty based just on its “textual content.” Only a portion of the reward is awarded if the technique’s ancestor node is chosen. This classification issue is multilabel hierarchical. Our approach based on the use of a Mistral 7B model obtains a Hierarchical F1 of 0.42 a Hierarchical Precision of 0.30 and a Hierarchical Recall of 0.71. Our selected approach is able to outperform the baseline provided for the competition.
Participants in the SemEval-2024 Task 6 were tasked with executing binary classification aimed at discerning instances of fluent overgeneration hallucinations across two distinct setups: the model-aware and model-agnostic tracks. That is, participants must detect grammatically sound outputs which contain incorrect or unsupported semantic information, regardless of whether they had access to the model responsible for producing the output or not, within the model-aware and model-agnostic tracks. Two tracks were proposed for the task: a model-aware track, where organizers provided a checkpoint to a model publicly available on HuggingFace for every data point considered, and a model-agnostic track where the organizers do not. In this paper, we discuss the application of a Llama model to address both the tracks. Find the persuasive strategy that a meme employs from a hierarchy of twenty based just on its “textual content.” Only a portion of the reward is awarded if the technique’s ancestor node is chosen. This classification issue is multilabel hierarchical. Our approach reaches an accuracy of 0.62 on the agnostic track and of 0.67 on the aware track.
At the SemEval-2024 Task 5, the organizers introduce a novel natural language processing (NLP) challenge and dataset within the realm of the United States civil procedure. Each datum within the dataset comprises a comprehensive overview of a legal case, a specific inquiry associated with it, and a potential argument in support of a solution, supplemented with an in-depth rationale elucidating the applicability of the argument within the given context. Derived from a text designed for legal education purposes, this dataset presents a multifaceted benchmarking task for contemporary legal language models. Our manuscript delineates the approach we adopted for participation in this competition. Specifically, we detail the use of a Mistral 7B model to answer the question provided. Our only and best submission reach an F1-score equal to 0.5597 and an Accuracy of 0.5714, outperforming the baseline provided for the task.
The rise of Large Language Models (LLMs) has brought about a notable shift, rendering them increasingly ubiquitous and readily accessible. This accessibility has precipitated a surge in machine-generated content across diverse platforms encompassing news outlets, social media platforms, question-answering forums, educational platforms, and even academic domains. Recent iterations of LLMs, exemplified by entities like ChatGPT and GPT-4, exhibit a remarkable ability to produce coherent and contextually relevant responses across a broad spectrum of user inquiries. The fluidity and sophistication of these generated texts position LLMs as compelling candidates for substituting human labor in numerous applications. Nevertheless, this proliferation of machine-generated content has raised apprehensions regarding potential misuse, including the dissemination of misinformation and disruption of educational ecosystems. Given that humans marginally outperform random chance in discerning between machine-generated and human-authored text, there arises a pressing imperative to develop automated systems capable of accurately distinguishing machine-generated text. This pursuit is driven by the overarching objective of curbing the potential misuse of machine-generated content. Our manuscript delineates the approach we adopted for participation in this competition. Specifically, we detail the use of a DistilBERT model for classifying each sample in the test set provided. Our submission is able to reach an accuracy equal to 0.754 in place of the worst result obtained at the competition that is equal to 0.231.
The widespread success of language models has spurred the natural language processing (NLP) community to tackle tasks demanding implicit and intricate reasoning, drawing upon human-like common-sense mechanisms. While endeavors in vertical thinking tasks have garnered considerable attention, there has been a relative dearth of exploration in lateral thinking puzzles. To address this gap, we introduce BRAINTEASER: a multiple-choice Question Answering task meticulously crafted to evaluate the model’s capacity for lateral thinking and its ability to challenge default common-sense associations. At the SemEval-2024 Task 9, for the first subtask (i.e., Sentence Puzzle) the organizers asked the participants to develop models able to reply to multi-answer brain-teasing questions. For this purpose, we propose the application of a DeBERTa model in a zero-shot configuration. Our proposed approach is able to reach an overall score of 0.250. Suggesting a significant room for improvements in future works.
The EDiReF shared task at SemEval 2024 comprises three subtasks: Emotion Recognition in Conversation (ERC) in Hindi-English code-mixed conversations, Emotion Flip Reasoning (EFR) in Hindi-English code-mixed conversations, and EFR in English conversations. The objectives for the ERC and EFR tasks are defined as follows: 1) Emotion Recognition in Conversation (ERC): In this task, participants are tasked with assigning an emotion to each utterance within a dialogue from a predefined set of possible emotions. The goal is to accurately recognize and label the emotions expressed in the conversation; 2) Emotion Flip Reasoning (EFR): This task involves identifying the trigger utterance(s) for an emotion-flip within a multi-party conversation dialogue. Participants are required to pinpoint the specific utterance(s) that serve as catalysts for a change in emotion during the conversation. In this paper we only address the first subtask (ERC) making use of an online translation strategy followed by the application of a Mistral 7B model together with a few-shot prompt strategy. Our approach obtains an F1 of 0.36, eventually exhibiting further room for improvements.
In this study, we tackle the task of automatically discerning the level of semantic relatedness between pairs of sentences. Specifically, Task 1 at SemEval-2024 involves predicting the Semantic Textual Relatedness (STR) of sentence pairs. Participants are tasked with ranking sentence pairs based on their proximity in meaning, quantified by their degree of semantic relatedness, across 14 different languages. Each sentence pair is assigned manually determined relatedness scores ranging from 0 (indicating complete lack of relation) to 1 (denoting maximum relatedness). In our submitted approach on the official test set, focusing on Task 1 (a supervised task in English and Spanish), we achieve a Spearman rank correlation coefficient of 0.808 for the English language and 0.611 for the Spanish language.
In this paper we propose four deep learning models for the task of detecting and classifying Patronizing and Condescending Language (PCL) using a corpus of over 13,000 annotated paragraphs in English. The task, hosted at SemEval-2022, consists of two different subtasks. The Subtask 1 is a binary classification problem. Namely, given a paragraph, a system must predict whether or not it contains any form of PCL. The Subtask 2 is a multi-label classification task. Given a paragraph, a system must identify which PCL categories express the condescension. A paragraph might contain one or more categories of PCL. To face with the first subtask we propose a multi-channel Convolutional Neural Network (CNN) and an Hybrid LSTM. Using the multi-channel CNN we explore the impact of parallel word emebeddings and convolutional layers involving different kernel sizes. With Hybrid LSTM we focus on extracting features in advance, thanks to a convolutional layer followed by two bidirectional LSTM layers. For the second subtask a Transformer BERT-based model (i.e. DistilBERT) and an XLNet-based model are proposed. The multi-channel CNN model is able to reach an F1 score of 0.2928, the Hybrid LSTM modelis able to reach an F1 score of 0.2815, the DistilBERT-based one an average F1 of 0.2165 and the XLNet an average F1 of 0.2296. In this paper, in addition to system descriptions, we also provide further analysis of the results, highlighting strengths and limitations. We make all the code publicly available and reusable on GitHub.