2024
BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense
Baktash Ansari | Mohammadmostafa Rostamkhani | Sauleh Eetemadi
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
This paper outlines our approach to SemEval-2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multiple-choice questions that challenge models to think ‘outside the box’. We fine-tune two models, BERT and RoBERTa Large. Next, we employ a zero-shot Chain-of-Thought (CoT) prompting approach with six large language models, including GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that stages a ‘round table conference’ among multiple agents for zero-shot learning, to generate consensus answers among three selected language models. Our best method achieves an overall accuracy of 85% on the sentence puzzles subtask.
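A minimal sketch of the zero-shot Chain-of-Thought prompting step described above, assuming a generic ask_llm callable that stands in for whichever chat model (GPT-3.5, Mixtral, Llama2, ...) is queried; the prompt wording and answer parsing are illustrative, not the paper's exact templates.

import re
from typing import Callable, Sequence

COT_TEMPLATE = (
    "Answer the following lateral-thinking puzzle.\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Let's think step by step, then end with 'Answer: <letter>'."
)

def build_prompt(question: str, choices: Sequence[str]) -> str:
    # Label options A, B, C, ... so the model can answer with a single letter.
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    return COT_TEMPLATE.format(question=question, options=options)

def extract_choice(completion: str, n_choices: int) -> int:
    # Return the 0-based index of the letter after 'Answer:', or -1 if absent.
    match = re.search(r"Answer:\s*([A-Z])", completion)
    if match:
        idx = ord(match.group(1)) - ord("A")
        if 0 <= idx < n_choices:
            return idx
    return -1

def answer_puzzle(question: str, choices: Sequence[str],
                  ask_llm: Callable[[str], str]) -> int:
    # ask_llm is any function that sends a prompt to an LLM and returns its text.
    return extract_choice(ask_llm(build_prompt(question, choices)), len(choices))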
eagerlearners at SemEval2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure
Hoorieh Sabzevari | Mohammadmostafa Rostamkhani | Sauleh Eetemadi
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
This study investigates the performance of zero-shot classification using three large language models, two models with large input token limits, and two models pre-trained on legal data. Our main dataset comes from the domain of U.S. civil procedure. It includes summaries of legal cases, specific questions, potential answers, and detailed explanations of why each solution is relevant, all sourced from a book aimed at law students. By comparing these methods, we aimed to understand how effectively they handle the complexities found in legal datasets. Our findings show how well the zero-shot method with large language models can understand complicated data. We achieved our highest F1 score of 64% in these experiments.
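One way to frame the zero-shot answer selection above is as entailment-style scoring of each candidate answer. The sketch below uses an off-the-shelf NLI model via the Hugging Face zero-shot-classification pipeline as an illustrative stand-in; the model name and prompt wording are assumptions, not the paper's exact setup.

from transformers import pipeline

# Illustrative checkpoint; the paper's actual models are not reproduced here.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def pick_answer(case_summary: str, question: str, candidates: list[str]) -> int:
    # Score each candidate answer against the case and question; return the index
    # of the highest-scoring candidate. Very long case summaries will be truncated.
    premise = f"{case_summary}\nQuestion: {question}"
    result = classifier(premise, candidate_labels=candidates)
    return candidates.index(result["labels"][0])  # labels come back sorted by score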
ROSHA at SemEval-2024 Task 9: BRAINTEASER A Novel Task Defying Common Sense
Mohammadmostafa Rostamkhani | Shayan Mousavinia | Sauleh Eetemadi
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
In our exploration of SemEval-2024 Task 9, the challenging BRAINTEASER: A Novel Task Defying Common Sense, we employed several strategies for the BRAINTEASER QA task, which encompasses both sentence and word puzzles. In the first approach, we fine-tuned the XLM-RoBERTa model on the original training dataset, on the original dataset combined with the BiRdQA dataset, and on the original dataset combined with RiddleSense. Another strategy expanded each word in the BiRdQA dataset into a full sentence, aiming to enhance the semantic impact of individual words in our training data for word puzzle (WP) riddles; using ChatGPT-3.5, we extended each word into an extensive sentence and applied this process to all options within each riddle. Furthermore, we explored RECONCILE (round-table conference) using three prominent large language models: ChatGPT, Gemini, and the Mixtral-8x7B Large Language Model (LLM). As a final approach, we leveraged GPT-4 results. Our most successful experiment achieved a score of 0.900 for sentence puzzles (S_ori) and 0.906 for word puzzles (W_ori).
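A minimal sketch of the first strategy, a multiple-choice head on XLM-RoBERTa, shown at inference time with an illustrative riddle; the training loop, hyperparameters, and the BiRdQA/RiddleSense augmentation are omitted, and the head is randomly initialised until fine-tuned.

import torch
from transformers import AutoTokenizer, XLMRobertaForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = XLMRobertaForMultipleChoice.from_pretrained("xlm-roberta-large")

# Illustrative riddle; not taken from the task data.
question = "What part of London is in France?"
choices = ["The letter N", "The Eiffel Tower", "Big Ben", "The river Thames"]

# Encode one (question, option) pair per choice; the head scores each pair jointly.
enc = tokenizer([question] * len(choices), choices,
                return_tensors="pt", padding=True, truncation=True)
batch = {k: v.unsqueeze(0) for k, v in enc.items()}  # shape (1, n_choices, seq_len)

with torch.no_grad():
    logits = model(**batch).logits                    # shape (1, n_choices)
print("predicted option:", logits.argmax(dim=-1).item())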
2023
ROZAM at SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis
Mohammadmostafa Rostamkhani | Ghazal Zamaninejad | Sauleh Eetemadi
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
We build a model using the large multilingual pretrained language model XLM-T for a regression task and fine-tune it on the MINT (Multilingual INTimacy) analysis dataset, which covers 6 languages for training and 4 unseen languages for testing the model's zero-shot performance. The dataset is annotated with intimacy scores. We experiment with several deep learning architectures to predict the intimacy score. To achieve optimal performance, we modify several model settings, including the loss function and the number and type of layers. In total, we ran 16 end-to-end experiments. Our best system achieved a Pearson correlation score of 0.52.
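A minimal sketch of the regression setup, assuming the XLM-T checkpoint cardiffnlp/twitter-xlm-roberta-base and a single-output head with MSE loss; the loss-function and layer variants explored across the 16 experiments are not reproduced here, and the tweets and scores are illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cardiffnlp/twitter-xlm-roberta-base"   # XLM-T checkpoint (assumed)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, num_labels=1, problem_type="regression")

tweets = ["I miss you so much", "The meeting starts at 3pm"]   # illustrative inputs
scores = torch.tensor([[4.2], [1.1]])                          # illustrative intimacy labels (1-5)

batch = tokenizer(tweets, return_tensors="pt", padding=True, truncation=True)
out = model(**batch, labels=scores)   # MSE loss under the regression problem type
out.loss.backward()                   # an optimiser step would follow in training
print(float(out.loss))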