An end-to-end argument mining (AM) pipeline takes a text as input and outputs its argumentative structure by identifying and classifying the argument units and argument relations in the text. In this work, we approach AM using fine-tuned large language models (LLMs). We model the three main sub-tasks of the AM pipeline, as well as their joint formulation, as text generation tasks. We fine-tune eight popular quantized and non-quantized LLMs from the LLaMA-3, LLaMA-3.1, Gemma-2, Mistral, Phi-3, and Qwen-2 families, which are among the most capable open-weight models, on three benchmark datasets (PE, AbstRCT, and CDCP) that represent diverse data sources. Our approach achieves state-of-the-art results across all AM sub-tasks and datasets, showing significant improvements over previous benchmarks.
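To make the "sub-tasks as text generation" framing concrete, the sketch below shows one plausible way an annotated document could be serialized into a (prompt, target) pair for causal-LM fine-tuning. The prompt template, unit/relation serialization, and label names are illustrative assumptions, not the paper's actual format.

```python
# A minimal sketch (not the paper's exact scheme) of casting AM as text generation:
# the input text becomes a prompt, and the annotated argument units and relations
# are serialized into a target string that the LLM is fine-tuned to generate.

def build_generation_example(text, units, relations):
    """Turn one annotated document into a (prompt, target) pair.

    `units`: list of (span_text, unit_type) tuples, e.g. ("museums preserve culture", "Premise").
    `relations`: list of (source_idx, target_idx, relation_type) tuples.
    All names here are illustrative, not the authors' actual annotation schema.
    """
    prompt = (
        "Identify the argument units and relations in the following text.\n"
        f"Text: {text}\n"
        "Output:"
    )
    unit_lines = [f"[{i}] {utype}: {span}" for i, (span, utype) in enumerate(units)]
    rel_lines = [f"{src} -{rtype}-> {tgt}" for src, tgt, rtype in relations]
    target = "\n".join(unit_lines + rel_lines)
    return prompt, target


if __name__ == "__main__":
    prompt, target = build_generation_example(
        text="Museums should be free because they preserve shared culture.",
        units=[("Museums should be free", "Claim"),
               ("they preserve shared culture", "Premise")],
        relations=[(1, 0, "Support")],
    )
    print(prompt)
    print(target)
```

Under this formulation, unit identification, unit classification, and relation classification (or their joint version) differ only in which parts of the serialized structure appear in the prompt versus the target.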
Argument Mining (AM) aims to extract the complex argumentative structure of a text, and Argument Type Classification (ATC) is an essential sub-task of AM. Large Language Models (LLMs) have shown impressive capabilities in most NLP tasks and beyond. However, fine-tuning LLMs can be challenging. In-Context Learning (ICL) has been suggested as a bridging paradigm between training-free and fine-tuning settings for LLMs. In ICL, an LLM is conditioned to solve tasks using a few solved demonstration examples included in its prompt. We focus on AM in the biomedical AbstRCT dataset and address ATC using quantized and non-quantized LLaMA-3 models through zero-shot learning, in-context learning, and fine-tuning approaches. We introduce a novel ICL strategy that combines $k$NN-based example selection with majority vote ensembling, along with a well-designed fine-tuning strategy for ATC. In the zero-shot setting, we show that LLaMA-3 fails to achieve acceptable classification results, suggesting the need for additional training modalities. However, in our training-free ICL setting, LLaMA-3 can leverage relevant information from only a few demonstration examples to achieve very competitive results. Finally, in our fine-tuning setting, LLaMA-3 achieves state-of-the-art performance on the ATC task on the AbstRCT dataset.
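The following is a minimal sketch of a $k$NN-plus-majority-vote ICL procedure of the kind described above, assuming precomputed sentence embeddings. The similarity measure, the way ensemble members differ (here: shuffled demonstration order), the prompt template, the label set, and the `llm_classify` placeholder are all assumptions for illustration, not the authors' exact implementation.

```python
# Sketch: select the k most similar training sentences as demonstrations,
# build several prompts, and aggregate the LLM's predictions by majority vote.
import math
import random
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_demonstrations(query_emb, train_set, k):
    """Pick the k training examples whose embeddings are closest to the query."""
    ranked = sorted(train_set, key=lambda ex: cosine(query_emb, ex["emb"]), reverse=True)
    return ranked[:k]


def llm_classify(prompt):
    """Placeholder: replace with an actual LLaMA-3 call returning a label string.
    The label set below is illustrative, not necessarily AbstRCT's exact schema."""
    return random.choice(["MajorClaim", "Claim", "Evidence"])


def predict_with_ensemble(query, query_emb, train_set, k=4, n_votes=5, seed=0):
    """kNN demonstration selection with majority-vote ensembling over several prompts."""
    demos = knn_demonstrations(query_emb, train_set, k)
    rng = random.Random(seed)
    votes = []
    for _ in range(n_votes):
        order = demos[:]
        rng.shuffle(order)  # vary demonstration order across ensemble members (an assumption)
        demo_text = "\n".join(f"Sentence: {d['text']}\nType: {d['label']}" for d in order)
        prompt = (f"Classify the argument type of the sentence.\n{demo_text}\n"
                  f"Sentence: {query}\nType:")
        votes.append(llm_classify(prompt))
    return Counter(votes).most_common(1)[0][0]
```

In this sketch, the ensemble costs `n_votes` LLM calls per test sentence; the vote simply returns the most frequent predicted label, which smooths out the sensitivity of ICL to demonstration ordering.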