VivesDebate-Speech: A Corpus of Spoken Argumentation to Leverage Audio Features for Argument Mining

In this paper, we describe VivesDebate-Speech, a corpus of spoken argumentation created to leverage audio features for argument mining tasks. The creation of this corpus represents an important contribution to the intersection of the speech processing and argument mining communities.


Introduction
The automatic analysis of argumentation in human debates is a complex problem that encompasses different challenges such as mining, computationally representing, or automatically assessing natural language arguments. Furthermore, human argumentation appears in different media and domains: argumentative monologues (e.g., essays) and dialogues (e.g., debates), argumentation in text (e.g., opinion pieces or social network discussions) and in speech (e.g., debate tournaments or political speeches), and domains such as the political [1,2], legal [3], financial [4], or scientific [5,6], among others. Thus, human argumentation presents a linguistically heterogeneous nature that requires us to carefully investigate and analyse all these variables in order to propose and develop argumentation systems that are robust to these variations in language. In addition to this heterogeneity, it is worth mentioning that the vast majority of the publicly available resources for argumentation-based Natural Language Processing (NLP) have been created considering text features only, even if their original source comes from speech [7,8]. This is a substantial limitation, not only for our knowledge of the impact that speech may directly have when approaching argument-based NLP tasks, but also because of the significant loss of information that happens when we only take into account the text transcript of spoken argumentation.
In this work, we focus on the initial steps of argument analysis considering acoustic features, namely, the automatic identification of natural language arguments. Argument mining is the area of research that studies this first step in the analysis of natural language argumentative discourse, and it is defined as the task of automatically identifying arguments and their structures from natural language inputs. As surveyed in [9], argument mining can be divided into three main sub-tasks: first, the segmentation of natural language spans relevant for argumentative reasoning (typically defined as Argumentative Discourse Units, ADUs [10]); second, the classification of these units into finer-grained argumentative classes (e.g., major claims, claims, or premises [11]); and third, the identification of argumentative structures and relations existing between these units (e.g., inference, conflict, or rephrase [12]). Therefore, our contribution is twofold. First, we create a new publicly available resource for argument mining research that enables the use of audio features for argumentative purposes. Second, we present first-of-their-kind experiments showing that the use of acoustic information improves the performance of segmenting ADUs from natural language inputs (both audio and text).
The rest of the paper is structured as follows. Section 2 describes the VivesDebate-Speech corpus. Section 3 provides a thorough description of the problem approached in this work. Section 4 explains the proposed methodology used to integrate audio features into the argument mining process. Section 5 reports and analyses the results observed during our experimentation. Lastly, Section 6 summarises the most important findings from our experiments.

The VivesDebate-Speech Corpus
The first step in our research was the creation of a new natural language argumentative corpus. In this work, we present VivesDebate-Speech, an argumentative corpus created to leverage audio features for argument mining tasks. VivesDebate-Speech has been created taking the previously annotated VivesDebate corpus [13] as a starting point.
The VivesDebate corpus contains 29 professional debates in Catalan, where each debate has been comprehensively annotated. This way, it is possible to capture longer-range dependencies between natural language ADUs and to keep the chronological order of the complete debate. Although the nature of the debates was speech-based argumentation, the VivesDebate corpus was published considering only the textual features included in the transcriptions of the debates that were used during the annotation process. In this paper, we extend the VivesDebate corpus with its corresponding argumentative speeches in audio format. In addition to the speech features, we also created and released the BIO (i.e., Beginning, Inside, Outside) files for approaching the task of automatically identifying ADUs from natural language inputs (i.e., both textual and speech). The BIO files allow us to determine whether a word is at the Beginning of, Inside, or Outside an ADU.
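As a toy illustration of the tagging scheme (the words below are invented English examples for readability; the corpus itself is in Catalan):

```python
# Toy illustration of the BIO scheme: each word is tagged B (begins an ADU),
# I (inside an ADU), or O (outside any ADU). The sentence is invented for illustration.
words = ["well", "renewable", "energy", "reduces", "emissions", "thank", "you"]
tags  = ["O",    "B",         "I",      "I",       "I",         "O",     "O"]
```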
The VivesDebate-Speech corpus is, to the best of our knowledge, the largest publicly available resource for spoken argument mining. Furthermore, combined with the original VivesDebate corpus, a wider range of NLP tasks can be approached taking the new audio features into consideration (e.g., argument evaluation or argument summarisation). Compared to the size of the few previously available speech-based argumentative corpora [14,15] (i.e., 2 and 7 hours respectively), VivesDebate-Speech represents a significant leap forward (i.e., more than 12 hours) for the research community (see Table 1). The VivesDebate-Speech corpus consists of two parts. First, the text BIO files allow us to determine which spans of the natural language debate are ADUs and which are not. Second, using these BIO files and the text transcriptions of the debates, we have been able to align them with the audio of the debate and produce a set of .wav files containing the audio features of each ADU. VivesDebate-Speech is released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0) and can be publicly accessed from Zenodo1.

Text
The text-based part of the VivesDebate-Speech corpus consists of 29 BIO files where each word in the debate is labelled with a BIO tag. This way, the BIO files created in this work enable the task of automatically identifying argumentative natural language sequences in the complete debates annotated in the VivesDebate corpus. Furthermore, these files represent the basis on which it has been possible to achieve the main purpose of the VivesDebate-Speech corpus, i.e., to extend the textual features of the debates with their corresponding audio features. We created the BIO files by combining the transcriptions and the ADU annotation files of the VivesDebate2 corpus. For that purpose, we performed a sequential search of each annotated ADU in the transcript file of the corresponding debate, taking into account the chronological order of the annotated ADUs.
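A minimal sketch of this alignment step is given below, assuming whitespace-tokenised transcripts and chronologically ordered ADU strings; function and variable names are illustrative and do not correspond to the released code.

```python
def build_bio_tags(transcript_words, adus):
    """Sequentially search each annotated ADU in the transcript and emit per-word BIO tags.

    transcript_words: list of words of one debate transcript.
    adus: list of ADU strings, in chronological order of appearance.
    """
    tags = ["O"] * len(transcript_words)
    cursor = 0  # search position, so later ADUs can only match after earlier ones
    for adu in adus:
        adu_words = adu.split()
        for start in range(cursor, len(transcript_words) - len(adu_words) + 1):
            if transcript_words[start:start + len(adu_words)] == adu_words:
                tags[start] = "B"
                for i in range(start + 1, start + len(adu_words)):
                    tags[i] = "I"
                cursor = start + len(adu_words)
                break
    return list(zip(transcript_words, tags))
```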

Speech
Once the revised transcription had been augmented with the ADU information, it was force-aligned with the audio in order to obtain word-level timestamps. This process was carried out using the hybrid DNN-HMM system that was previously used to obtain the VivesDebate transcriptions, implemented using TLK [16]. As a result of this process, we obtained (start, end) timestamps for every (word, label) pair. We split the corpus into train, dev, and test sets considering the numerical order of the files (i.e., Debates 1-23 for train, Debates 24-26 for dev, and Debates 27-29 for test). The statistics for the final VivesDebate-Speech corpus are shown in Table 1.
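Given these word-level timestamps, the per-ADU audio clips described in Section 2 can be extracted by cutting the debate recording at the ADU boundaries. A minimal sketch, assuming timestamps expressed in seconds and illustrative file names:

```python
import soundfile as sf

def export_adu_clips(debate_wav, adu_spans, out_prefix="adu"):
    """Cut one .wav clip per ADU from the full debate recording.

    debate_wav: path to the debate audio file.
    adu_spans: list of (start, end) times in seconds, one per ADU.
    """
    audio, sr = sf.read(debate_wav)
    for idx, (start, end) in enumerate(adu_spans):
        clip = audio[int(start * sr):int(end * sr)]
        sf.write(f"{out_prefix}_{idx:04d}.wav", clip, sr)
```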

Problem Description
For understanding argumentative discourse, the first phases of the human argumentative reasoning process are the identification and the analysis of natural language arguments [17]. From a computational viewpoint, these phases are typically framed within the argument mining area of research, which investigates how to identify and analyse arguments in natural language inputs. In this paper, we approach the segmentation and identification of ADUs in natural language debates. Furthermore, we approach this problem considering both text-based and audio-based natural language features. The inclusion of audio-based natural language features into the argument mining pipeline extends the scarce previous research on this topic [14,15], and makes it possible to explore a new dimension which has typically not been addressed from a computational viewpoint, but which represents an important source of information for the human argumentative reasoning process.
Therefore, we approach the identification of natural language ADUs in two different ways: (i) as a token classification problem, and (ii) as a sequence classification problem. In the first approach, we analyse our information at the token level. Each token is assigned a BIO label, and we model the probability of a token belonging to each of these labels considering the n closest contextual tokens. In the second approach, the information is analysed at the sentence level. In order to address the ADU identification task as a sequence classification problem, we need a set of previously segmented natural language sequences. The problem is then approached as a two-class classification task, discriminating argumentatively relevant from non-argumentative natural language sequences. In the following section, we present our proposal to tackle this problem considering both approaches, i.e., token-level and sequence-level analysis of natural language inputs.

Proposed Method
The use of audio information for argument mining presents significant advantages along three axes: efficiency, information, and error propagation. First, the segmentation of the raw audio into independent units is a prerequisite for most Automatic Speech Recognition (ASR) systems. If the segmentation produced in the ASR step is incorporated into the argument mining pipeline, we remove the need for a specific text-segmentation step, which brings significant savings in computation and pipeline complexity. Second, the use of audio features allows us to take into account relevant prosodic features such as intonation and pauses, which are critical for discourse segmentation but are missing from a text-only representation. Lastly, the use of ASR transcriptions introduces noise into the pipeline as a result of recognition errors, which can hamper downstream performance. Working directly with the speech signal allows us to avoid this source of error propagation.
We explore two different methods to leverage audio features for argument mining. The first is a standard end-to-end (E2E) approach that takes the text-based transcription of the spoken debate produced by the ASR step as input, and directly outputs the segmentation of this text into argumentative units. The second is a cascaded model composed of two sub-tasks: argument segmentation and argument classification. In the first sub-task, the discourse is segmented into independent units; then, for each unit, it is determined whether it contains argumentative information or not. Both approaches produce an equivalent output, a sequence of BIO tags, which is then compared against the reference. This work investigates how audio features can best be incorporated into the previously described process. An overview of the proposed cascaded method is shown in Figure 1. As we can observe, a segmentation step follows the ASR step and segments either the whole audio (A-Seg) or the whole text (T-Seg) into (potential) argumentative segments. A classification step then detects whether each segment contains an argumentative unit, using either the audio (A-Clf) or the text (T-Clf) contained in the segment. If efficiency is a major concern, the segmentation produced by the ASR step can be re-used instead of a specific segmentation step tailored for argumentation, although this could decrease the quality of the results. This process differs significantly from the E2E approach, where the BIO tags are generated directly from the output of the ASR step. Our cascaded model is therefore interesting because it makes it possible to analyse different combinations of audio and text features.
The cascaded method has one significant advantage: audio segmentation is a widely studied problem in ASR and Speech Translation (ST), for which significant breakthroughs have been achieved in the last few years. Currently, one of the best performing audio segmentation methods is SHAS [18], which uses a probabilistic Divide and Conquer (DAC) algorithm to obtain optimal segments. Furthermore, we have compared SHAS with a Voice Activity Detection (VAD) baseline, as well as with the hybrid (non-probabilistic) DAC method [19] using a Wav2Vec2 pause predictor [20], which performs ASR inference and then splits based on detected word boundaries. To complete our proposal, we have also explored text-only segmentation methods in which a Transformer-based model is trained to detect boundaries between natural language segments. This way, each word is assigned one of two classes: boundary or not boundary.
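For the text-only segmenter, the per-word boundary predictions are subsequently turned into candidate segments. A minimal sketch of that conversion (illustrative names, not the released code):

```python
def words_to_segments(words, is_boundary):
    """Group words into segments, closing a segment after each word predicted as a boundary.

    words: list of tokens of a debate transcript.
    is_boundary: list of booleans, True when the word is the last one of a segment.
    """
    segments, current = [], []
    for word, boundary in zip(words, is_boundary):
        current.append(word)
        if boundary:
            segments.append(current)
            current = []
    if current:  # keep any trailing words as a final segment
        segments.append(current)
    return segments
```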
The second stage of our cascaded method is an argument detection classifier, which decides, for each segment, whether it contains argumentative content and should be kept, or should be discarded otherwise. If our classifier detects argumentative content within a segment, its first word is assigned the B label (i.e., Beginning) and the rest of its words are assigned the I label (i.e., Inside). Conversely, if the classifier does not detect argumentative content within a segment, all the words belonging to this segment are assigned the O label (i.e., Outside).
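The following sketch makes this conversion explicit, assuming word-level segments and one binary decision per segment (names are illustrative):

```python
def segments_to_bio(segments, is_argumentative):
    """Render per-segment decisions as one BIO tag sequence over the whole debate.

    segments: list of word lists, in chronological order.
    is_argumentative: list of booleans, one classifier decision per segment.
    """
    tags = []
    for words, keep in zip(segments, is_argumentative):
        if keep and words:
            tags.extend(["B"] + ["I"] * (len(words) - 1))
        else:
            tags.extend(["O"] * len(words))
    return tags
```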

Experimental Setup
Regarding the implementation of the text-based sub-tasks (see Figure 1, green modules) of our cascaded method for argument mining, we used a RoBERTa3 architecture pre-trained on Catalan data. For the segmentation (T-Seg), we experimented with a RoBERTa model for token classification, and we used the segments produced by this model to measure the impact of the audio features compared to text in the segmentation part of our cascaded method. For the classification (T-Clf), we fine-tuned a RoBERTa-base model 4 for sequence classification with a two-class softmax output able to discriminate between the argument and non-argument classes.
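A minimal sketch of the T-Clf setup with the Hugging Face Transformers library is shown below; the checkpoint name is an illustrative Catalan RoBERTa model, not necessarily the one referenced in the footnotes.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative Catalan RoBERTa checkpoint; the paper fine-tunes a Catalan RoBERTa-base model.
checkpoint = "projecte-aina/roberta-base-ca-v2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Score one candidate segment as argument (1) vs. non-argument (0).
inputs = tokenizer("un segment candidat del debat", return_tensors="pt", truncation=True)
predicted_class = model(**inputs).logits.argmax(dim=-1).item()
```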
The implementation of the audio-based sub-tasks (see Figure 1, blue modules) differs considerably between segmentation and classification. For audio-only segmentation (A-Seg), we performed a comparison between the selected algorithms: VAD, DAC, and SHAS. For the hybrid DAC segmentation, two Catalan W2V ASR models are tested, xlsr-53 5 and xls-r-1b 6 . For SHAS, we used the original SHAS Spanish and multilingual checkpoints. Additionally, we trained a Catalan SHAS model on the VivesDebate-Speech training audios, as no other public dataset contains the unsegmented audio needed to train a SHAS model. Regarding our audio-only classifier (A-Clf), Wav2Vec27 models have been fine-tuned on the sequence classification task (i.e., argument/non-argument).
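As a hedged sketch of the A-Clf setup, a pre-trained Wav2Vec2 model can be loaded for two-class sequence classification over the raw segment audio; the checkpoint and file names below are illustrative, and the segment audio is assumed to be 16 kHz mono.

```python
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# Illustrative multilingual Wav2Vec2 checkpoint; the classification head is newly initialised
# and would be fine-tuned on the argument / non-argument task.
checkpoint = "facebook/wav2vec2-xls-r-300m"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

audio, sr = sf.read("segment_0001.wav")  # a candidate segment produced by A-Seg (assumed 16 kHz mono)
inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # argument vs. non-argument scores for the segment
```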
Regarding the training parameters used in our experiments with the RoBERTa models, we trained them for 50 epochs with a learning rate of 1e-5 and a batch size of 128 samples. The best model among the 50 epochs was selected based on performance on the dev set. All the experiments were carried out using an Intel Core i7-9700k CPU with an NVIDIA RTX 3090 GPU and 32 GB of RAM.
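Expressed with the Hugging Face Trainer API (an assumption, since the paper does not name its training framework), this configuration would look roughly as follows; exact argument names can vary slightly across Transformers versions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-adu",            # illustrative output path
    num_train_epochs=50,
    learning_rate=1e-5,
    per_device_train_batch_size=128,
    evaluation_strategy="epoch",         # evaluate on the dev set after every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,         # keep the epoch with the best dev performance
    metric_for_best_model="f1",          # assumes a compute_metrics function returning macro-F1
)
```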

Evaluation
In order to evaluate the performance of the proposed methods, we use the reference transcriptions and timestamps of the VivesDebate-Speech corpus as input to our models. This is necessary in order to be able to compare the system hypotheses with the reference labels. If a real ASR system had been used, a different transcription would have been obtained, and we would need a way of mapping the BIO tags of the real, noisy transcription to those of the reference. There is no reliable way to obtain such a mapping, as the BIO-annotated transcription and the reference will have different lengths and consist of different words.
Table 2 shows the best performance on the dev set of the different audio segmentation methods tested. Results are reported without using an argument classifier, which is equivalent to a majority-class classifier baseline, as well as with an oracle classifier which assigns the most frequent class (based on the reference labels) to a segment. This allows us to analyse the upper bound of performance that could be achieved with a perfect classifier.
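For reference, the oracle described above can be sketched as a simple majority vote over the reference labels covered by each segment; grouping the B and I tags into a single argumentative class is an assumption of this sketch.

```python
from collections import Counter

def oracle_segment_label(reference_tags):
    """Assign a segment the most frequent reference class it covers.

    reference_tags: BIO tags of the reference words that fall inside one segment.
    """
    counts = Counter("ARG" if tag in ("B", "I") else "O" for tag in reference_tags)
    return counts.most_common(1)[0][0]
```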
The results highlight the strength of the SHAS method: the SHAS-es and SHAS-multi models, which operate in a zero-shot scenario, outperform the Catalan W2V models. The SHAS-ca model had insufficient training data to reach parity with the zero-shot models trained on larger audio collections. As a result, the SHAS-multi model was selected for the rest of the experiments.
One key factor to study is the relationship between the maximum segment length (in seconds) produced by the SHAS segmenter and the performance of the downstream classifier. Longer segments provide more context that can be helpful for the classification task, but a longer segment might contain a significant portion of both argumentative and non-argumentative content. Figure 2 shows the performance of the text classifier as a function of segment size, measured on the dev set. A maximum segment length of 5 seconds was selected, as shorter segments did not improve results.
Once the hyperparameters of each individual model had been optimised, the best results for each system combination were computed, as reported in Table 3. The results are consistent across the dev and test sets. The end-to-end model outputs the BIO tags directly from fixed-length input (for which 5 seconds was also the best performing value), denoted as E2E BIO-5. Alternatively, the E2E BIO-A model was trained considering the natural language segments produced by the SHAS-multi model instead of relying on a specific maximum length defined without any linguistic criteria. In this way, our objective was to improve the training of our end-to-end model through the use of linguistically informed, audio-based natural language segments. It can be observed that this second approach leverages the audio information to improve test macro-F1 from 0.47 to 0.49. For the cascade model, we test both audio and text segmenters and classifiers. Similarly to the E2E case, the use of audio segmentation consistently improves the results. For the text classifier, moving from text segmentation (T-Seg + T-Clf) to audio segmentation (A-Seg + T-Clf) increases test macro-F1 from 0.49 to 0.51. Likewise, when using an audio classifier (A-Clf), audio segmentation improves the test results from 0.41 to 0.43 macro-F1. However, the relatively mediocre performance of the audio classification models with respect to their text counterparts stands out. Although the use of an audio classifier is significantly better than the majority baseline (see Table 2), there is still quite a gap with the performance achieved by the text models. We believe this could be because speech classification is a harder task than text classification in our setup, since the audio classifier deals with the raw audio, whereas the text classifier uses the reference transcriptions as input. Additionally, the pre-trained audio models might not be capable enough for some transfer learning tasks, as they have been trained on a significantly lower number of tokens, which hampers their language modelling capabilities.

Conclusions
We have presented the VivesDebate-Speech corpus, which is currently the largest publicly available speech-based argumentation corpus. This opens up exciting research opportunities for more realistic argument mining experiments that take audio information into account.
The experiments have shown how having access to audio information can be a source of significant improvement. Specifically, using audio segmentation instead of text-based segmentation consistently improves performance, both for the text and audio classifiers used in the cascade approach and in the end-to-end scenario, where audio segmentation is used as a decoding constraint for the model.
In terms of future work, an exciting research direction is to better understand the underperformance of the audio-based classifiers, as well as to devise new techniques for bridging the gap with the results obtained by the text classifiers.

Figure 1: Overview of the proposed cascaded approach.

Figure 2: Dev set F1 score as a function of maximum segment length (s), SHAS-multi segmenter followed by text classifier.

Table 1: Set-level statistics of the VivesDebate-Speech corpus. Each debate is carried out between two teams, and two to four members of each team participate as speakers in the debate.

Table 2: Audio segmentation methods' performance on the dev set, as measured by accuracy (Acc.) and Macro-F1.

Table 3: Accuracy and Macro-F1 results of the argumentative discourse segmentation task on both the dev and test sets.