HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis

Authorship analysis, also known as stylometry, has been an essential aspect of Natural Language Processing (NLP) for a long time. Likewise, the recent advancement of Large Language Models (LLMs) has made authorship analysis increasingly crucial for distinguishing between human-written and AI-generated texts. However, these authorship analysis tasks have primarily focused on written texts, not spoken texts. Thus, we introduce the largest benchmark for spoken texts: HANSEN (Human ANd ai Spoken tExt beNchmark). HANSEN encompasses meticulous curation of existing speech datasets accompanied by transcripts, alongside the creation of novel AI-generated spoken text datasets. Together, it comprises 17 human datasets, plus AI-generated spoken texts created using three prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To evaluate and demonstrate the utility of HANSEN, we perform Authorship Attribution (AA) and Author Verification (AV) on the human-spoken datasets and conduct human vs. AI spoken text detection using state-of-the-art (SOTA) models. While SOTA methods, such as character n-grams or Transformer-based models, exhibit similar AA and AV performance on human-spoken datasets compared to written ones, there is much room for improvement in AI-generated spoken text detection. The HANSEN benchmark is available at: https://huggingface.co/datasets/HANSEN-REPO/HANSEN.


Introduction
Authorship analysis is a longstanding research area in NLP that has garnered significant attention over the years. Two tasks are at the core of authorship analysis: Authorship Attribution (AA) and Authorship Verification (AV). AA is the process of identifying the author of a text document (Kjell et al., 1994). Similarly, AV determines whether two documents are written by the same author (Halteren, 2007). Both have been extensively researched as text classification problems due to their substantial impact on various applications, such as author profiling, author forensic analysis, and resolving copyright disputes (Neal et al., 2017). Currently, most of the research in NLP text classification focuses on written text due to the vast amounts of text data available for training and evaluation (Kowsari et al., 2019). However, "text" is also inherent in spoken language as a textual representation of what individuals say, commonly known as spoken text (Biber, 1991). Although spoken language predates written language (considering the human history of language as a twelve-inch ruler, written language has only existed for the "last quarter of an inch" (Wrench et al., 2008)), it has not received much attention from the NLP community. Simultaneously, recent advancements in speech-to-text technology have expanded the availability of spoken text corpora, enabling researchers to explore new avenues for NLP research focused on spoken language.

Sample 1: "Today, the Commission considers adopting a final rule to enhance the disclosures related to share buybacks. I support this final rule because it will increase [...] First, the rule will require issuers to disclose periodically the prior period's daily buyback activity. This will include such information as the date of the purchase, the amount of shares repurchased, and the average purchase price for the date. [...] Second, issuers are required by the rule to provide disclosure about their buyback programs. Such disclosure will include details about the objectives or rationales for the buyback as well as the process [...]"
Sample 2: "Good morning. I am pleased to join the Investor Advisory Committee. [...] I look forward to hearing about your potential recommendation today regarding customer account statements. We look forward to your input about private markets, [...] We also look forward to your discussion on open-end funds. Now, let me turn to your panel on the oversight of investment advisers. I am glad you are discussing advisers, because [...]"
(Written statement and spoken speech from US Securities and Exchange Commission (SEC) Chair Gary Gensler on corresponding issues.)

Sample 3: "Different people naturally have differing life experiences and differing viewpoints - which inevitably results in varying agendas when implementing the law according to their own personal discretion. In this respect - Birks is correct that an over-reliance on conscience alone to provide equitable solutions inevitably leads to a messy framework of case law that would result in judges navigating uncertainly by feel than by the solid path of precedent..."
Sample 4: "So I do want to try steak, yeah. I think it's [...] I-I don't think it's bad, I think it's a good thing. Erm, because I-I did try be veggie for, I think it was six months I did it, but I gained a lot of weight. I think probably because I didn't really know the other options I could have ate instead. I think I was eating more carbs and stuff like that. So I was gaining weight so I stopped [...]"
(Written essay and informal interview transcript from a sample participant in the Aston Idiolects Corpus.)

Figure 2: Written and spoken text samples from the same individual. Written texts contain linking words, passive voice (Akinnaso, 1982), complicated words, and complex sentence structures. In contrast, spoken texts contain more first- and second-person usage, informal words (Brown et al., 1983), grammatically incorrect phrases (Biber et al., 2000), repetitions, and other differences. Sentences in spoken texts are also shorter (Farahani et al., 2020).
Identification of speakers from speech has been successfully addressed through various audio features, with or without the availability of associated text information (Kabir et al., 2021). Whether speaker identification can be achieved solely based on how individuals speak, as represented in spoken text, remains mostly unexplored in the existing literature. However, the rise of audio podcasts and short videos in the digital era, driven by widespread social media usage, has increased the risk of plagiarism as individuals seek to imitate the style and content of popular streamers. Spoken text authorship analysis can therefore serve as a valuable tool in addressing this situation. It can be challenging for several reasons. First, text is not the only message we convey in speaking, since body language, tone, and delivery also determine what we want to share (Berkun, 2009). Second, spoken text is more casual, informal, and repetitive than written text (Farahani et al., 2020). Third, word choice and sentence structure also differ in spoken text (Biber, 1991). For example, Figure 2 portrays the differences between written and spoken text from the same individuals. Therefore, discovering an author's style from spoken text can be an exciting study leveraging current advanced AA and AV techniques.
On top of classical plagiarism detection problems, there is a looming threat of synthetically generated content from LLMs. Using LLMs, such as ChatGPT, to generate scripts for speeches, podcasts, and YouTube videos is expected to become increasingly prevalent. Figure 1 exemplifies how various LLMs can adeptly generate replies to an interview question. While substantial research efforts are dedicated to discerning between human and AI-generated text, a task also known as the Turing Test (TT) (Uchendu et al., 2023), evaluating these methods predominantly occurs in the context of written texts, such as the XSum (Narayan et al., 2018) or SQuAD (Rajpurkar et al., 2016) datasets. Thus, spoken text authorship analysis and detecting AI-generated scripts will be crucial for identifying plagiarism and disputing copyright issues in the future.
To tackle these problems, we present HANSEN (Human ANd ai Spoken tExt beNchmark), a benchmark of human-spoken and AI-generated spoken text datasets, and perform three authorship analysis tasks to better understand the limitations of existing solutions on "spoken" texts. In summary, our contributions are as follows.
• We compile and curate existing speech corpora and create new datasets, combining 17 human-spoken datasets for the HANSEN benchmark.
• We generate spoken texts from three popular, recently developed LLMs: ChatGPT, PaLM2 (Anil et al., 2023), and Vicuna-13B (Chiang et al., 2023) (a fine-tuned version of LLaMA (Touvron et al., 2023) on user-shared conversations), combining ∼21K samples in total, and evaluate the quality of their generated spoken text.
• We assess the efficacy of traditional authorship analysis methods (AA, AV, and TT) on the HANSEN benchmark datasets.

Related Work
Authorship analysis: Over the last few decades, both AA and AV have been intensively studied with a wide range of features and classifiers. N-gram representations have long been the most common feature vectors for text documents (Kjell et al., 1994). Various stylometric features, such as lexical, syntactic, semantic, and structural features, are also popular in authorship analysis (Stamatatos, 2009; Neal et al., 2017). While machine learning (ML) and deep learning (DL) algorithms with n-gram or other stylometric feature sets have long been popular in stylometry (Neal et al., 2017), Transformers have become the state-of-the-art (SOTA) models for AA and AV, as well as many other text classification tasks (Tyo et al., 2022). Transformers, such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) and GPT (Generative Pre-trained Transformer) (Radford et al., 2019), are generally pre-trained on large amounts of text data before being fine-tuned for specific tasks (Wang et al., 2020), such as AA and AV. Also, different word embeddings, such as GloVe (Pennington et al., 2014) and FastText (Bojanowski et al., 2017), can capture the semantic relationships between words and are employed with DL algorithms for AA as well.
LLM text detection: LLMs' widespread adoption and mainstream popularity have propelled research efforts toward detecting AI-generated text. Supervised detectors work by fine-tuning language models on specific datasets of both human and AI-generated texts, such as the GPT-2 detector (Solaiman et al., 2019), the Grover detector (Zellers et al., 2019), the ChatGPT detector (Guo et al., 2023), and others. However, they only perform well within the specific domains and LLMs for which they are fine-tuned (Uchendu et al., 2023; Pu et al., 2023). Statistical or zero-shot detectors can detect AI text without seeing previous examples, making them more flexible; examples include GLTR (Gehrmann et al., 2019), DetectGPT (Mitchell et al., 2023), GPT Zero (Tian, 2023), and watermarking-based techniques (Kirchenbauer et al., 2023).
Style from speech: While some studies have attempted to infer an author's style from speech, they primarily focus on the speech emotion recognition task and rely on audio data. These works (Shinozaki et al., 2009; Wongpatikaseree et al., 2022) utilize various speech features and DL approaches to recognize a speaker's emotions, such as anger or sadness, and subsequently define the speaking style based on these emotional cues.

Figure 3: HANSEN benchmark framework.
Building the Benchmark: HANSEN

Existing conversational toolkits and datasets, such as ConvoKit (Chang et al., 2020), prioritize analyzing social phenomena in conversations, thus limiting their applicability in authorship analysis. Similarly, current LLM conversational benchmarks, including HC3 (Guo et al., 2023), ChatAlpaca (Bian et al., 2023), and XP3 (Muennighoff et al., 2023), primarily focus on evaluating LLMs' question-answering and instruction-following abilities rather than their capability to generate authentic spoken language in conversations or speeches. Therefore, we introduce the HANSEN datasets to address this gap, incorporating human and AI-generated spoken text from different scenarios, as portrayed in Figure 3, thereby providing a valuable resource for authorship analysis tasks. The datasets of HANSEN are available through the Python Hugging Face library.

Human spoken text datasets
The HANSEN benchmark comprises 17 human datasets, for which we utilize several existing speech corpora and create new datasets. Our contribution involves curating and preparing these datasets for authorship analysis by leveraging metadata information, aligning spoken text samples with their respective speakers, and performing necessary post-processing. Table 1 shows a summary of the human datasets.
Existing datasets selection: When constructing the HANSEN benchmark datasets, several criteria guided our selection of speech corpora. Firstly, we consider datasets with readily available transcripts (e.g., TED, BNC, BASE) or the ability to extract transcripts (e.g., Voxceleb (Nagrani et al.) via YouTube automated transcriptions). Secondly, speaker information for each sample was required, leading to the exclusion of corpora such as COCA (Davies, 2015). Finally, we also emphasize datasets where speakers did not recite identical content or scripts, excluding datasets like LJSpeech (Ito and Johnson, 2017) and the Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011) for their lack of spontaneity and originality in representing natural spoken language.

Table 1 (excerpt): descriptions of several selected corpora.
• Spotify (speech/conversation): a collection of English-language Spotify podcast episode transcripts compiled by Clifton et al. (2020).
• BASE (speech/conversation, academic topics): The British Academic Spoken English (BASE) corpus (Thompson and Nesi, 2001) comprises lectures and seminars recorded in a university. Mostly unscripted and informal discussion; sample unit: all utterances in a session (lecture/seminar); 523 samples; 1.66M words.
• BNC (conversation, daily-life topics): the spoken portion of the British National Corpus (Consortium et al., 2007), a collection of late-twentieth-century British English. Unscripted informal conversations; sample unit: all utterances in a conversation; 2461 samples; 53.6M words.
• BNC14 (conversation, daily-life topics): a contemporary version of the previous corpus (Love et al., 2017). Both datasets are gathered from various real-life contexts. Unscripted and informal; sample unit: all utterances in a conversation; 459 samples; 4.65M words.
• MSU Switchboard (conversation, pre-defined topics): The MSU Switchboard Dialog Act Corpus (Stolcke et al., 2000) includes phone calls between two participants. The callers ask receivers questions about child care, recycling, the news media, and other provided topics. Unscripted.
New dataset creation: We have introduced three new datasets within our benchmark 1. The SEC dataset comprises transcripts from the US Securities and Exchange Commission (SEC) website, encompassing both spoken (speech) and written (statement) content from various personnel in the commission.
The Face the Nation (FTN) dataset compiles transcripts from the Face the Nation talk show by CBS News, whose episodes feature discussions on contemporary topics with multiple guests. The CEO interview dataset contains interviews with financial personnel, CEOs of companies, and stock market associates, assembled from public transcripts provided by CEO-Today magazine, the Wall Street Journal, and Seeking Alpha. For authorship analysis, we retained only the answers from the interviewees, excluding the hosts' questions.
Dataset curation: We have followed a systematic process in preparing all datasets in our benchmark. It involves eliminating tags, URLs, and metadata from the text, and aligning each utterance with its respective speaker to generate labeled data for authorship analysis. To maintain consistent text lengths and contextual coherence within utterances, we adjust the span of each sample, as detailed in Table 1. In cases where a sample was excessively long, such as a lengthy TED talk, we split it into smaller segments to ensure uniform lengths. Furthermore, we partition each dataset into training, validation, and testing subsets while preserving a consistent author ratio across these partitions. Finally, since topic can influence authorship analysis (Sari et al., 2018), the benchmark comprises datasets covering both specific and diverse topics.
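The curation steps above can be summarized in a short sketch. This is a minimal illustration rather than the exact pipeline: the regular expressions, the (speaker, text) input format, and the 80/10/10 split ratios are assumptions made for the example.

```python
import re
from collections import defaultdict

def clean_utterance(text: str) -> str:
    """Strip tags, URLs, and bracketed metadata from a raw transcript line."""
    text = re.sub(r"<[^>]+>", " ", text)       # markup tags
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"\[[^\]]*\]", " ", text)    # bracketed metadata, e.g., [laughter]
    return re.sub(r"\s+", " ", text).strip()

def align_by_speaker(utterances):
    """Group cleaned utterances under their speakers to build labeled samples."""
    samples = defaultdict(list)
    for speaker, text in utterances:
        cleaned = clean_utterance(text)
        if cleaned:
            samples[speaker].append(cleaned)
    return samples

def split_per_author(samples, train=0.8, val=0.1):
    """Split each author's samples so the author ratio stays consistent per partition."""
    parts = {"train": [], "val": [], "test": []}
    for speaker, texts in samples.items():
        a = int(len(texts) * train)
        b = int(len(texts) * (train + val))
        parts["train"] += [(speaker, t) for t in texts[:a]]
        parts["val"] += [(speaker, t) for t in texts[a:b]]
        parts["test"] += [(speaker, t) for t in texts[b:]]
    return parts
```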

AI (LLM)-generated spoken text dataset
With the growing societal impact of LLMs, we will soon observe their extensive use in generating scripts for speeches and guidelines for interviews or conversations. This motivates us to include AI-generated spoken text in HANSEN. Since chat-based LLMs can follow instructions and generate more conversation-like text than traditional LLMs (Ouyang et al., 2022), we have utilized three recent prominent chat-based LLMs in our study: ChatGPT (gpt-3.5-turbo), PaLM2 (chat-bison@001), and Vicuna-13B. We also ensure the spoken nature of the generated texts and evaluate their quality.
Spoken text generation: LLMs are predominantly trained on written text from diverse sources, including BookCorpus (Zhu et al., 2015), OpenWebText (Gokaslan and Cohen, 2019), and Wikipedia (Merity et al., 2017), which contain only a small portion of spoken texts, although the exact proportion is unknown. Therefore, effective prompt engineering is crucial for generating spoken text.
Different metadata associated with each speaker and text sample can be utilized to construct effective prompts. For instance, by leveraging elements like talk descriptions, speaker details, and talk summaries, LLMs can produce more coherent TED talk samples than when given the topic alone. To create spoken texts from LLMs, we utilize subsets from five human datasets (TED, SEC, Spotify, CEO, and Tennis), chosen for their rich metadata and the involvement of notable public figures as speakers. Additional insights regarding prompt construction, dataset selection, and related attributes can be found in the Appendix.

Evaluating the spoken nature of AI-generated texts: To ensure the spoken nature of AI-generated texts, corresponding written texts in the same context are essential. We select subsets for each category and instruct the LLMs to generate written articles based on their respective spoken texts. In HANSEN, only two human datasets (SEC and PAN) provide written and spoken samples from the same individuals. Stylometric properties of these samples from the human and LLM-generated datasets are compared, and a Student's t-test (Livingston, 2004) was used to determine whether there were significant differences in these stylometric features between written and spoken texts. The results (Table 3) indicate that LLM-generated spoken and written texts exhibit trends similar to the human datasets, including variations in word lengths, first/second person counts, vocabulary richness, and other linguistic features, validating that LLMs have successfully generated text with a spoken-like quality in most scenarios.
ChatGPT and Vicuna13B demonstrate differences similar to those of humans, whereas PaLM2 shows only minor differences in some cases, indicating the need for more training with human-spoken texts. It is essential to consider the distinct nature of the SEC and PAN datasets: SEC speeches and statements are rigorously scripted and reviewed due to their potential impact on financial markets and the economy (Burgoon et al., 2016), while the PAN dataset comprises mainly informal spoken and written texts. Additional detailed findings are available in the Appendix.
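As a concrete illustration of the written-vs.-spoken comparison reported in Table 3, the sketch below computes a few simple stylometric features and applies a t-test per feature. The three features shown are illustrative stand-ins; the actual feature set (WritePrint/LIWC-style) is richer.

```python
import re
from scipy.stats import ttest_ind

FIRST_SECOND = re.compile(r"\b(i|me|my|we|us|our|you|your)\b", re.IGNORECASE)

def style_features(text: str) -> dict:
    words = text.split()
    n = max(len(words), 1)
    return {
        "avg_word_len": sum(len(w) for w in words) / n,
        "first_second_person": len(FIRST_SECOND.findall(text)) / n,
        "vocab_richness": len({w.lower() for w in words}) / n,  # type-token ratio
    }

def feature_difference(spoken, written, feature):
    """Student's t-test on one stylometric feature across the two text sets."""
    s = [style_features(t)[feature] for t in spoken]
    w = [style_features(t)[feature] for t in written]
    t_stat, p_value = ttest_ind(s, w)
    return t_stat, p_value  # small p => significant spoken-vs-written difference
```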
Quality of AI-generated spoken text: Additionally, we use several automatic evaluation metrics, such as the BERT score (Zhang* et al., 2020), GRUEN (Zhu and Bhat, 2020), MAUVE (Pillutla et al., 2021), a text diversity score, and a readability score (Kincaid et al., 1975), to evaluate the quality of the generated text (Figure 4). In the interview datasets, the BERT scores for all LLMs are consistently higher than 0.8 (out of 1.0) and mostly similar, as the original questions guide the generated texts. However, for open-ended speech datasets like TED and Spotify, PaLM2 shows lower BERT scores than the other models. The GRUEN scores across all datasets are uniformly distributed, indicating variation in linguistic quality among individual generated samples. PaLM2 exhibits lower GRUEN scores than the other models. Surprisingly, we observe lower text diversity for human texts, contradicting existing AI vs. human analyses (Guo et al., 2023) but supporting the repetitive nature of human-spoken texts (Farahani et al., 2020).
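A sketch of how two of these metrics can be computed with publicly available packages (bert-score and textstat; GRUEN and MAUVE ship with their own reference implementations). The package and model choices here are assumptions for illustration, not the exact evaluation setup.

```python
from bert_score import score as bert_score  # pip install bert-score
import textstat                             # pip install textstat

generated = ["The Williams sisters have been amazing for tennis..."]
references = ["I'd say the players themselves have been great for the sport..."]

# Semantic similarity between generated and reference texts (BERTScore F1).
P, R, F1 = bert_score(generated, references, lang="en")
print("BERTScore F1:", F1.mean().item())

# Readability: Flesch-Kincaid grade level (Kincaid et al., 1975).
print("FK grade:", textstat.flesch_kincaid_grade(generated[0]))
```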

Authorship Analysis on HANSEN
We conduct three authorship analysis tasks using HANSEN and evaluate how existing SOTA methods perform in the context of "spoken" texts. Details on the AA, AV, and TT methods are provided in Tables 13, 14, and 15 in the Appendix. We run the experiments five times and report the average.

Author Attribution (AA)
Author Attribution (AA) is a closed-set multi-class classification problem that, given a spoken text T, identifies the speaker from a list of candidate speakers. We primarily perform AA on the human datasets, and additionally consider each LLM as an individual speaker. We utilize n-grams (character and word), stylometry features (WritePrint (Abbasi and Chen, 2008) + LIWC (Pennebaker et al., 2001)), FastText word embeddings with an LSTM, BERT-AA (Fabien et al., 2020), and fine-tuned BERT (BERT-ft) as AA methods.
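A minimal sketch of the character n-gram baseline, which turns out to be a strong AA method in our experiments: TF-IDF over character n-grams fed into a linear classifier. The tiny toy data, the (2, 4) n-gram range, and the logistic-regression classifier are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy placeholders; in practice these come from a HANSEN train/test split.
train_texts = ["so, err, I guess we should start", "let me be clear about the rule"]
train_speakers = ["spk_a", "spk_b"]
test_texts = ["err, right, so where were we"]
test_speakers = ["spk_a"]

# Character n-grams capture sub-word style cues, including fillers like "err".
aa_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
aa_model.fit(train_texts, train_speakers)
preds = aa_model.predict(test_texts)
print("macro F1:", f1_score(test_speakers, preds, average="macro"))
```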
Additionally, we evaluate the realistic scenario where each LLM acts as an individual speaker, performing AA on ten speakers (7 humans + 3 LLMs). Moreover, texts from the same LLM are generated using specific prompts (prompting the LLM to emulate an actual human persona), so an LLM is expected to exhibit a different persona in each of its instances. Therefore, we performed an ablation study using samples from the top 5 human speakers (those with the most samples) and their corresponding samples generated by ChatGPT, Vicuna13B, and PaLM2, resulting in a total of 20 classes for comparison. We employed conventional text classification evaluation metrics, including accuracy, macro F1 score, precision, recall, and area under the curve (AUC) score. Tables 4, 5, and 6 present the macro F1 scores and demonstrate that the character (char) n-gram performs best in most scenarios, with BERT-ft being a close contender. Our observations align with the AA findings of Tyo et al. (2022) on written text datasets. However, the performance of BERT-AA is subpar when directly applied to spoken text datasets, although fine-tuning improves its performance substantially.

Author Verification (AV)
Author verification (AV) is a binary classification problem that, given a pair of spoken texts (T1, T2), detects whether they were produced by the same speaker or different speakers. We use the PAN AV baselines, n-gram similarity and Prediction by Partial Matching (PPM), as well as current PAN SOTA methods: Adhominem (Boenninghoff et al., 2019), stylometric feature differences (Weerasinghe et al., 2021), and fine-tuned BERT. We use the traditional PAN evaluation metrics for AV (Bevendorff et al., 2022), including F1, AUC, c@1, F_0.5u, and Brier scores.
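A minimal sketch of the n-gram similarity baseline for AV: represent both texts as character n-gram TF-IDF vectors and threshold their cosine similarity. The background corpus, n-gram range, and the 0.5 threshold are illustrative assumptions; in practice the threshold is tuned on validation pairs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ngram_similarity(t1: str, t2: str, vectorizer: TfidfVectorizer) -> float:
    """Cosine similarity between the char n-gram profiles of two texts."""
    v = vectorizer.transform([t1, t2])
    return cosine_similarity(v[0], v[1])[0, 0]

# Fit the n-gram vocabulary on a background corpus of spoken texts.
background = ["so, err, I guess we should start", "let me be clear about the rule"]
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit(background)

same_speaker = ngram_similarity("erm, I did try", "err, I guess so", vec) >= 0.5
print(same_speaker)
```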

Turing Test (TT) for Spoken Text
We frame the human vs. AI spoken text detection problem as a Turing Test (TT): a binary classification problem that, given a spoken text T, identifies whether it is from a human or an AI (LLM). We have utilized several supervised and zero-shot detectors in our study: the OpenAI detector, RoBERTa-Large, DetectGPT, and GPT Zero. However, all methods exhibit limitations across various settings, with either low precision or low recall. For instance, GPT Zero demonstrates low precision scores in interview datasets and TED talks, suggesting that perplexity and burstiness measures may vary in spoken text. Furthermore, the low recall of RoBERTa-Large, specifically for ChatGPT, may be attributed to the distinct nature of spoken texts generated by ChatGPT compared to the written texts from previous GPT models.
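A sketch of running one of the supervised detectors over a spoken-text sample via the transformers pipeline. The checkpoint name refers to the public RoBERTa-based GPT-2 output detector and is an assumption here; substitute whichever detector checkpoint is being evaluated.

```python
from transformers import pipeline

# RoBERTa-based GPT-2 output detector (assumed checkpoint id).
detector = pipeline(
    "text-classification",
    model="openai-community/roberta-large-openai-detector",
)

text = "The Williams sisters have been amazing for tennis, and their rivalry..."
result = detector(text, truncation=True)[0]
print(result["label"], result["score"])  # e.g., a Real/Fake label with confidence
```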

Discussion
Character n-gram dominates in AA but struggles in AV: While the character n-gram performs best in AA for most datasets, it underperforms in the AV task. More intricate DL models, such as Adhominem, excel in AV, consistent with findings on written text datasets (Tyo et al., 2022). Larger DL models tend to outperform smaller models when datasets have more words per class, which benefits AV since it has only two classes (Tyo et al., 2022). Character n-grams outperform word n-grams by a notable margin (5%-10% in general), emphasizing the potential enhancement of AA performance through the inclusion of informal words (e.g., "eh," "err," "uhh").
AA and AV performance is dataset specific: Our study highlights significant performance variations in AA and AV across datasets, influenced by factors such as dataset type, domain, and modality. Daily-life conversation datasets (BASE, BNC, PAN) generally yield high F1 scores, except for MSU, which comprises simulated conversations with pre-defined topics, potentially limiting speakers' natural speaking styles. Conversely, talk-show-type datasets (USP, FTN) exhibit poor AA and AV performance due to a skewed distribution of samples from show hosts and the influence of Communication Accommodation Theory (Giles, 1973), which suggests that speakers adapt to the host's style.

Individuals' written and spoken samples are vastly different: Our results affirm the distinctions between individuals' written and spoken texts, consistent with prior corpus-based research (Biber et al., 2000; Farahani et al., 2020). We also conduct an ablation study on the SEC and PAN datasets, training on written texts only and testing on spoken texts (and vice versa). The results in Table 9 show a substantial decrease in the macro F1 score for both the character n-gram and BERT-ft. Stylometry features also exhibit poor performance, underscoring the stylistic differences between the two text forms and thus emphasizing the significance of separate authorship analysis for spoken text.
AI-generated spoken and written texts have different characteristics: Table 8 reveals a negative correlation between the MAUVE score and the detection rate of LLMs, challenging the assumption that higher MAUVE scores indicate harder-to-detect texts (Uchendu et al., 2023). Similarly, contrary to existing studies highlighting greater text diversity in humans than in LLMs (Guo et al., 2023), we observe the opposite trend: humans tend to exhibit more repetition in their speech. These findings call for further investigation into the specific characteristics of AI-generated spoken language.
How close are LLM-generated spoken texts to humans? The AA-10(L) and AA-20(L) experiments reveal a decrease in the F1 score compared to AA with all human speakers, indicating that LLM-generated spoken texts exhibit greater similarity to the other speakers in the experiments. This highlights the potential for further research in training LLMs to replicate individual spoken styles accurately.
Which LLM is the winner? Identifying the best LLM for spoken text generation remains an open question, influenced by multiple factors. PaLM2 demonstrates lower GRUEN and DIV (diversity) scores and is more easily detectable than the other LLMs across all datasets. In contrast, ChatGPT exhibits lower MAUVE scores, indicating generation outside the human distribution, yet its lower average recall value shows that it is nonetheless difficult to detect.

Conclusion
We present HANSEN, a benchmark of human and AI-generated spoken text datasets for authorship analysis. Our preliminary analysis suggests that existing SOTA AA and AV methods behave similarly on spoken texts in most scenarios. However, the stylistic differences between spoken and written texts from the same individuals, and the fact that AI-generated spoken texts show characteristics different from existing notions, emphasize the need for a more nuanced understanding of spoken text.

Limitations
While the HANSEN benchmark encompasses multiple human and AI-generated spoken text datasets, it is essential to note that they are currently limited to English. Spoken text structures and norms differ significantly across languages (Crystal, 2007). Consequently, the performance of authorship analysis techniques may vary when applied to other languages (Halvani et al., 2016). Additionally, numerous spoken text datasets exist in various settings that are not yet included in the HANSEN benchmark. Furthermore, due to the computational demands of generating texts from LLMs, our study focuses on three specific LLMs, while acknowledging the availability of other LLMs that could be explored in future research.
While our study focused on LLMs and their detection, it is important to acknowledge the ongoing arms race between the development of LLMs and LLM detectors (Mitchell et al., 2023). Therefore, our TT study may not encompass all LLM detectors, and it is possible that newly developed detectors outperform existing ones on these datasets. We deliberately refrained from fine-tuning any detectors on the spoken datasets, as we believe that the evaluation of detectors should be conducted in open-ended scenarios. Although human evaluation of AI-generated texts still presents challenges (Clark et al., 2021), we consider it a crucial area for future work. Given the subjective nature of speech (Berkun, 2009), incorporating human evaluations can provide valuable insights into the quality of generated texts and further enhance our understanding of their performance in real-world applications.

Ethics Statement
While the ultimate goal of this work is to build the first large-scale spoken text benchmark, including LLM-generated texts, we understand that this dataset could be used maliciously. We evaluate this dataset with several SOTA AA, AV, and TT models and observe that there is room for improvement. However, we also observe that these findings could be used by malicious actors to improve the quality of LLM-generated spoken texts for harmful speech, in order to evade detection. For these reasons, we release this benchmark in Hugging Face's dataset repository to encourage researchers to build stronger and more robust detectors that mitigate such potential misuse. We also believe that releasing this benchmark may enable other security applications that mitigate other risks posed by LLMs. Finally, we believe that the benefits of this benchmark outweigh the risks.
[...] a "good-usage" form. We have also checked for the existence of Personally Identifiable Information (PII) using off-the-shelf automated tools 7 for all three modules. While these tools found full names, affiliations, and program/website contact information, they did not find any sensitive PII, such as identification numbers or personal contact information.

Figure 5: Python code for loading HANSEN datasets from the Hugging Face API.
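A minimal sketch along the lines of Figure 5, assuming the datasets library; the configuration name "TED" is an illustrative example, and the repository card lists the available configurations.

```python
from datasets import load_dataset  # pip install datasets

# Load one HANSEN configuration; "TED" is an example subset name.
ted = load_dataset("HANSEN-REPO/HANSEN", "TED")
print(ted["train"][0])  # one labeled spoken-text sample
```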

B Selection of subsets for LLM prompting
Due to the computational costs of generating text from LLMs, we chose to work with a subset of the HANSEN human datasets. First, we select the datasets with sufficient metadata and well-known speakers (such as TED speakers, SEC commissioners, or tennis players). The TED, Spotify, and SEC datasets are mostly monologues (speech category), while CEO and Tennis are interview datasets.
For the TED dataset, we removed talks focused on music or instrumentals. Spotify and Tennis contain numerous samples in their original versions; therefore, we considered the top speakers with the most samples and used them for LLM generation. Also, for Spotify, we removed samples where the speaker is not an individual, such as a tutorial channel or multi-party collaborations. Similarly, for the CEO dataset, we considered the subset from CEO-Today magazine, since its questions are specific and guest-focused rather than about the overall financial situation, unlike the Wall Street Journal subset.

C Prompts used for generating spoken-text
In our experiments, we explored different prompting techniques to enhance the coherence and semantic similarity of the content generated by the LLMs. We observed that providing sufficient context with the prompt yielded better results. For the interview dataset categories, we included the original questions asked by the host. In the case of TED talks, we utilized the original talk description and the opening lines to allow the LLMs to learn the subtle style of the talks. We used the BART summarizer (Lewis et al., 2020) for the Spotify podcasts and SEC speeches to obtain summaries, which were then used as prompts. Additionally, we explicitly instructed the LLMs to generate plain text to avoid any unwanted formatting. Table 11 shows the prompts used for the different categories.
In the case of speeches, we instructed the LLMs to generate speech content using the same word count as the original speech, or up to the maximum allowed number of tokens (typically around 1024 tokens, or approximately 800 words, for the LLMs).
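A sketch of the summarize-then-prompt pattern described above, using the BART summarizer from the transformers library. The prompt wording, character cutoff, and length limits are illustrative assumptions; the actual prompts are listed in Table 11.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def build_speech_prompt(transcript: str, word_count: int) -> str:
    """Summarize the original speech, then request a plain-text spoken script."""
    summary = summarizer(
        transcript[:3000], max_length=130, min_length=30
    )[0]["summary_text"]
    return (
        f"Write the transcript of a speech of about {word_count} words, "
        f"in plain text without any formatting, covering these points: {summary}"
    )
```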
D More about evaluating the spoken nature of the AI-generated texts

In Section 3, we provide a brief analysis comparing LLM-generated spoken vs. written texts (Table 3) in parallel with human spoken vs. written texts, and show how the differences align with those of humans. To further validate this, we performed a small-scale study on a subset (100 samples each) of the LLM-generated spoken (Ls) and corresponding written (Lw) texts for the TED dataset, together with human spoken (Hs) and written (Hw) texts from different domains (Hw from the XSum dataset (Narayan et al., 2018): news articles; Hs from the BASE dataset in HANSEN: academic conversations) to show domain independence. We measure the stylistic difference using the cosine distance of the features in Table 10. We observe dist(Hw, Lw) < dist(Hw, Ls) and dist(Hs, Ls) < dist(Hs, Lw), which provides an overall indication that our LLM-generated spoken datasets are more similar to spoken language than to written language. Figure 6 portrays the distribution differences between some features in the written and spoken samples.
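A sketch of this distance computation, assuming a feature extractor that maps a text to a numeric stylometric vector (e.g., a style_features-type extractor as sketched earlier): average the vectors over each set, then take the cosine distance between the set-level centroids.

```python
import numpy as np
from scipy.spatial.distance import cosine

def centroid(texts, feature_fn):
    """Mean stylometric feature vector over a set of texts."""
    return np.mean([feature_fn(t) for t in texts], axis=0)

def style_distance(texts_a, texts_b, feature_fn):
    """Cosine distance between the centroids of two text sets.
    dist(Hw, Lw) < dist(Hw, Ls) would indicate the generated written texts
    sit closer to human written style than the generated spoken texts do."""
    return cosine(centroid(texts_a, feature_fn), centroid(texts_b, feature_fn))
```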

E Experimental details for authorship analysis
The details of our AA, AV, and TT methods are discussed in Tables 13, 14, and 15. Since the development of new methods is not the primary purpose of our paper, we used them with their default configurations and hyper-parameter settings in most cases. For text generation with the LLMs, we used top_p = 0.7 to ensure more creativity while maintaining coherent text, and set max_tokens to the highest limit for each LLM.
We applied a pre-processing step to achieve a uniform text length range for all authorship tasks and ensure compatibility with the various methods. This involved removing samples from datasets that fell below specified thresholds (100 for AA and AV, 200 for TT), since certain methods, such as fine-tuned BERT, Adhominem, or the OpenAI detector, require specific text lengths. We sometimes combined multiple samples from the same conversation or interview to reach the desired length, as in datasets like MSU or CEO. Additionally, we employed sentence splitting for large samples, such as TED or Spotify, to create new samples whose text lengths remained within the maximum allowed token limits (approximately 1000 tokens) for the different methods.
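A sketch of this length normalization, under the assumption that samples from one conversation arrive in order: short samples are merged with their successors, and long buffers are chunked. The word-count thresholds mirror the ones above; the chunking strategy itself is illustrative.

```python
def enforce_length(samples, min_words=100, max_words=1000):
    """Merge short samples and split long ones into uniform word-count chunks."""
    out, buffer = [], []
    for text in samples:
        buffer += text.split()
        if len(buffer) >= min_words:
            # Emit max_words-sized chunks, keeping any short remainder buffered.
            while len(buffer) >= max_words:
                out.append(" ".join(buffer[:max_words]))
                buffer = buffer[max_words:]
            if len(buffer) >= min_words:
                out.append(" ".join(buffer))
                buffer = []
    return out  # any trailing fragment below min_words is dropped
```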
Table 16 shows the AA results for both the small and large versions of the datasets. We consider a different (larger) N for these datasets to ensure substantial samples per class for classification. We observe that performance drops more substantially for Transformer-based methods when the number of speakers N increases, validating that DL methods underperform when per-class word counts decrease (Tyo et al., 2022). Also, the character n-gram performs considerably better than the word n-gram in all scenarios. Similarly, Table 17 shows the AV results with different metrics for all datasets. While we observe a similar trend for the classifiers in both the AA and AV tasks compared to written text, the overall performance of these methods is lower than on various written datasets, as observed in previous studies (Stamatatos, 2009; Neal et al., 2017; Abbasi et al., 2022; Tyo et al., 2022). This leaves room for further investigation and a more nuanced analysis of spoken texts.
We have run the experiments for each setup five times for the AA, AV, and TT tasks and report the average in all tables. In most cases, we get identical results for the AA and AV methods, except for those that include random initialization, such as the character n-gram for AV with grid search. However, the standard deviation of these results lies in the 0.0001-0.005 range, which does not impact the overall comparison of results. Although we observe minor fluctuations in the text properties across different runs within a limited subset of our testing (e.g., the z-score from DetectGPT or the perplexity from GPT Zero), the output indicating the likelihood of AI/human origin remains consistent across all our experimental scenarios. Table 12 summarizes the findings for the TT methods.

Table 12: Effect of multiple runs on the TT tasks. A change of labels was not observed for any method.
A discussion about the written and spoken texts of humans: The overall dimensions of, and differences between, written and spoken language from a statistical perspective are a well-studied problem in corpus linguistics (Biber, 1991; Biber et al., 2000; Brown et al., 1983). While there is a notice-

PROMPT: Create a post-match interview with tennis player Roger Federer who . . . <rest of the prompt>. The reporter will ask exactly the following questions: Q: . . .
Q: What do you think the Williams sisters' rivalry with one another has meant to the sport?
HUMAN: I don't know. I'd say the players themselves have been great for the sport. I don't know if the matches themselves is what people remember that much.
ChatGPT: The Williams sisters have been amazing for tennis, and their rivalry has been a huge part of that. They've both achieved so much in their careers, and they've pushed each other to be better.
PaLM2: The Williams sisters' rivalry with one another has been one of the most compelling and enduring in tennis history. The rivalry has meant a great deal to the sport of tennis.
Vicuna13B: The Williams sisters' rivalry has been incredible for the sport. It's been amazing to see the two of them compete against each other and push each other to be their best.

Figure 1: Snippet of the HANSEN AI-text samples. Given the actual question (part of the prompt), the HUMAN (Roger Federer) answer vs. three LLMs: ChatGPT, PaLM2, and Vicuna13B.

Table 1: Summary of the HANSEN human datasets.
CEO (interview): contains interviews with the CEOs of companies and other financial persons associated with stock markets. We created it from the publicly available transcripts of these interviews from three different sources: CEO-Today magazine, the Wall Street Journal, and the Seeking Alpha website.

Table 2: LLM-generated samples in each category.

1 Links to the websites and other details are in the Appendix.

Table 4: Macro F1 score for different AA methods with a variable number of speakers. 10(L) indicates the setup where each LLM has been considered a speaker (7 human speakers with 3 LLMs). Bold and underlined values represent each dataset's highest and second-highest performing method (based on macro F1).
Table 7 reports the F1 scores for several datasets. Unlike in AA, Adhominem shows the best performance in AV, with fine-tuned BERT also performing notably well.

Table 6: AA results for N=10 speakers on different datasets. Results with higher values of N are presented in the Appendix.

Table 7: AV results (F1 score) for several datasets. Results for other metrics and datasets are presented in the Appendix.

Table 8: The MAUVE score (M) and average recall value (R) (considering all detectors) for each LLM. A higher M means the distribution is more similar to humans'. A higher R means that the LLM is more easily detectable. The violet color indicates better performance for an LLM (higher MAUVE score and lower recall value), while the red color indicates worse performance (and vice versa).


Table 9: Results on combining both spoken and written samples from the same individuals. SEC/PAN (w) specifies that the training set contained written texts only and the test set contained spoken texts by the same speakers; SEC/PAN (s) specifies the reverse. Each cell value is the macro F1 score and the difference in F1 score (∆) from the original PAN/SEC dataset in the AA-10 class problem.

Table 10: TT results on different datasets (Speech (S) or Interview (I)) for three LLMs: ChatGPT (gpt3.5), PaLM2, and Vicuna-13B, with precision, recall, and F1 score given sequentially in each cell. Bold scores indicate the highest performing method (based on F1 score). Underlined scores indicate low precision (predicting most texts as AI-generated). Bold & underlined scores indicate low recall (unable to detect most AI-generated texts).