It’s not Sexually Suggestive; It’s Educative | Separating Sex Education from Suggestive Content on TikTok videos



Introduction
In short-form video platforms such as TikTok, accurately identifying sexually suggestive and sex education content amidst a sea of diverse video types poses a significant challenge. In this paper, we delve into this problem, focusing specifically on TikTok, the most downloaded app in 2022, which has a substantial user base of early adolescents and young individuals (10-19: 32.5%, 20-29: 29.5%). The distinction between suggestive videos and virtual sex education holds crucial significance on multiple fronts. Adolescent sex education in the United States is delivered in a fragmented and often inadequate system, which has long been the subject of intense criticism and is vulnerable to political influence (Fowler et al., 2021). In this context, TikTok presents a novel and promising avenue for conveying comprehensive and accessible sexual health information to adolescents, offering a convenient, private, and inclusive space for learning and discussion (Fowler et al., 2022). At the same time, children's exposure to sexual media content has been found to influence attitudes and contribute to the formation of adversarial sexual beliefs (Collins et al., 2017).

(1) Educative (Description): Video featuring a man discussing a topic while a prominent illustration of a p*n*s with pearly penile papules serves as the background.
(2) Suggestive (Description): Video shows a man holding a pumpkin over his torso while a woman enthusiastically moves her hand inside, exclaiming, "There is so much in there."
(3) Educative (Transcript): The average banana in the United States is about 5.5 inches long. That's the perfect size for baking banana bread most of the time because ...
(4) Suggestive (Transcript): You are such a good boy. Daddy's so proud of you.

Table 1: Examples from the dataset; the first two are descriptions, and the latter two are video transcripts.
Unfortunately, efforts to moderate explicit content have had unintended consequences, as studies have demonstrated the misidentification of non-explicit content due to flawed algorithms and filtering techniques (Peters, 2020). In addition, video creators (referred to as creators from now on) may also be susceptible to mass reporting. Creators from marginalized communities, particularly those within the LGBTQIA+ community, face heightened risks of having their educational content wrongfully flagged or removed.
The classification of sexually suggestive and sex education videos presents a complex task, as demonstrated by the examples shown in Table 1.
In example 1, we see that the p*n*s illustration is not suggestive, while the video of a man holding a pumpkin in example 2 is suggestive. When we look at the transcripts, we see that in example 3, the creator is talking about myths around p*n*s sizes for pleasurable sex, and in example 4, the audio is suggestive. Considering these complexities, accurately categorizing sexually suggestive and sex education videos necessitates a nuanced understanding of contextual cues, subjectivity, evolving language, and robust algorithmic solutions.
The contributions of the paper are as follows: 1. Introduction of SexTok: A collection of 1000 TikTok videos labeled as Sexually Suggestive, Sex Education, or Others, along with perceived gender expression and transcription.

2. Baseline Evaluation: We evaluate two transformer-based classifiers as baselines for the task of classifying these videos. Our results indicate that accurately distinguishing between these video types is a learnable yet challenging task.

Trigger Warning: Sexual Content and Explicit Language
Please be advised that this research paper and its associated content discuss and analyze sexually suggestive and sex education videos. The examples and discussions within this paper may contain explicit or implicit references to sexual acts, body parts, and related topics. The language used may sometimes be explicit. This material is intended for academic and research purposes and is presented to address challenges in content identification and classification.

Related Work
Most work on nudity detection focuses on skin-colored region segmentation, a methodology that has been extensively explored in the image domain (Fleck et al., 1996; Wang et al., 2005; Platzer et al., 2014; Garcia et al., 2018; Lee et al., 2006). Ganguly et al. (2017), apart from the percentage of skin exposure, also gave attention to the body posture of the human in the image and the person's gestures and facial expressions. An alternative strategy is the Bag of Visual Words model, which aims to minimize the semantic gap between low-level visual features and high-level concepts about pornography (Deselaers et al., 2008; Lopes et al., 2009; Ulges and Stahl, 2011; Zhang et al., 2013). Approaches based on motion analysis additionally capture motion cues, such as periodicity in motion (Rea et al., 2006). Zuo et al. (2008) use a Gaussian mixture model (GMM) to recognize porno-sounds and a contour-based image recognition algorithm to detect pornographic imagery, combining the two for the final decision. Yet sexual activity where the person is mostly clothed or exhibits minimal movement remains challenging to detect. Peters (2020) studied issues surrounding publicly deployed moderation techniques and called for reconsidering how platforms approach this area, especially given their high false positive rates and/or low precision for certain types of actions.

SexTok Dataset
This section presents the SexTok dataset, a collection of 1000 TikTok video links accompanied by three key features: Class Label, Gender Expression, and Audio Transcription.

Class Label
The first feature, Class Label, is a categorical variable with three possible values: Sexually Suggestive, Sex Education, and Others.
Sexually Suggestive: This category encompasses videos that purposefully intend to elicit a sexual response from viewers. Determining the presence of sexually suggestive content is subjective.
Sex Education: This category encompasses videos aimed at enhancing viewers' knowledge, skills, and attitudes concerning sexual and reproductive health.It covers various topics, including but not limited to sexual orientation, gender, and gender-affirming care.
Others: This category encompasses videos that do not fall within the aforementioned sexually suggestive or sex education categories.

Gender Expression
Gender expression is a form of self-expression that refers to how people may express their gender identity (Summers, 2016).In this paper, we focus solely on the physical visual cues associated with gender expression.We provide five gender expression labels in the dataset: Feminine, Masculine, Nonconforming, Diverse, and None.
Feminine and Masculine represent predominantly feminine or masculine expressions, while Non-conforming refers to expressions that deviate from traditional norms.Diverse applies to videos with varying gender expressions among multiple individuals.The None label is for videos without people or only limited visual cues like hands.
The information for the vast majority is not self-reported. When available through the video itself, profile descriptions, or hashtags, we incorporate that information. Otherwise, the annotation is based on the perception of the annotator. This feature is provided only to serve the purpose of evaluating bias in models built on the dataset.

Data Collection
The data collection process involved the primary annotator creating a new TikTok account and interacting with the platform in various ways to collect the video links.They carefully watched and hand-selected videos.

Annotator Agreement
A 10% sample of the dataset was independently annotated by a second author to ensure reliability. Cohen's Kappa scores (Cohen, 1960) were used to assess annotator agreement. For Gender Expression, the Kappa score was 0.89, indicating substantial agreement. For Class Label, the Kappa score was 0.93, indicating high agreement. These scores validate the consistency and quality of the dataset's annotations.
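As a concrete illustration, an agreement check of this kind can be computed with scikit-learn's `cohen_kappa_score`; the label lists below are invented toy data, not the dataset's actual annotations.

```python
# Sketch: computing Cohen's kappa between two annotators, as done for the
# 10% reliability sample. The labels here are invented for illustration.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["Suggestive", "Educative", "Others", "Others", "Educative", "Suggestive"]
annotator_b = ["Suggestive", "Educative", "Others", "Educative", "Educative", "Suggestive"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```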

Data Processing: Video download and Audio transcription
The videos were downloaded without the TikTok watermark using a TikTok downloader. The watermark was removed to reduce unnecessary noise in the data.
A smaller sample of videos was first transcribed using OpenAI's Whisper (medium) (Radford et al., 2022) and manually checked for accuracy. The transcriptions were mostly accurate, with a word error rate of 1.79%. After this, all the videos were automatically transcribed using OpenAI's Whisper (medium).
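A hedged sketch of this pipeline step: the transcription call uses the `openai-whisper` package, while the word-error-rate helper below is a minimal hand-rolled Levenshtein implementation, not necessarily the exact tooling the paper used.

```python
# Sketch: manual WER check of Whisper transcriptions. The helper is a
# minimal word-level Levenshtein WER, shown here as one way to obtain
# a figure like the 1.79% reported above.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Transcription itself (requires the openai-whisper package and a video file):
# import whisper
# model = whisper.load_model("medium")
# text = model.transcribe("video.mp4")["text"]
```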

Dataset Properties
In this section, we provide some general statistics about the SexTok dataset. The dataset comprises 1000 TikTok video links with the three features described above; the distribution of videos by dataset split and class label is given in Table 2.
When the audio was transcribed, a percentage of videos were found to have no text in the audio transcription: Suggestive: 15.85%, Educative: 3.97%, Others: 8.4%.
We also observe that suggestive videos tend to be shorter (median duration: 7.86 secs) and have shorter audio transcriptions (median: 14 words), compared to educative videos, which are longer (median duration: 50.80 secs) and have longer audio transcriptions (median: 171.5 words). Detailed video length and transcription length statistics are given in Appendix A.
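Statistics of this kind reduce to simple per-class medians; a minimal sketch with invented toy values (not the dataset's real durations or word counts):

```python
# Sketch: per-class median duration and transcript length, the summary
# statistics reported above. All numbers here are invented toy values.
from statistics import median

durations_secs = {
    "Suggestive": [5.2, 7.86, 9.1],
    "Educative": [42.0, 50.8, 61.3],
}
transcript_words = {
    "Suggestive": [10, 14, 22],
    "Educative": [120, 171, 204],
}

for label in durations_secs:
    print(label,
          "median secs:", median(durations_secs[label]),
          "median words:", median(transcript_words[label]))
```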

Experimental Setups
In this section, we evaluate the performance of pretrained transformer-based models on the SexTok dataset to assess its significance.The experiments are divided into two subsections: text classification using video transcripts and video classification.
For both transformer-based setups, we utilized models downloaded from Hugging Face Transformers (Wolf et al., 2020), initialized with three random seeds. Details on hyperparameters are in Appendix C. The reported results are the average of three runs. To assess performance, we employed four sets of metrics: (1) accuracy; (2) micro precision, recall, and F1 (excluding Others as a negative class); (3) macro precision, recall, and F1; and (4) the overall F1 for each class.
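The four metric groups can be computed with scikit-learn; a hedged sketch follows, where the predictions are invented and `positive` marks the two non-Others classes for the micro scores, as described above.

```python
# Sketch: the four metric groups described above. Label names match the
# dataset; the true/predicted labels are invented for illustration.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, f1_score

y_true = ["Suggestive", "Educative", "Others", "Others", "Educative"]
y_pred = ["Suggestive", "Others", "Others", "Others", "Educative"]
positive = ["Suggestive", "Educative"]  # Others is treated as the negative class

acc = accuracy_score(y_true, y_pred)
# Micro scores restricted to the positive classes only.
micro = precision_recall_fscore_support(y_true, y_pred, labels=positive, average="micro")
macro = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
per_class_f1 = f1_score(y_true, y_pred, average=None,
                        labels=["Suggestive", "Educative", "Others"])
print(acc, micro[:3], macro[:3], per_class_f1)
```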

Text Classification using Video Transcript
We fine-tuned bert-base-multilingual-cased (Devlin et al., 2018) to perform text classification on the video transcripts. Since a small percentage of videos do not yield any text in their transcription, we experimented with two setups: one with all video transcriptions and the other with only non-empty transcriptions.
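A sketch of this setup with Hugging Face Transformers; the training arguments and paths below are illustrative placeholders, not the paper's exact hyperparameters (those are in Appendix C).

```python
# Sketch: 3-way text classification over transcripts with
# bert-base-multilingual-cased. Hyperparameters here are illustrative.
labels = ["Suggestive", "Educative", "Others"]
label2id = {name: i for i, name in enumerate(labels)}
id2label = {i: name for name, i in label2id.items()}

# Model and training loop (requires the `transformers` package; commented
# out because it downloads weights and needs a prepared dataset):
# from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
#                           TrainingArguments, Trainer)
# tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# model = AutoModelForSequenceClassification.from_pretrained(
#     "bert-base-multilingual-cased",
#     num_labels=len(labels), id2label=id2label, label2id=label2id)
# args = TrainingArguments(output_dir="sextok-text", num_train_epochs=3,
#                          per_device_train_batch_size=16, seed=42)
# Trainer(model=model, args=args, train_dataset=train_ds).train()
```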

Video Classification
We fine-tuned MCG-NJU/videomae-base, a VideoMAE base model (Tong et al., 2022), for video classification. The image clips were randomly sampled and preprocessed to align with the default configurations of the model.
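A sketch of the clip preparation: VideoMAE base expects a fixed 16-frame clip, so one common preprocessing step is sampling frame indices from a video of arbitrary length. The uniform sampler below is an assumption for determinism; the paper samples clips randomly.

```python
# Sketch: preparing clips for MCG-NJU/videomae-base, which takes 16-frame
# inputs. Uniform index sampling shown here; the paper samples randomly.
def sample_frame_indices(num_video_frames: int, clip_len: int = 16) -> list[int]:
    """Spread `clip_len` indices evenly across the available frames."""
    step = num_video_frames / clip_len
    return [min(int(i * step), num_video_frames - 1) for i in range(clip_len)]

# Model side (requires the `transformers` package; commented out because
# it downloads weights):
# from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
# processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
# model = VideoMAEForVideoClassification.from_pretrained(
#     "MCG-NJU/videomae-base", num_labels=3)
# inputs = processor(list_of_sampled_frames, return_tensors="pt")
```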

Results and Error Analysis
The average performance and standard deviation of the models are presented in Tables 4 and 5. Based on these results, we draw the following observations: • The most accurate model is the text classifier evaluated on videos with a transcription (75%). It demonstrates relatively better performance in identifying educative content but often confuses suggestive content with others, and vice versa. However, this setup is not realistic in a real-world scenario, as TikTok videos vary in terms of sound presence and spoken language.
• Both text-based classifiers exhibit higher F1 scores than the video classifier for the Educative and Others classes, but their performance in detecting suggestive content is comparatively lower than that of the video classifier.
• Notably, neither of the text-based classifiers misclassifies suggestive content as educative, or vice versa, as evident from the confusion matrices in Appendix C.
• The video classifier achieves the highest F1 score for the Suggestive class.However, it frequently confuses Educative and Other videos with each other.
To further understand the hard examples for the model, we manually categorized the errors in both text and video classification experiment setups.
We analyzed 54 errors of the text classification model. If more than one option was applicable, the video was counted in both:
(a) Audio unrelated to class label (50.00%): The audio in these videos consisted of popular songs or speeches that did not contain any words typically associated with the class label.
(b) Context clues and euphemism (24.07%): These videos relied on context clues or euphemistic language (9.26%), or required audio analysis considering tone and intonation to predict the class label (14.81%).
(c) No or partial transcription (14.81%): Approximately 9.26% of the videos had no audio that could be transcribed, while 5.56% had only partial transcriptions available.
We also analyzed 52 errors of the video classification model. All educative videos that were classified as others, and vice versa, shared the format common to both classes, i.e., a person looking at the camera and speaking. Of the 11 suggestive videos that were not classified correctly, 63% had some or all video frames featuring fully or mostly clothed people. A detailed analysis using Transformers Interpret (Pierse, 2021) (Appendix D) also shows that the text classifier exhibits some signs of overfitting to the text.
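The Transformers Interpret analysis can be sketched as follows; the explainer call is commented out because it needs the fine-tuned model, and the small ranking helper (a hypothetical utility, not part of the library) just sorts the (token, attribution) pairs the explainer returns.

```python
# Sketch: inspecting which tokens drive a text-classification decision with
# Transformers Interpret (Pierse, 2021). `top_tokens` is a hypothetical
# helper that ranks the (token, score) pairs returned by the explainer.
def top_tokens(attributions: list[tuple[str, float]], k: int = 3) -> list[str]:
    """Return the k tokens with the largest positive attribution scores."""
    ranked = sorted(attributions, key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in ranked[:k]]

# Explainer usage (requires the transformers-interpret package and a
# fine-tuned model/tokenizer pair):
# from transformers_interpret import SequenceClassificationExplainer
# explainer = SequenceClassificationExplainer(model, tokenizer)
# word_attributions = explainer("The average banana is about 5.5 inches long")
# print(top_tokens(word_attributions))
```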

Discussion
The results highlight the complexity of accurately identifying sexually suggestive and educative videos on platforms like TikTok. While the results indicate that text analysis can contribute to detecting educative videos, music clips unrelated to the video topic are commonly used, making reliance on transcription alone insufficient. While existing work in pornographic content detection primarily focuses on visual analysis, our results indicate the need for a multi-modal approach, since detecting sexual content requires a more comprehensive understanding encompassing multiple modalities, including audio, speech, and text. Addressing these challenges is crucial for developing effective content moderation systems, ensuring appropriate access to sex education, and creating a safer and more inclusive online environment. It is also crucial to be mindful of potential gender expression bias commonly found in visual datasets (Meister et al., 2022). Moreover, for tasks like this, developing scalable solutions suitable for large-scale systems with millions of users is crucial for effective implementation. Further exploration and investigation of these aspects are left for future research and development.

Conclusion
This paper introduces a novel task of identifying sexually suggestive and sex-educative videos and presents SexTok, a multi-modal dataset for this purpose. The dataset includes video links labeled for sexual suggestiveness, sex-educational content, and an Others category, along with gender expression and audio transcription. The results highlight the challenging and multi-modal nature of the task and suggest that while the dataset is meaningful and the task is learnable, it remains a challenging problem that deserves future research. This work contributes to promoting online safety and a balanced digital environment.

Acknowledgement
This work was partially funded by the LGBTQ+ Grad Student Research Funds by the Institute for LGBTQ Studies at the University of Arizona. We deeply appreciate the invaluable contributions of Shreya Nupur Shakya throughout this work.

Limitations
We address the limitations of the SexTok dataset and the accompanying experiments here.

SexTok Dataset
• The TikTok account was created and used from a specific geographic location (which will be disclosed in the final version if accepted). This is important to note since TikTok's content recommendation is influenced by geographic location, among other things; hence a geographic bias may be expected, i.e., certain demographics may be more represented than others, especially in terms of languages used, race, ethnicity, etc.
• The data gathered only represents a small sample of the content available on TikTok and may not represent the entire population of TikTok users or videos.
• Sexual suggestiveness is treated as a discrete class label in this project, whereas in the real world it has two important properties: 1) the perception of what is sexually suggestive may vary depending on the individual's sexual orientation, worldview, culture, location, and experiences, and is highly subjective; 2) some videos are more suggestive than others, and we do not account for variation in the strength of suggestiveness here.
• The dataset is a small snapshot of TikTok videos from October 2022 to January 2023. Patterns, slang, and other cues may change over time.
• Gender expression has many variations but is represented here as discrete labels, which it is not in real life. Additionally, it is as perceived by one annotator and, for the majority, not self-reported by the person in the video. Additional expert annotators may be needed to strengthen confidence in the labels.
• Despite best efforts, it is possible that the same creator appears more than five times, because creators often create multiple accounts to serve as backups in case TikTok takes down the original account. This is observed to be increasingly common in the sexually suggestive and sex education domains. We show an example in Figure 2.

Other details: The audio content of the TikTok videos comprises various elements, including background music, spoken dialogue (not necessarily from the video creator), or a combination of both.
Notably, TikTok provides voice effects that enable users to modify their voices using predefined options.

Experiments
• The audio transcription of the videos was created automatically using OpenAI's Whisper (medium) (Radford et al., 2022) and is hence subject to errors, which may impact the performance of the models.
• For training the models, GPU computing power was used.

Ethics Statement
We address the ethical considerations and consequences of the SexTok dataset and the accompanying experiments here.
• The study's focus is on the technical aspects of the problem.It does not address the broader societal and ethical implications of censorship and of regulating sexually suggestive content on social media platforms.The work only aims to detect sexually suggestive content and sex education content against other video topics but makes no stand on censorship or content regulation of sexually suggestive videos.
• Sexual suggestiveness, as well as perceived gender expression, is a subjective matter and is hence susceptible to annotators' bias.
• Gender expression, specifically visual cues only, was annotated and offered only to evaluate bias based on visual cues since such biases are known to exist within large-scale visual datasets (Meister et al., 2022).The authors do not condone the practice of assigning gender identity based on a person's external appearance since gender is an internal sense of identity (Association, 2015).This dataset is not intended to be used for any such practices.
• Due to the nature of the problem and potential licensing issues, the publicly collected data is not anonymized.

D Transformer Interpret

Figure 1: Two screenshots from videos in the dataset. On the left, Nyko (@kingnyko2022) addresses a question about his gender transition. The right is from a sexually suggestive video.

Figure 2: A partial screenshot from an audio profile page on TikTok. Each rectangle is a cover image of a video that uses the same audio. The text on the bottom left of each video is the username of the creator of that video. We can see that the same person has multiple accounts posting the same video.
Refer to Figure 3 on the next page.

Figure 3: Three example transcriptions and their prediction explanations, visualized using Transformers Interpret, a model explainability tool.

Table 2: Video Distribution by Dataset Split and Class Label. Sugg: Suggestive, Edu: Educative. The dataset consists mostly of general videos that do not fall into the categories of sexually suggestive or educative, reflecting a more realistic representation of TikTok's environment.

Table 5: Overall F1 of each class label, with the average and standard deviation of three random runs. Text-based classification gives a higher F1 for educative content when transcription is present, while suggestive content is detected best by the video classifier, which misclassifies educative content at a higher rate.

Table 6: Mean, median, and standard deviation of the number of words in video transcripts. Words were tokenized using the NLTK package. Sugg: Suggestive, Edu: Educative. Suggestive videos tend to have significantly shorter transcripts than the other classes.

Table 7: Mean, median, and standard deviation of video durations in the dataset, in seconds. Sugg: Suggestive, Edu: Educative. Suggestive videos tend to be significantly shorter than the other classes.

Table 8: Hyperparameters used for the text classification task.

Table 9: Hyperparameters used for the video classification task.