Utilizing Weak Supervision to Generate Indonesian Conservation Datasets

Weak supervision has emerged as a promising approach for rapid, large-scale dataset creation in response to the increasing demand for accelerated NLP development. By aggregating the outputs of labeling functions with a learned label model, weak supervision allows practitioners to quickly produce soft-labeled datasets. This paper shows how such an approach can be used to build Indonesian NLP datasets from conservation news text. We construct two types of datasets: multi-label classification and sentiment classification. We then provide baseline experiments using various pre-trained language models. These baselines reach test performances of 59.79% accuracy and 55.72% F1-score for sentiment classification, and 66.87% macro F1-score, 71.5% micro F1-score, and 83.67% ROC-AUC for multi-label classification. Additionally, we release the datasets and labeling functions used in this work for further research and exploration.


Introduction
Labeled datasets play a crucial role in Natural Language Processing (NLP) tasks. However, generating large-scale labeled datasets of high quality remains a significant challenge. In addressing this challenge, weak supervision has emerged as a promising approach (Ratner et al., 2016; Zhang et al., 2022) that creates labeled datasets by leveraging weak classifiers and aggregating their outputs into noisy labels that approximate the unobserved ground truth. Empirical studies (Ratner et al., 2019, 2020; Ren et al., 2020; Lan et al., 2020; Lison et al., 2021) have demonstrated the competitiveness of this approach compared to manual data collection processes. Additionally, benchmarks (Zhang et al., 2021) have been established to further evaluate new weak-supervision approaches.

Weak supervision for dataset creation can be especially useful for under-resourced languages such as Indonesian. The current approach of manually labeling every sample is a painful process that realistically yields relatively small amounts of data when budgets are limited. Another approach is crowdsourcing (Cahyawijaya et al., 2022), but it rests upon many contributors whose quality and thematic focus can vary widely. To add more downstream-task resources for the Indonesian language, we seek to build larger datasets using weak supervision.

In this paper, we utilize Datasaur's weak supervision framework, Data Programming, to facilitate the creation of multiple diverse Indonesian downstream datasets with conservation as the overarching theme. Our focus is leveraging the Mongabay conservation news collection for two NLP tasks: multi-label classification and sentiment classification. We construct a hashtag classification dataset for multi-label classification, considering the importance of hashtag classification in organizing content and enhancing searchability and granularity in the editorial realm. We then utilize the hashtag classification dataset to build a sentiment classification dataset by mapping groups of hashtags to sentiment categories (positive, neutral, negative). For both datasets, we employ a range of simple labeling functions (Ratner et al., 2020) through Datasaur's Data Programming. Our methodology encompasses dataset construction, weak-labeled dataset learnability experiments with various BERT pre-trained models, and performance analysis of the labeling functions. We further elaborate our approach, outlining the dataset and labeling function collections, along with the associated benefits and constraints.

Related Work
Weak Supervision. Machine learning requires large amounts of labeled data, which can be costly and difficult to scale. To reduce the cost of building a dataset, weak supervision uses heuristic approaches to generate a training set with noisy labels from multiple sources (Ratner et al., 2016, 2020; Alexander et al., 2022). The result is a dataset with probabilities as labels, also called soft labels. The idea behind this approach is to encode expert knowledge into labeling functions. These labeling functions serve as weak classifiers: individually they cannot yield good predictions, but used in tandem with many other labeling functions they can effectively approximate the unobserved ground truth. The process concludes with a generative model that models the labeling functions by taking into account the agreement and disagreement between them (Ratner et al., 2016, 2020; Alexander et al., 2022); we refer to this generative model as the label model. The output of a label model is a noisy signal that estimates the true labels and can be used to predict the soft labels of a sample. The results have shown considerable gains and are highly applicable in real-world industries for reducing the cost of hand-labeled data (Ratner et al., 2016, 2020).
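To make the aggregation step concrete, here is a minimal sketch, assuming a plain majority-vote label model rather than the full generative model of Ratner et al. (2016); the function name, ABSTAIN convention, and example matrix are ours for illustration.

```python
import numpy as np

ABSTAIN = -1  # labeling functions may abstain when no rule fires

def soft_labels_from_votes(L, n_classes):
    """Aggregate a labeling-function vote matrix L (n_samples x n_lfs)
    into per-sample probability distributions via majority voting.
    Entries of L are class indices in [0, n_classes) or ABSTAIN."""
    probs = np.zeros((L.shape[0], n_classes))
    for i, votes in enumerate(L):
        votes = votes[votes != ABSTAIN]
        if len(votes) == 0:
            probs[i] = np.full(n_classes, 1.0 / n_classes)  # no signal: uniform
        else:
            counts = np.bincount(votes, minlength=n_classes)
            probs[i] = counts / counts.sum()
    return probs

# Example: 3 samples, 3 labeling functions, 2 classes
L = np.array([[0, 0, 1],
              [ABSTAIN, 1, 1],
              [ABSTAIN, ABSTAIN, ABSTAIN]])
print(soft_labels_from_votes(L, n_classes=2))
```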
Several related works used the Ratner et al. (2016) version of weak supervision to build training sets. In Ratner et al. (2016), experiments were conducted on the 2014 TAC-KBP Slot Filling dataset, achieving a 6 F1-point gain over a state-of-the-art LSTM. In another study, weak supervision was used to build the ORCAS dataset for user intent classification (Alexander et al., 2022). The ORCAS dataset consists of query ID, query, document ID, and clicked URL; a sample of 2 million data points was used for labeling. The researchers conducted experiments with machine learning models and evaluated the results, finding competitive performance and high efficiency on real-world problems, where labeling functions can be easily executed for every issued query. Finally, a user study showed that using a weak supervision pipeline can increase predictive performance faster than seven hours of hand labeling (Ratner et al., 2020).
Sentiment Classification. Sentiment classification is a classification task that extracts the polarity or sentiment expressed in text (Davidov et al., 2010). Several works used data from social media, such as Twitter, for classification tasks and leveraged hashtags as additional features alongside the tweet content (Davidov et al., 2010; Devi et al., 2019; Diao et al., 2023). Davidov et al. (2010) proposed a supervised sentiment classification framework using Twitter tags and 15 smileys as features, showing good performance in labeling data without manual annotation. Another study (Devi et al., 2019) explored hashtags and tweet content to predict which hashtags will become trending in the future; using machine learning approaches, hashtags as features contributed to better model results. Diao et al. (2023) also used hashtags as auxiliary signals to obtain labels for their data. They generated meaningful hashtags from the input text to produce new inputs for the model. Their hashtag generator uses an encoder and decoder to predict hashtags, and the results showed significant improvements in tweet classification tasks.
Multi-Label Classification. In comparison to other text classification research, the field of multi-label classification remains relatively underexplored. Nonetheless, it constitutes a valuable NLP task for extracting metadata from extensive textual datasets, such as research papers and articles. For instance, Li and Ou (2021) leveraged a KNN-based model to address the challenges posed by multi-label classification of research papers.

Building Dataset Using Weak Supervision
Building large datasets using weak supervision has been demonstrated to be effective and of high quality in many studies. For instance, Tekumalla and Banda (2022) curated a silver-standard dataset (samples of raw data of good enough quality to be trained on, collected and cleaned using weak supervision heuristics) for natural disasters. Similarly, Painter et al. (2022) utilized weak supervision to create a silver-standard sarcasm-annotated dataset (S3D) containing over 100,000 tweets. This approach holds great promise for expanding the availability of labeled datasets, facilitating the development of more accurate and robust machine learning models.
Our research focuses on curating Indonesian conservation datasets using a weak supervision framework adapted from Snorkel's works (Ratner et al., 2016, 2020; Alexander et al., 2022). To make the process as user-friendly as possible, we have developed interactive weak supervision tools, known as Data Programming, in the Datasaur workspace. Our Data Programming is integrated with a simple code editor in the workspace, which allows users to create labeling functions interactively (Figure 1). We also provide a Python labeling function template, as detailed in Appendix A.1. The predictions generated by each labeling function are processed in the background by a label model (Ratner et al., 2019).
Our Data Programming returns two types of results: probability outputs, which are used as soft labels in the fine-tuning process, and hard-label predictions, which can be reviewed and revised by human annotators directly in our workspace. In this work, we use the probability outputs for our training set, while the hard-label predictions, reviewed and revised by annotators, form the golden set used for validation and testing.
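As a rough illustration of this template style (the exact template in Datasaur's Data Programming may differ; see Appendix A.1), a simple keyword-based labeling function could look like the sketch below, with hypothetical keywords and tag:

```python
# Hedged sketch in the spirit of the Appendix A.1 template; the exact shape of
# Datasaur's template may differ. Keywords and the tag are hypothetical.
TARGET_KEYWORDS = ["sawit", "deforestasi", "kebakaran hutan"]  # illustrative keywords
LABELS = ["deforestasi"]  # tag(s) voted when a keyword matches
ABSTAIN = None            # convention: no vote when no keyword is found

def label_function(text):
    """Vote the tag(s) in LABELS if any target keyword occurs in the text."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in TARGET_KEYWORDS):
        return LABELS
    return ABSTAIN

print(label_function("Deforestasi akibat ekspansi sawit terus meluas."))
```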

Dataset Source
We collected articles from the Indonesian conservation news collection Mongabay, covering the 2012-2023 period. The raw dataset was sampled from either the first or last 100 articles of each year. These articles were then segmented into multiple chunks, each containing a maximum of 512 tokens. The format of each data point is as follows: {title} ; {chunked-article}. This process resulted in a total of 4,896 chunked articles, split into 3,919, 492, and 485 chunked articles for the training, validation, and test sets respectively.
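The sketch below illustrates one possible implementation of this chunking step; the tokenizer choice and helper name are assumptions, not the authors' exact setup.

```python
from transformers import AutoTokenizer

# Minimal chunking sketch under our reading of this section: articles are
# split into chunks of at most 512 tokens and each data point is formatted as
# "{title} ; {chunked-article}". The tokenizer is an assumption.
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")

def chunk_article(title, article, max_tokens=512):
    """Split one article into '{title} ; {chunk}' data points."""
    token_ids = tokenizer.encode(article, add_special_tokens=False)
    data_points = []
    for start in range(0, len(token_ids), max_tokens):
        chunk = tokenizer.decode(token_ids[start:start + max_tokens])
        data_points.append(f"{title} ; {chunk}")
    return data_points

print(chunk_article("Judul artikel", "Isi artikel tentang konservasi ... " * 200)[:2])
```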

Task
We categorized the scraped dataset into two primary tasks: multi-label classification and sentiment classification. The multi-label classification task aims to capture the distribution of hashtags within the dataset, while the sentiment classification task gauges the sentiment of the authors as embedded in the articles. We constructed the sentiment classification task on top of the hashtag distribution dataset, as previously done in related work (Devi et al., 2019; Davidov et al., 2010).
We defined 31 classes for the hashtag classification task and 3 classes for the sentiment classification task. The 31 tags were collected by our labelers through an internal analysis of popular environmental and news topics among Indonesian citizens, as shown in Figure 2. We acknowledge that this approach heavily depends on the labelers' knowledge and personal experience in the field of conservation news and environmental topics. The subsequent section (Section 3.3) offers more comprehensive insights into the dataset construction.

Dataset Construction
As mentioned in Section 3, our Data Programming generates hard labels and probability labels. In this work, our strategy involves using Data Programming to label the entire dataset, including the training, validation, and test sets. We use the probability outputs as our training set, while our validation and test sets use hard-label outputs that have been reviewed and revised by human annotators, forming our golden set. When reviewing the hard-label predictions, our labelers follow these simple guidelines:

• Hashtag Classification: A hashtag is assigned to a chunked article if it is discussed in the text, even if it is mentioned only as a side effect.

• Sentiment Classification: Chunked articles are labeled as follows: 1) negative if they mention any conflict or victims; 2) neutral if there is no discernible sentiment tone, the article is purely descriptive, or it contains both a conflict and its resolution; 3) positive if the article discusses trivia topics or initiatives aimed at solving environmental issues.
The dataset construction processes for both hashtag classification and sentiment classification involve three key steps: labeling function construction, labeling function analysis, and label model and final prediction.

Labeling Function Construction
The labeling functions used in this study are based on keywords collected from the labelers' perspective. In certain cases, additional rules and logic were added to augment the labeling functions. For the hashtag classification task, labelers gathered relevant keywords for each collected hashtag. For the sentiment classification task, the labeling functions relied on the aggregated tags of each article, which correspond to positive, neutral, or negative labels.
For the sentiment classification task, we developed two versions of labeling functions: the default version, v0, which was used in the main experiment, and v1, which has more specific logic. The detailed methodology for building the labeling functions for the tags classification and sentiment classification tasks is presented in Appendix A.1.
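To illustrate the idea behind a v0-style sentiment labeling function, the following hedged sketch derives a sentiment vote from the hashtag groups an article's tags fall into; the tag groupings and function name are ours, not the authors' exact lists:

```python
# Hedged sketch of a tag-aggregation sentiment labeling function.
# The tag groups below are illustrative assumptions.
NEGATIVE_TAGS = {"konflik", "korupsi", "kebakaran-hutan"}
POSITIVE_TAGS = {"go-green", "konservasi", "trivia"}

def sentiment_lf(article_tags):
    """Vote a sentiment based on the aggregated tags of one chunked article."""
    n_neg = len(set(article_tags) & NEGATIVE_TAGS)
    n_pos = len(set(article_tags) & POSITIVE_TAGS)
    if n_neg > n_pos:
        return "negative"
    if n_pos > n_neg:
        return "positive"
    return "neutral"  # ties or purely descriptive articles

print(sentiment_lf(["go-green", "konflik", "korupsi"]))  # -> "negative"
```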

Labeling Function Analysis
To evaluate the performance of the sentiment classification labeling functions, we used the coverage, overlap, and conflict statistics defined in Ratner et al. (2016). For tags classification, however, we only utilize the coverage score, which represents the density of tags in each article, as metrics such as overlap and conflict did not adequately reflect the quality of the labeling functions. As discussed in Section 3.3.1, we developed v1 labeling functions with more specific logic for the sentiment classification task. This resulted in a higher level of independence among the labeling functions, leading to lower coverage and conflict scores (Table 1). Notably, the v1 labeling functions exhibited a significantly smaller conflict/coverage ratio (2.9%) compared to v0 (32%). As highlighted in Ratner et al. (2016), the statistical performance of the labeling functions directly impacts the quality of the final label prediction and the performance of the fine-tuned models. Hence, we conducted experiments analyzing the influence of the two labeling function versions (v0, prioritizing coverage; v1, prioritizing low conflict) on the sentiment classification task, as discussed in Section 4.4.1. Additionally, the constructed dataset, as well as the raw dataset, can be accessed at https://huggingface.co/datasets/Datasaur/mongabay-experiment.
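For reference, these statistics can be computed directly from the labeling-function vote matrix. Below is a minimal sketch following the coverage/overlap/conflict definitions in Ratner et al. (2016); the function name and ABSTAIN convention are ours.

```python
import numpy as np

ABSTAIN = -1

def lf_statistics(L):
    """Coverage, overlap, and conflict per labeling function, following
    Ratner et al. (2016). L is (n_samples x n_lfs) with class indices
    or ABSTAIN."""
    n, m = L.shape
    labeled = L != ABSTAIN
    stats = {}
    for j in range(m):
        fired = labeled[:, j]
        others_fired = np.delete(labeled, j, axis=1)
        other_votes = np.delete(L, j, axis=1)
        overlap = fired & others_fired.any(axis=1)
        # conflict: another LF fired with a *different* label on the same sample
        disagree = ((other_votes != L[:, [j]]) & (other_votes != ABSTAIN)).any(axis=1)
        stats[j] = {
            "coverage": fired.mean(),
            "overlap": overlap.mean(),
            "conflict": (fired & disagree).mean(),
        }
    return stats

L = np.array([[0, 0, 1],
              [ABSTAIN, 1, 1],
              [0, ABSTAIN, ABSTAIN]])
print(lf_statistics(L))
```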

Pipelines
The experiment pipelines have two main goals: 1) to compare, for each task, the performance of the datasets generated with the Covariance Matrix and the Majority Voter label models, and 2) to assess the performance of BERT pre-trained models with different language bases when fine-tuned on the weak dataset. This comparison enables an evaluation of the effectiveness of different weak supervision approaches for various NLP tasks. To accomplish these objectives, the experiment pipeline is structured as follows:

• Each variation of the pre-trained models will be trained using the soft-label dataset (both the Covariance Matrix and the Majority Voter versions).

• The models will be iteratively evaluated on the gold-label validation set at each epoch.

• The weight configuration yielding the best validation metrics will be tested on the gold-label test set.
This pipeline was executed for both the tags classification and sentiment classification tasks. We fine-tuned three pre-trained models: indobert-base-uncased, bert-base-multilingual-cased, and bert-base-cased. These models let us assess the performance of our weakly labeled dataset when learned by three distinct model types: 1) Indonesian monolingual (pre-trained on the same language as our data), 2) multilingual (pre-trained on multiple languages, including the language of our data), and 3) English monolingual (pre-trained on a language different from our data). In fine-tuning the hashtag classification model, we utilized cross-entropy loss instead of binary cross-entropy loss; however, in the inference session, we kept using binary cross-entropy loss. This decision is based on the soft-label distribution of the training set obtained from the weak supervision process, which is not binary for each class, in contrast to the binary distribution of the gold labels in the validation and test sets. We used the hyperparameter setup outlined in Table 7 for fine-tuning each task.
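A minimal sketch of this loss setup, as we read it, is shown below; the shapes and the exact loss formulation are illustrative assumptions rather than the authors' training code.

```python
import torch
import torch.nn.functional as F

# Sketch: training compares model logits against the label model's soft
# (probability) labels with a cross-entropy, while evaluation scores binary
# gold labels with BCE. Shapes are illustrative.
def soft_label_cross_entropy(logits, soft_targets):
    """Cross-entropy between predicted log-probabilities and soft labels."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

logits = torch.randn(8, 31)                     # batch of 8, 31 hashtag classes
soft = torch.softmax(torch.randn(8, 31), -1)    # weak-supervision soft labels
gold = torch.randint(0, 2, (8, 31)).float()     # binary gold labels (val/test)

train_loss = soft_label_cross_entropy(logits, soft)
eval_loss = F.binary_cross_entropy_with_logits(logits, gold)
print(train_loss.item(), eval_loss.item())
```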

Analysis
The experiment results were analyzed from two key perspectives: 1) label model comparison, and 2) dataset learnability. From the dataset learnability perspective, the IndoBERT pre-trained model learned the weakly labeled data well, showing significantly better performance than the multilingual BERT (mBERT) and BERT-base models, as evident in the results presented in Table 3, Figure 4, and Table 5. In the context of label model comparison, the results vary between the multi-label and sentiment classification tasks. For sentiment classification (Table 3), the Covariance Matrix (CM) approximates human judgment (the gold labels) more accurately. In contrast, the Majority Voter (MV) predictions correlate more closely with the gold labels in multi-label classification (Table 5).

Sentiment Classification
The highest performance, reaching 60% accuracy and a 56% F1-score (macro average), was achieved by combining the Covariance Matrix label model with the IndoBERT pre-trained model, as shown in Table 3. Detailed per-label sentiment classification F1-scores (Appendix A.2.1) reveal that our fine-tuned models (IndoBERT and multilingual BERT) tend to predict negative articles accurately, with F1-scores exceeding 70%. This observation aligns with our dataset's characteristics (Figure 3), where negative articles are prominently distributed. However, IndoBERT faces challenges when predicting articles in the positive or neutral class, with F1-scores hovering around 40%. This could be attributed to the similarity between positive- and neutral-related tags (Figure 3). IndoBERT also exhibits a faster learnability rate compared to the other pre-trained models, as supported by the loss graph. A supplementary experiment explores how the quality of the labeling functions affects model performance. As shown in Table 4, despite LFs v0 having a ten times higher conflict proportion than LFs v1, the v0 labeling functions significantly outperform v1 in the fine-tuned models. This suggests that higher coverage and overlap contribute to enhanced performance in fine-tuned models, even in the presence of increased conflict. This finding is consistent with the statement in Ratner et al. (2020) that higher coverage results in higher accuracy.

Tags Classification
The tags classification results are reported as micro-average F1-score and ROC-AUC due to the imbalanced distribution of tags in our dataset (Figure 3). As stated in Section 4.4, the Majority Voter (MV) consistently outperforms the Covariance Matrix (CM) by approximately 2-3%. This aligns with the findings in Zhang et al. (2021), which reported that Majority Voter label models achieve superior performance when dealing with sparse labels in tags classification. The highest performance is observed when using the Majority Voter as the label model and IndoBERT for fine-tuning, resulting in a test performance of 81.89% ROC-AUC, 65.71% macro-average F1-score, and 69.9% micro-average F1-score.
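For reference, the reported multi-label metrics can be computed with scikit-learn as in the following sketch; the random data, threshold, and averaging choices are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Illustrative computation of the reported multi-label metrics.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(485, 31))    # gold binary tags (test-set size)
y_score = rng.random((485, 31))                # model probabilities per tag
y_pred = (y_score >= 0.5).astype(int)          # thresholded predictions

print("F1-macro:", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("F1-micro:", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("ROC-AUC :", roc_auc_score(y_true, y_score, average="macro"))
```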
A more detailed analysis of the per-label tags classification F1-scores is provided in Appendix A.2.1. Tags associated with negative sentiment achieve F1-scores exceeding 70% with either IndoBERT or multilingual BERT. In contrast, tags primarily composed of non-negative sentiment classes do not perform as well. This result aligns with the sentiment classification results. Since we use different losses during the training and inference sessions, we cannot compare the train and eval losses in a single frame (Section 4.3).

Limitations
Our dataset construction using weak supervision relies heavily on our labelers' subjectivity, particularly in collecting hashtags and creating labeling functions. These functions are designed to match the unique characteristics of our dataset, and their effectiveness may not extend well to other datasets, even ones with similar characteristics. The low results in our experiments are attributed to biases within the dataset. However, these biases provide insights into the Indonesian environmental and conservation editorial landscape, albeit presenting challenges for future efforts to create more equitable datasets.
Although our Data Programming is primarily designed for single- and multi-class classification, in this work it could be utilized for multi-label classification thanks to the one-hot-encoded output format. However, we have not yet implemented any metrics for evaluating labeling function performance beyond coverage, which represents the dataset's labeling density.

Conclusion
In conclusion, we have presented our weak supervision pipeline for creating two datasets: sentiment classification and multi-label classification. We utilized the Mongabay conservation article collection to construct our datasets and adapted Snorkel's framework within Datasaur's workspace. Our results show that using Data Programming to curate datasets can deliver datasets of sufficient quality to be learnable by BERT pre-trained models, especially IndoBERT. However, some limitations remain, such as labeling function subjectivity, incomplete labeling function metrics for multi-label classification, and the implicit bias within this dataset. Future work will curate more datasets more robustly and reproducibly, especially NLP datasets for underrepresented languages, topics, or tasks.

Ethics Statement
We acknowledge the presence of implicit biases in both our dataset source and the constructed dataset. Additionally, as we utilized news data, it may contain certain viewpoints of editors or journalists. While we have ensured that the dataset is free from harmful or offensive content, biases may still exist in our model and results. As a news dataset, it contains formal Indonesian-native content.

Figure 1: Integrated labeling function editor in the Datasaur workspace.

Figure 2: The representation of the 31 tags in our dataset; the yellow boxes are our tag labels and the green box is a special class because such articles consist of many different topics.

Figure 3: Distribution of the validation and test set gold labels for tags classification (left) and sentiment classification (right). A detailed explanation of the tag definitions is provided in Appendix A.3.

Table 1: The performance of the labeling functions varies for each task (sentiment and tags classification). From the gold-label distribution (Figure 3), it can be inferred that the sampled Mongabay articles are biased towards negative sentiment, which is distributed across all tags. Referring to Figure 2, all tags from the conflict group are predominantly associated with negative articles, while a few tags from the trivia and solution groups are more common in positive articles, and other tags from the conflict and solution groups mostly appear in neutral articles. It is worth noting that each article can match more than one tag, given the variety of the authors' writing styles. For example, one article with negative sentiment can carry tags such as go-green, konflik, korupsi, and LSM, indicating that the article discusses conflicts in go-green actions/regulations, highlights corruption issues, and mentions the involvement of LSM (NGOs). The format of the experimental dataset for training, validation, and testing can be seen in Table 2.

Table 2: Snippet of the constructed dataset (top to bottom): 1) train set sample with 31 probabilities as soft labels for the tags classification experiment; 2) train set sample with 3 probabilities as soft labels for the sentiment classification experiment; 3) validation set sample with 31 tags as binary labels for the tags classification experiment; 4) validation set sample

Table 3: Validation and test results of the sentiment classification experiment using labeling functions v0. The performance was obtained from the model with the best validation score. CM: Covariance Matrix as label model; MV: Majority Voter as label model.

Table 4: Validation and test results of the sentiment classification labeling-function-variation experiment using IndoBERT and the Covariance Matrix. High coverage (v0), even with high conflict, yields better performance than the more precise and accurate labeling functions (v1).

Table 5: Validation and test results of the tags classification experiment. The performance was obtained from the model with the best validation score. CM: Covariance Matrix as label model; MV: Majority Voter as label model. R/A: ROC-AUC; F1-ma: macro-average F1-score; F1-mi: micro-average F1-score.

A.1 Labeling Function Construction

Our labeling functions were supported by external Python libraries such as SpaCy, NLTK, TextBlob, and Stanza. We standardized the labeling functions' code into a single template, so that a variety of Python algorithms can be implemented under label_function. For the simplest labeling functions, we only insert keywords into TARGET_KEYWORDS and add additional code/rules below LABELS when required.

• The simplest labeling functions consist of rules that return tag labels related to obvious keywords.