Charibeth Cheng


2024

Language Identification of Philippine Creole Spanish: Discriminating Chavacano From Related Languages
Aileen Joan Vicente | Charibeth Cheng
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

Chavacano is a Spanish Creole widely spoken in the southern regions of the Philippines. It is one of the many Philippine languages yet to be studied computationally. This paper presents the development of a language identification model for Chavacano that distinguishes it from the languages that influenced its creolization, using character convolutional networks. Unlike studies that discriminate similar languages based on geographical proximity, this paper examines similarity arising from the creolization of a language. We established the similarity of Chavacano and its related languages, Spanish, Portuguese, Cebuano, and Hiligaynon, from the number of common words in the corpus for all languages. We report an accuracy of 93% for the model generated using ten filters with a filter width of 5. The training experiments reveal that increasing the filter width, number of filters, or training epochs is unnecessary even if the accuracy increases, because the generated models exhibit irregular learning behavior or may already be overfitted. This study also demonstrates that the character features extracted by convolutional neural networks, similar to n-grams, are sufficient for identifying Chavacano. Future work on the language identification of Chavacano includes improving classification accuracy for short or code-switched texts for practical applications such as social media sensors for disaster response and management.

2023

Practical Approaches for Low-Resource Named Entity Recognition of Filipino Telecommunications Domain
Kyle Chan | Kaye Ann De Las Alas | Charles Orcena | Dan John Velasco | Qyle John San Juan | Charibeth Cheng
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Towards Automatic Construction of Filipino WordNet: Word Sense Induction and Synset Induction Using Sentence Embeddings
Dan John Velasco | Axel Alba | Trisha Gail Pelagio | Bryce Anthony Ramirez | Jan Christian Blaise Cruz | Unisse Chua | Briane Paul Samson | Charibeth Cheng
Proceedings of the First Workshop in South East Asian Language Processing

Balarila: Deep Learning for Semantic Grammar Error Correction in Low-Resource Settings
Andre Dominic H. Ponce | Joshue Salvador A. Jadie | Paolo Edni Andryn Espiritu | Charibeth Cheng
Proceedings of the First Workshop in South East Asian Language Processing

2022

Improving Large-scale Language Models and Resources for Filipino
Jan Christian Blaise Cruz | Charibeth Cheng
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we improve on existing language resources for the low-resource Filipino language in two ways. First, we outline the construction of the TLUnified dataset, a large-scale pretraining corpus that serves as an improvement over smaller existing pretraining datasets for the language in terms of scale and topic variety. Second, we pretrain new Transformer language models following the RoBERTa pretraining technique to supplant existing models trained with small corpora. Our new RoBERTa models show significant improvements over existing Filipino models in three benchmark datasets with an average gain of 4.47% test accuracy across three classification tasks with varying difficulty.

2020

Bridging Philippine Languages With Multilingual Neural Machine Translation
Renz Iver Baliber | Charibeth Cheng | Kristine Mae Adlaon | Virgion Mamonong
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

The Philippines is home to more than 150 languages that are considered low-resource, even among its major languages. This results in a lack of effort toward developing translation systems for the underrepresented languages. To simplify the process of developing translation systems for multiple languages, and to help improve the translation quality of zero- to low-resource languages, multilingual NMT has become an active area of research. However, existing work in multilingual NMT disregards the analysis of a multilingual model on a closely related, low-resource language group in the context of pivot-based translation and zero-shot translation. In this paper, we benchmark translation for several Philippine languages and analyze a multilingual NMT system for morphologically rich, low-resource languages in terms of its effectiveness in translating zero-resource languages with zero-shot translation. To further evaluate the capability of the multilingual NMT model to translate language pairs unseen during training, we tested the model on translation between Tagalog and Cebuano and compared its performance with a simple NMT model trained directly on parallel Tagalog-Cebuano data. We show that zero-shot translation outperforms the directly trained model in some instances, while using English as a pivot language outperforms both approaches.

Localization of Fake News Detection via Multitask Transfer Learning
Jan Christian Blaise Cruz | Julianne Agatha Tan | Charibeth Cheng
Proceedings of the Twelfth Language Resources and Evaluation Conference

The use of the internet as a fast medium of spreading fake news reinforces the need for computational tools that combat it. Techniques that train fake news classifiers exist, but they all assume an abundance of resources including large labeled datasets and expert-curated corpora, which low-resource languages may not have. In this work, we make two main contributions: First, we alleviate resource scarcity by constructing the first expertly-curated benchmark dataset for fake news detection in Filipino, which we call “Fake News Filipino.” Second, we benchmark Transfer Learning (TL) techniques and show that they can be used to train robust fake news classifiers from little data, achieving 91% accuracy on our fake news dataset, reducing the error by 14% compared to established few-shot baselines. Furthermore, lifting ideas from multitask learning, we show that augmenting transformer-based transfer techniques with auxiliary language modeling losses improves their performance by adapting to writing style. Using this, we improve TL performance by 4-6%, achieving an accuracy of 96% on our best model. Lastly, we show that our method generalizes well to different types of news articles, including political news, entertainment news, and opinion articles.

2019

Annotation Process for the Dialog Act Classification of a Taglish E-commerce Q&A Corpus
Jared Rivera | Jan Caleb Oliver Pensica | Jolene Valenzuela | Alfonso Secuya | Charibeth Cheng
Proceedings of the Second Workshop on Economics and Natural Language Processing

With conversational agents or chatbots compensating in quantity of replies rather than quality, the need to identify user intent has become a main concern in improving these agents. Dialog act (DA) classification tackles this concern, and while existing studies have already addressed DA classification in general contexts, no training corpora in the context of e-commerce are available to the public. This research addressed that insufficiency by building a text-based corpus of 7,265 posts from the question and answer section of products on Lazada Philippines. The SWBD-DAMSL tagset for DA classification was modified to 28 tags fitting the categories applicable to e-commerce conversations. The posts were annotated manually by three human annotators, and preprocessing techniques decreased the vocabulary size from 6,340 to 1,134. After analysis, the corpus was composed predominantly of single-label posts, with 34% of the corpus having multiple intent tags. The annotated corpus allowed insights into the structure of posts created with single to multiple intents.

2018

Modeling Personality Traits of Filipino Twitter Users
Edward Tighe | Charibeth Cheng
Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media

Recent studies in the field of text-based personality recognition experiment with different languages, feature extraction techniques, and machine learning algorithms to create better and more accurate models; however, little focus is placed on exploring the language use of a group of individuals defined by nationality. Individuals of the same nationality share certain practices and communicate certain ideas that can become embedded in their natural language. Many nationals are also not limited to speaking just one language, such as how Filipinos speak Filipino and English, the two national languages of the Philippines. The addition of several regional/indigenous languages, along with the commonness of code-switching, allows a Filipino to have a rich vocabulary. This presents an opportunity to create a text-based personality model based on how Filipinos speak, regardless of the language they use. To do so, data was collected from 250 Filipino Twitter users. Different combinations of data processing techniques were experimented with to create personality models for each of the Big Five. The results for both regression and classification show that Conscientiousness is consistently the easiest trait to model, followed by Extraversion. Classification models for Agreeableness and Neuroticism had subpar performance, but performed better than those for Openness. An analysis of personality trait score representation showed that classifying extreme outliers generally produces better results for all traits except Neuroticism and Openness.

2009

Philippine Language Resources: Trends and Directions
Rachel Edita Roxas | Charibeth Cheng | Nathalie Rose Lim
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)

2008

Natural Language Database Interface for the Community Based Monitoring System
Krissanne Kaye Garcia | Ma. Angelica Lumain | Jose Antonio Wong | Jhovee Gerard Yap | Charibeth Cheng
Proceedings of the 22nd Pacific Asia Conference on Language, Information and Computation