Jaihyun Park

2025

A Data-driven Investigation of Euphemistic Language: Comparing the usage of “slave” and “servant” in 19th century US newspapers
Jaihyun Park | Ryan Cordell
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Warning: This paper contains examples of offensive language targeting marginalized populations. This study investigates the usage of “slave” and “servant” in 19th century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we included possible OCR errors by using FastText embedding and excluded text reprints to consider text reprint culture in the 19th century. Word2vec embedding was used to find semantically close words to “slave” and “servant” and log-odds ratio was calculated to identify over-represented discourse words in the Southern and Northern newspapers. We found that “slave” is associated with socio-economic, legal, and administrative words, however, “servant” is linked to religious words in the Northern newspapers while Southern newspapers associated “servant” with domestic and familial words. We further found that slave discourse words in Southern newspapers are more prevalent in Northern newspapers while servant discourse words from each side are prevalent in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th century US.

2023

pdf bib abs

A Quantitative Discourse Analysis of Asian Workers in the US Historical Newspapers
Jaihyun Park | Ryan Cordell
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

The digitization of historical texts invites researchers to explore the large-scale corpus of historical texts with computational methods. In this study, we present computational text analysis on a relatively understudied topic of how Asian workers are represented in historical newspapers in the United States. We found that the word “coolie” was semantically different in some States (e.g., Massachusetts, Rhode Island, Wyoming, Oklahoma, and Arkansas) with the different discourses around coolie. We also found that then-Confederate newspapers and then-Union newspapers formed distinctive discourses by measuring over-represented words. Newspapers from then-Confederate States associated coolie with slavery-related words. In addition, we found Asians were perceived to be inferior to European immigrants and subjected to the target of racism. This study contributes to supplementing the qualitative analysis of racism in the United States with quantitative discourse analysis.

pdf bib abs

Understanding Gender Stereotypes in Video Game Character Designs: A Case Study of Honor of Kings
Bingqing Liu | Kyrie Zhixuan Zhou | Danlei Zhu | Jaihyun Park
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

In this paper, we conduct a comprehensive analysis of gender stereotypes in the character design in Honor of Kings, a popular MOBA game in China. We probe gender stereotypes through the lens of role assignments, visual designs, lines, and background stories, combining qualitative analysis and text mining based on moral foundations. Male heroes are commonly designed as masculine fighters with power, and female heroes are designed as feminine “ornaments” with ideal looks. We contribute with a multi-modal dataset for understanding gender bias in games and a moral-, visual-, and role-based inspection of gender.

2022

pdf bib abs

Raison d’être of the benchmark dataset: A Survey of Current Practices of Benchmark Dataset Sharing Platforms
Jaihyun Park | Sullam Jeoung
Proceedings of NLP Power! The First Workshop on Efficient Benchmarking in NLP

This paper critically examines the current practices of benchmark dataset sharing in NLP and suggests a better way to inform reusers of the benchmark dataset. As the dataset sharing platform plays a key role not only in distributing the dataset but also in informing the potential reusers about the dataset, we believe data-sharing platforms should provide a comprehensive context of the datasets. We survey four benchmark dataset sharing platforms: HuggingFace, PaperswithCode, Tensorflow, and Pytorch to diagnose the current practices of how the dataset is shared which metadata is shared and omitted. To be specific, drawing on the concept of data curation which considers the future reuse when the data is made public, we advance the direction that benchmark dataset sharing platforms should take into consideration. We identify that four benchmark platforms have different practices of using metadata and there is a lack of consensus on what social impact metadata is. We believe the problem of missing a discussion around social impact in the dataset sharing platforms has to do with the failed agreement on who should be in charge. We propose that the benchmark dataset should develop social impact metadata and data curator should take a role in managing the social impact metadata.

Co-authors

Venues

Fix author