Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Mika Hämäläinen, Emily Öhman, Flammie Pirinen, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni, Niko Partanen, Jack Rueter (Editors)


Anthology ID:
2023.nlp4dh-1
Month:
December
Year:
2023
Address:
Tokyo, Japan
Venues:
NLP4DH | IWCLUL
SIG:
SIGUR
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2023.nlp4dh-1/
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://aclanthology.org/2023.nlp4dh-1.pdf

pdf bib
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages
Mika Hämäläinen | Emily Öhman | Flammie Pirinen | Khalid Alnajjar | So Miyagawa | Yuri Bizzoni | Niko Partanen | Jack Rueter

pdf bib
Emotion-based Morality in Tagalog and English Scenarios (EMoTES-3K): A Parallel Corpus for Explaining (Im)morality of Actions
Jasper Kyle Catapang | Moses Visperas

Grasping morality is vital in AI systems, particularly as they become more prevalent in human-focused applications. Yet, research is scarce on this topic. This study presents the Emotion-based Morality in Tagalog and English Scenarios (EMoTES-3K), a collection that shows commonsense morality in both Filipino and English. This dataset is instrumental for analyzing moral decisions in various situations and their justifications. Our tests show that EMoTES-3K is effective for moral text categorization, with the fine-tuned RoBERTa model scoring 94.95% accuracy in English and 88.53% in Filipino. The dataset also excels in text generation tasks, as shown by fine-tuning the FLAN-T5 model to produce clear moral explanations. However, the model faces challenges when dealing with actions that have mixed moral implications. This work not only bridges the gap in moral reasoning datasets for languages like Filipino but also sets the stage for future research in commonsense moral reasoning in artificial intelligence.

pdf bib
A Quantitative Discourse Analysis of Asian Workers in the US Historical Newspapers
Jaihyun Park | Ryan Cordell

The digitization of historical texts invites researchers to explore the large-scale corpus of historical texts with computational methods. In this study, we present computational text analysis on a relatively understudied topic of how Asian workers are represented in historical newspapers in the United States. We found that the word “coolie” was semantically different in some States (e.g., Massachusetts, Rhode Island, Wyoming, Oklahoma, and Arkansas) with the different discourses around coolie. We also found that then-Confederate newspapers and then-Union newspapers formed distinctive discourses by measuring over-represented words. Newspapers from then-Confederate States associated coolie with slavery-related words. In addition, we found Asians were perceived to be inferior to European immigrants and subjected to the target of racism. This study contributes to supplementing the qualitative analysis of racism in the United States with quantitative discourse analysis.

pdf bib
Revisiting Authorship Attribution of Tirant lo Blanc Using Parts of Speech n-grams
Yoshifumi Kawasaki

Tirant lo Blanc (TLB) is a masterpiece of medieval Catalan chivalric romance. Regarding its authorship, two hypotheses exist: the single-authorship hypothesis claims in agreement with the dedication that Joanot Martorell is the sole author, whereas the dual-authorship hypothesis alleges in line with the colophon that Martorell wrote the first three parts and Martí Joan de Galba added the fourth part. In this study, we revisit the unsettled authorship attribution of TLB with stylometric techniques; specifically, we exploit parts-of-speech (POS) n-grams as stylistic features to investigate stylistic differences (if any) across the work. Furthermore, we address the distinction between narration and conversation, which has previously been omitted. We performed exploratory multivariate analyses and demonstrated that, despite internal differences, single-authorship is more likely from a statistical point of view. If Galba had contributed something to the last quarter of the work, it would have been minimal.

pdf bib
Translation from Historical to Contemporary Japanese Using Japanese T5
Hisao Usui | Kanako Komiya

This paper presents machine translation from historical Japanese to contemporary Japanese using a Text-to-Text Transfer Transformer (T5). The result of the previous study that used neural machine translation (NMT), Long Short Term Memory (LSTM), could not outperform that of the work that used statistical machine translation (SMT). Because an NMT model tends to require more training data than an SMT model, the lack of parallel data of historical and contemporary Japanese could be the reason. Therefore, we used Japanese T5, a kind of large language model to compensate for the lack of data. Our experiments show that the translation with T5 is slightly lower than SMT. In addition, we added the title of the literature book from which the example sentence was extracted at the beginning of the input. Japanese historical corpus consists of a variety of texts ranging in periods when the texts were written and the writing styles. Therefore, we expected that the title gives information about the period and style, to the translation model. Additional experiments revealed that, with title information, the translation from historical Japanese to contemporary Japanese with T5 surpassed that with SMT.

pdf bib
Measuring the distribution of Hume’s Scotticisms in the ECCO collection
Iiro Tiihonen | Aatu Liimatta | Lidia Pivovarova | Tanja Säily | Mikko Tolonen

This short paper studies the distribution of Scotticisms from a list compiled by David Hume in a large collection of 18th century publications. We use regular expression search to find the items on the list in the ECCO collection, and then apply regression analysis to test whether the distribution of Scotticisms in works first published in Scotland is significantly different from the distribution of Scotticisms in works first published in England. We further refine our analysis to trace the influence of variables such as publication date, genre and author’s country of origin.

pdf bib
Effect of data quality on the automated identification of register features in Eighteenth Century Collections Online
Aatu Liimatta

Many large-scale investigations of textual data are based on the automated identification of various linguistic features. However, if the textual data is of lower quality, automated identification of linguistic features, particularly more complex ones, can be severely hampered. Data quality problems are particularly prominent with large datasets of historical text which have been made machine-readable using optical character recognition (OCR) technology, but it is unclear how much the identification of individual linguistic features is affected by the dirty OCR, and how features of varying complexity are influenced differently. In this paper, I analyze the effect of OCR quality on the automated identification of the set of linguistic features commonly used for multi-dimensional register analysis (MDA) by comparing their observed frequencies in the OCR-processed Eighteenth Century Collections Online (ECCO) and a clean baseline (ECCO-TCP). The results show that the identification of most features is disturbed more as the OCR quality decreases, but different features start degrading at different OCR quality levels and do so at different rates.

pdf bib
Automated Generation of Multiple-Choice Cloze Questions for Assessing English Vocabulary Using GPT-turbo 3.5
Qiao Wang | Ralph Rose | Naho Orita | Ayaka Sugawara

A common way of assessing language learners’ mastery of vocabulary is via multiple-choice cloze (i.e., fill-in-the-blank) questions. But the creation of test items can be laborious for individual teachers or in large-scale language programs. In this paper, we evaluate a new method for automatically generating these types of questions using large language models (LLM). The VocaTT (vocabulary teaching and training) engine is written in Python and comprises three basic steps: pre-processing target word lists, generating sentences and candidate word options using GPT, and finally selecting suitable word options. To test the efficiency of this system, 60 questions were generated targeting academic words. The generated items were reviewed by expert reviewers who judged the well-formedness of the sentences and word options, adding comments to items judged not well-formed. Results showed a 75% rate of well-formedness for sentences and 66.85% rate for suitable word options. This is a marked improvement over the generator used earlier in our research which did not take advantage of GPT’s capabilities. Post-hoc qualitative analysis reveals several points for improvement in future work including cross-referencing part-of-speech tagging, better sentence validation, and improving GPT prompts.

pdf bib
Explicit References to Social Values in Fairy Tales: A Comparison between Three European Cultures
Alba Morollon Diaz-Faes | Carla Murteira | Martin Ruskov

The study of social values in fairy tales opens the possibility to learn about the communication of values across space and time. We propose to study the communication of values in fairy tales from Portugal, Italy and Germany using a technique called word embedding with a compass to quantify vocabulary differences and commonalities. We study how these three national traditions of fairy tales differ in their explicit references to values. To do this, we specify a list of value-charged tokens, consider their word stems and analyse the distance between these in a bespoke pre-trained Word2Vec model. We triangulate and critically discuss the validity of the resulting hypotheses emerging from this quantitative model. Our claim is that this is a reusable and reproducible method for the study of the values explicitly referenced in historical corpora. Finally, our preliminary findings hint at a shared cultural understanding and the expression of values such as Benevolence, Conformity, and Universalism across European societies, suggesting the existence of a pan-European cultural memory.

pdf bib
The Stylometry of Maoism: Quantifying the Language of Mao Zedong
Maciej Kurzynski

Recent advances in computational stylometry have enabled scholars to detect authorial signals with a high degree of precision, but the focus on accuracy comes at the expense of explainability: powerful black-box models are often of little use to traditional humanistic disciplines. With this in mind, we have conducted stylometric experiments on Maospeak, a language style shaped by the writings and speeches of Mao Zedong. We measure per-token perplexity across different GPT models, compute Kullback–Leibler divergences between local and global vocabulary distributions, and train a TF-IDF classifier to examine how the modern Chinese language has been transformed to convey the tenets of Maoist doctrine. We offer a computational interpretation of ideology as reduction in perplexity and increase in systematicity of language use.

pdf bib
Efficient and reliable utilization of automated data collection applied to news on climate change
Erkki Mervaala | Jari Lyytimäki

Automated data collection provides tempting opportunities for social sciences and humanities studies. Abundant data accumulating in various digital archives allows more comprehensive, timely and cost-efficient ways of harvesting and processing information. While easing or even removing some of the key problems, such as laborious and time-consuming data collection and potential errors and biases related to subjective coding of materials and distortions caused by focus on small samples, automated methods also bring in new risks such as poor understanding of contexts of the data or non-recognition of underlying systematic errors or missing information. Results from testing different methods to collect data describing newspaper coverage of climate change in Finland emphasize that fully relying on automatable tools such as media scrapers has its limitations and can provide comprehensive but incomplete document acquisition for research. Many of these limitations can, however, be addressed and not all of them rely on manual control.

pdf bib
Unlocking Transitional Chinese: Word Segmentation in Modern Historical Texts
Baptiste Blouin | Hen-Hsen Huang | Christian Henriot | Cécile Armand

This research addresses Natural Language Processing (NLP) tokenization challenges for transitional Chinese, which lacks adequate digital resources. The project used a collection of articles from the Shenbao, a newspaper from this period, as their study base. They designed models tailored to transitional Chinese, with goals like historical information extraction, large-scale textual analysis, and creating new datasets for computational linguists. The team manually tokenized historical articles to understand the language’s linguistic patterns, syntactic structures, and lexical variations. They developed a custom model tailored to their dataset after evaluating various word segmentation tools. They also studied the impact of using pre-trained language models on historical data. The results showed that using language models aligned with the source languages resulted in superior performance. They assert that transitional Chinese they are processing is more related to ancient Chinese than contemporary Chinese, necessitating the training of language models specifically on their data. The study’s outcome is a model that achieves a performance of over 83% and an F-score that is 35% higher than using existing tokenization tools, signifying a substantial improvement. The availability of this new annotated dataset paves the way for refining the model’s performance in processing this type of data.

pdf bib
Introducing ChatGPT to a researcher’s toolkit: An empirical comparison between rule-based and large language model approach in the context of qualitative content analysis of political texts in Finnish
Ilona Kousa

Large Language Models, such as ChatGPT, offer numerous possibilities and prospects for academic research. However, there has been a gap in empirical research regarding their utilisation as keyword extraction and classification tools in qualitative research; perspectives from the social sciences and humanities have been notably limited. Moreover, Finnish-language data have not been used in previous studies. In this article, I aim to address these gaps by providing insights into the utilisation of ChatGPT and drawing comparisons with a rule-based Natural Language Processing method called Etuma. I will focus on assessing the effectiveness of classification and the methods’ adherence to scientific principles. The findings of the study indicate that the classic recall and precision trade-off applies to the methods: ChatGPT’s precision is high, but its recall is comparatively low, while the results are the opposite for Etuma. I also discuss the implications of the results and outline ideas for leveraging the strengths of both methods in future studies.

pdf bib
Fly, fly little Comet! Exploring Subtoken-Level Metaphorical Patterns in Finnish and Hungarian Texts. New Results from the FiHuComet Corpus.
Tímea Borbála Bajzát

The FiHuComet Corpus was created to address the gap in the lack of a systematic comparison of metaphor research in Finnish and Hungarian (Bajzát and Simon, 2023). This study aims to: (i) expand the existing quasi-parallel corpus; (ii) explore subtoken-level metaphorical patterns comparatively in the examined languages with rich morphology. The analysis employs a MIPVU-inspired protocol for metaphor identification, the MetaID protocol (Simon et al., 2023). Although this endeavor is not new, the comparative study conducted on a small-scale corpus has only revealed a few aspects of the potential of comparative metaphor analysis in the context of Finno-Ugric languages selected.

pdf bib
Machine Translation for Highly Low-Resource Language: A Case Study of Ainu, a Critically Endangered Indigenous Language in Northern Japan
So Miyagawa

This paper explores the potential of Machine Translation (MT) in preserving and revitalizing Ainu, an indigenous language of Japan classified as critically endangered by UNESCO. Through leveraging Marian MT, an open-source Neural Machine Translation framework, this study addresses the challenging linguistic features of Ainu and the limitations of available resources. The research implemented a meticulous methodology involving rigorous preprocessing of data, prudent training of the model, and robust evaluation using the SacreBLEU metric. The findings underscore the system’s efficacy, achieving a SacreBLEU score of 32.90 for Japanese to Ainu translation. This promising result highlights the capacity of MT systems to support language preservation and aligns with recent research emphasizing the potential of computational techniques for low-resource languages. The paper concludes by affirming the significant role of MT in the broader context of language preservation, serving as a crucial tool in the fight against language extinction. The study paves the way for future research to harness advanced MT techniques and develop more sophisticated models for endangered languages.

pdf bib
Understanding Gender Stereotypes in Video Game Character Designs: A Case Study of Honor of Kings
Bingqing Liu | Kyrie Zhixuan Zhou | Danlei Zhu | Jaihyun Park

In this paper, we conduct a comprehensive analysis of gender stereotypes in the character design in Honor of Kings, a popular MOBA game in China. We probe gender stereotypes through the lens of role assignments, visual designs, lines, and background stories, combining qualitative analysis and text mining based on moral foundations. Male heroes are commonly designed as masculine fighters with power, and female heroes are designed as feminine “ornaments” with ideal looks. We contribute with a multi-modal dataset for understanding gender bias in games and a moral-, visual-, and role-based inspection of gender.

pdf bib
The Great Digital Humanities Disconnect: The Failure of DH Publishing
Emily Öhman | Michael Piotrowski | Mika Hämäläinen

In this paper, we discuss the disconnect in interdisciplinary publishing from a disciplinary divide perspective as to how research is expected to be presented and published according to disciplinary conventions. We argue that this divide hinders interdisciplinary collaboration and even more so the dissemination of research results from interdisciplinary projects to other interdisciplinary researchers. The disconnect is not simply theoretical but also encompasses practical considerations such as manuscript creation standards. The disconnect can also be detrimental to academic careers in terms of evaluations by peers on funding and tenure committees as well as peer reviews. With this analysis, we want to foster further discussion about the state of academic publishing from a digital humanities perspective.

pdf bib
Explorative study on verbalizing students’ skills with NLP/AI-tool in Digital Living Lab at Laurea UAS, Finland
Asko Mononen

This explorative study tested Laurea UAS students’ (N=16) abilities to verbalize their skills, before and after the study unit “Digital Analytics and Consumer Insights”. Before the study unit the students listed their skills unaided and afterwards with help of Careerbot AI -service. The findings indicate that the intervention increased both quantity and quality of the skills verbalized, relevant to the learning objectives and generic, 21st century skills.

pdf bib
Combating Hallucination and Misinformation: Factual Information Generation with Tokenized Generative Transformer
Sourav Das | Sanjay Chatterji | Imon Mukherjee

Large language models have gained a meteoric rise recently. With the prominence of LLMs, hallucination and misinformation generation have become a severity too. To combat this issue, we propose a contextual topic modeling approach called Co-LDA for generative transformer. It is based on Latent Dirichlet Allocation and is designed for accurate sentence-level information generation. This method extracts cohesive topics from COVID-19 research literature, grouping them into relevant categories. These contextually rich topic words serve as masked tokens in our proposed Tokenized Generative Transformer, a modified Generative Pre-Trained Transformer for generating accurate information in any designated topics. Our approach addresses micro hallucination and incorrect information issues in experimentation with the LLMs. We also introduce a Perplexity-Similarity Score system to measure semantic similarity between generated and original documents, offering accuracy and authenticity for generated texts. Evaluation of benchmark datasets, including question answering, language understanding, and language similarity demonstrates the effectiveness of our text generation method, surpassing some state-of-the-art transformer models.

pdf bib
Statistical Measures for Readability Assessment
Mohammed Attia | Younes Samih | Yo Ehara

Neural models and deep learning techniques have predominantly been used in many tasks of natural language processing (NLP), including automatic readability assessment (ARA). They apply deep transfer learning and enjoy high accuracy. However, most of the models still cannot leverage long dependence such as inter-sentential topic-level or document-level information because of their structure and computational cost. Moreover, neural models usually have low interpretability. In this paper, we propose a generalization of passage-level, corpus-level, document-level and topic-level features. In our experiments, we show the effectiveness of “Statistical Lexical Spread (SLS)” features when combined with IDF (inverse document frequency) and TF-IDF (term frequency–inverse document frequency), which adds a topological perspective (inter-document) to readability to complement the typological approaches (intra-document) used in traditional readability formulas. Interestingly, simply adding these features in BERT models outperformed state-of-the-art systems trained on a large number of hand-crafted features derived from heavy linguistic processing. In analysis, we show that SLS is also easy-to-interpret because SLS computes lexical features, which appear explicitly in texts, compared to parameters in neural models.

pdf bib
A Question of Confidence: Using OCR Technology for Script analysis
Antonia Karaisl

The following article proposes a method employing the Tesseract OCR engine to aid palaeographic analysis and scribal identification. Repurposing the so-called confidence score provided by the OCR engine, different methods of visualization are used to surface differences between font families, script types and manuscript hands.

pdf bib
Emil.RuleZ! – An exploratory pilot study of handling a real-life longitudinal email archive
Balázs Indig | Luca Horváth | Dorottya Henrietta Szemigán | Mihály Nagy

An entire generation that predominantly used email for official communication throughout their lives is about to leave behind a significant amount of preservable digital heritage. Memory institutions in the USA (e.g. Internet Archive, Stanford University Library) recognised this endeavor of preservation early on, therefore, available solutions are focused on English language public archives, neglecting the problem of different languages with different encodings in a single archive and the heterogeneity of standards that have changed considerably since their first form in the 1970s. Since online services enable the convenient creation of email archives in MBOX format it is important to evaluate how existing tools handle non-homogeneous longitudinal archives containing diverse states of email standards, as opposed to often archived monolingual public mailing lists, and how such data can be made ready for research. We use distant reading methods on a real-life archive, the legacy of a deceased individual containing 11,245 emails from 2010 to 2023 in multiple languages and encodings, and demonstrate how existing available tools can be surpassed. Our goal is to enhance data homogeneity to make it accessible for researchers in a queryable database format. We utilise rule-based methods and GPT-3.5 to extract the cleanest form of our data.

pdf bib
Banning of ChatGPT from Educational Spaces: A Reddit Perspective
Nicole Miu Takagi

With the introduction of ChatGPT on November 30, 2022, the online sphere was disrupted seemingly overnight, with its ability to generate human-like text and comprehensively answer questions. It has even been lauded as being able to aid in the editing and generation of code. Some schools and online question-and-answer forums, however, have banned its use. In this paper, we use Reddit data to examine the impact that the banning of the AI tool has had online. Our findings indicate that reactions have ranged from skepticism that the ban will work, loss of educational opportunity, to agreement that ChatGPT is not 100 percent accurate in its answers. We postulate that while it may be better to ban it from Question-and-Answer forums, in physical classrooms, while ChatGPT may hinder students from finding their own solutions to problems, it also provides the opportunity for students to critically view answers provided to them by the chatbot, strengthening their digital literacy and critical thinking skills.

pdf bib
Girlbosses, The Red Pill, and the Anomie and Fatale of Gender Online: Analyzing Posts from r/SuicideWatch on Reddit
Elissa Nakajima Wickham

The proliferation of social media use in daily life has introduced a new practice in today’s society: posting about suicidal ideation or intent online. Recent trends in social media reflect a movement towards different forms of male and female empowerment that impact gender norms, and thus, may impact social categorization. This pilot study explores posts from r/SuicideWatch that include discussions of gender and its implications for online conceptions of social identity. We use computational methods borrowed from natural language processing to analyze this impact from a novel perspective rarely seen in sociology.

pdf bib
Bootstrapping Moksha-Erzya Neural Machine Translation from Rule-Based Apertium
Khalid Alnajjar | Mika Hämäläinen | Jack Rueter

Neural Machine Translation (NMT) has made significant strides in breaking down language barriers around the globe. For lesser-resourced languages like Moksha and Erzya, however, the development of robust NMT systems remains a challenge due to the scarcity of parallel corpora. This paper presents a novel approach to address this challenge by leveraging the existing rule-based machine translation system Apertium as a tool for synthetic data generation. We fine-tune NLLB-200 for Moksha-Erzya translation and obtain a BLEU of 0.73 on the Apertium generated data. On real world data, we got an improvement of 0.058 BLEU score over Apertium.

pdf bib
Comparing Transformer and Dictionary-based Sentiment Models for Literary Texts: Hemingway as a Case-study
Yuri Bizzoni | Pascale Feldkamp

The literary domain continues to pose a challenge for Sentiment Analysis methods, due to its particularly nuanced and layered nature. This paper explores the adequacy of different Sentiment Analysis tools - from dictionary-based approaches to state-of-the-art Transformers - for capturing valence and modelling sentiment arcs. We take Ernest Hemingway’s novel The Old Man and the Sea as a case study to address challenges inherent to literary language, compare Transformer and rule-based systems’ scores with human annotations, and shed light on the complexities of analyzing sentiment in narrative texts. Finally, we emphasize the potential of model ensembles.

pdf bib
Study on the Domain Adaption of Korean Speech Act using Daily Conversation Dataset and Petition Corpus
Youngsook Song | Won Ik Cho

In Korean, quantitative speech act studies have usually been conducted on single utterances with unspecified sources. In this study, we annotate sentences from the National Institute of Korean Language’s Messenger Corpus and the National Petition Corpus, as well as example sentences from an academic paper on contemporary Korean vlogging, and check the discrepancy between human annotation and model prediction. In particular, for sentences with differences in locutionary and illocutionary forces, we analyze the causes of errors to see if stylistic features used in a particular domain affect the correct inference of speech act. Through this, we see the necessity to build and analyze a balanced corpus in various text domains, taking into account cases with different usage roles, e.g., messenger conversations belonging to private conversations and petition corpus/vlogging script that have an unspecified audience.

pdf bib
Readability and Complexity: Diachronic Evolution of Literary Language Across 9000 Novels
Pascale Feldkamp | Yuri Bizzoni | Ida Marie S. Lassen | Mads Rosendahl Thomsen | Kristoffer Nielbo

Using a large corpus of English language novels from 1880 to 2000, we compare several textual features associated with literary quality, seeking to examine developments in literary language and narrative complexity through time. We show that while we find a correlation between the features, readability metrics are the only ones that exhibit a steady evolution, indicating that novels become easier to read through the 20th century but not simpler. We discuss the possibility of cultural selection as a factor and compare our findings with a subset of canonical works.

pdf bib
Bridging the Gap: Demonstrating the Applicability of Linguistic Analysis Tools in Digital Musicology
Sebastian Oliver Eck

This study introduces the novel concepts of Explicit and Implicit Musical Parameters (EMPs and IMPs) and demonstrates their application in digital musicology. Furthermore, it discusses the concept of ‘musical words’, that suggests representing explicit and implicit musical parameters as words or textual entities. This ‘music-to-text’ approach allows the application of advanced techniques and tools commonly used within the computational linguistics for the analysis of musical data, highlighting the structural parallels between music and language. Lastly, the findings of this paper not only illustrate the feasibility of this approach but also pave the way for further interdisciplinary studies and the advancement of analytical user-friendly tools that are applicable in both computational linguistics and digital musicology.

pdf bib
MITRA-zh: An efficient, open machine translation solution for Buddhist Chinese
Sebastian Nehrdich | Marcus Bingenheimer | Justin Brody | Kurt Keutzer

Buddhist Classical Chinese is a challenging low-resource language that has not yet received much dedicated attention in NLP research. Standard commercial machine translation software performs poorly on this idiom. In order to address this gap, we present a novel dataset of 209,454 bitext pairs for the training and 2.300 manually curated and corrected bitext pairs for the evaluation of machine translation models. We finetune a number of encoder-decoder models on this dataset and compare their performance against commercial models. We show that our best fine-tuned model outperforms the currently available commercial solutions by a considerable margin while being much more cost-efficient and faster in deployment. This is especially important for digital humanities, where large amounts of data need to be processed efficiently for corpus-level operations such as topic modeling or semantic search. We also show that the commercial chat system GPT4 is surprisingly strong on this task, at times reaching comparable performance to our finetuned model and clearly outperforming standard machine translation providers. We provide a limited case study where we examine the performance of selected different machine translation models on a number of Buddhist Chinese passages in order to demonstrate what level of quality these models reach at the moment.

pdf bib
Comparison on Heterosexual and Homosexual Woman’s Lonely Heart Ads in Taiwan: Taking AllTogether and Lesbian Board on PTT Web Forum as Examples
Yu-Hsuan Lin

This study aims to compare the lonely heart ads of heterosexual and homosexual women in Taiwan. The data was collected from the AllTogether (heterosexual) and Lesbian (homosexual) boards on the PTT web forum. Word frequency analysis and topic modeling were used to analyze the data. It was found that lesbians tend to state more on emotional and spiritual connection, using words that describe personality traits and changes in emotions. Heterosexual women, on the other hand, showed more concern about practical matters such as religion, occupation, and habits, possibly with the goal of building a family relationship through marriage and starting a family.