Ting-Hao Huang - ACL Anthology

Ting-Hao Huang

Also published as: Ting-Hao ‘Kenneth’ Huang, Ting-Hao Kenneth Huang, Ting-Hao 'Kenneth' Huang

2025

LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
Ho Yin Sam Ng | Edward Hsu | Aashish Anantha Ramakrishnan | Branislav Kveton | Nedim Lipka | Franck Dernoncourt | Dongwon Lee | Tong Yu | Sungchul Kim | Ryan A. Rossi | Ting-Hao Kenneth Huang
Findings of the Association for Computational Linguistics: EMNLP 2025

Figure captions are crucial for helping readers understand and remember a figure’s key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain’s style, highlighting the need for personalization. Despite language models’ personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document—each with its image, caption, and figure-mentioning paragraphs—as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.

From Noise to Nuance: Enriching Subjective Data Annotation through Qualitative Analysis
Ruyuan Wan | Haonan Wang | Ting-Hao Kenneth Huang | Jie Gao
Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)

Subjective data annotation (SDA) plays an important role in many NLP tasks, including sentiment analysis, toxicity detection, and bias identification. Conventional SDA often treats annotator disagreement as noise, overlooking its potential to reveal deeper insights. In contrast, qualitative data analysis (QDA) explicitly engages with diverse positionalities and treats disagreement as a meaningful source of knowledge. In this position paper, we argue that human annotators are a key source of valuable interpretive insights into subjective data beyond surface-level descriptions. Through a comparative analysis of SDA and QDA methodologies, we examine similarities and differences in task nature (e.g., human’s role, analysis content, cost, and completion conditions) and practice (annotation schema, annotation workflow, annotator selection, and evaluation). Based on this comparison, we propose five practical recommendations for enabling SDA to capture richer insights. We demonstrate these recommendations in a reinforcement learning from human feedback (RLHF) case study and envision that our interdisciplinary perspective will offer new directions for the field.

Using Contextually Aligned Online Reviews to Measure LLMs’ Performance Disparities Across Language Varieties
Zixin Tang | Chieh-Yang Huang | Tsung-che Li | Ho Yin Sam Ng | Hen-Hsen Huang | Ting-Hao Kenneth Huang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

A language can have different varieties. These varieties can affect the performance of natural language processing (NLP) models, including large language models (LLMs), which are often trained on data from widely spoken varieties. This paper introduces a novel and cost-effective approach to benchmark model performance across language varieties. We argue that international online review platforms, such as Booking.com, can serve as effective data sources for constructing datasets that capture comments in different language varieties from similar real-world scenarios, like reviews for the same hotel with the same rating using the same language (e.g., Mandarin Chinese) but different language varieties (e.g., Taiwan Mandarin, Mainland Mandarin). To prove this concept, we constructed a contextually aligned dataset comprising reviews in Taiwan Mandarin and Mainland Mandarin and tested six LLMs in a sentiment analysis task. Our results show that LLMs consistently underperform in Taiwan Mandarin.

Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)
Vishakh Padmakumar | Katy Gero | Thiemo Wambsganss | Sarah Sterman | Ting-Hao Huang | David Zhou | John Chung
Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)

Understanding Writing Assistants for Scientific Figure Captions: A Thematic Analysis
Ho Yin Sam Ng | Ting-Yao Hsu | Jiyoo Min | Sungchul Kim | Ryan A. Rossi | Tong Yu | Hyunggu Jung | Ting-Hao Kenneth Huang
Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)

Scientific figure captions are essential for communicating complex data but are often overlooked, leading to unclear or redundant descriptions. While many studies focus on generating captions as an ‘output’, little attention has been given to the writer’s process of crafting captions for scientific figures. This study examines how researchers use AI-generated captions to support caption writing. Through thematic analysis of interviews and video recordings with 18 participants from diverse disciplines, we identified four key themes: (1) integrating captions with figures and text, (2) bridging gaps between language proficiency and domain expertise, (3) leveraging multiple AI-generated suggestions, and (4) adapting to diverse writing norms. These findings provide actionable design insights for developing AI writing assistants that better support researchers in creating effective scientific figure captions.

Active Learning (AL) has been a powerful paradigm for improving model efficiency and performance by selecting the most informative data points for labeling and training. In recent active learning frameworks, Large Language Models (LLMs) have been employed not only for selection but also for generating entirely new data instances and providing more cost-effective annotations. Motivated by the increasing importance of high-quality data and efficient model training in the era of LLMs, we present a comprehensive survey on LLM-based Active Learning. We introduce an intuitive taxonomy that categorizes these techniques and discuss the transformative roles LLMs can play in the active learning loop. We further examine the impact of AL on LLM learning paradigms and its applications across various domains. Finally, we identify open challenges and propose future research directions. This survey aims to serve as an up-to-date resource for researchers and practitioners seeking to gain an intuitive understanding of LLM-based AL techniques and deploy them to new applications.

2024

CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds
Min-Hsuan Yeh | Ruyuan Wan | Ting-Hao Kenneth Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Detecting logical fallacies in texts can help users spot argument flaws, but automating this detection is not easy. Manually annotating fallacies in large-scale, real-world text data to create datasets for developing and validating detection models is costly. This paper introduces CoCoLoFa, the largest known logical fallacy dataset, containing 7,706 comments for 648 news articles, with each comment labeled for fallacy presence and type. We recruited 143 crowd workers to write comments embodying specific fallacy types (e.g., slippery slope) in response to news articles. Recognizing the complexity of this writing task, we built an LLM-powered assistant into the workers’ interface to aid in drafting and refining their comments. Experts rated the writing quality and labeling validity of CoCoLoFa as high and reliable. BERT-based models fine-tuned using CoCoLoFa achieved the highest fallacy detection (F1=0.86) and classification (F1=0.87) performance on its test set, outperforming the state-of-the-art LLMs. Our work shows that combining crowdsourcing and LLMs enables us to more effectively construct datasets for complex linguistic phenomena that crowd workers find challenging to produce on their own.

2023

Nationality Bias in Text Generation
Pranav Narayanan Venkit | Sanjana Gautam | Ruchi Panchanadikar | Ting-Hao Huang | Shomir Wilson
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Little attention has been paid to analyzing nationality bias in language models, even though nationality is widely used as a factor in increasing the performance of social NLP models. This paper examines how a text generation model, GPT-2, accentuates pre-existing societal biases about country-based demonyms. We generate stories using GPT-2 for various nationalities and use sensitivity analysis to explore how the number of internet users and the country’s economic status impact the sentiment of the stories. To reduce the propagation of biases through large language models (LLMs), we explore the debiasing method of adversarial triggering. Our results show that GPT-2 demonstrates significant bias against countries with fewer internet users, and adversarial triggering effectively reduces this bias.

Summaries as Captions: Generating Figure Captions for Scientific Documents with Automated Text Summarization
Chieh-Yang Huang | Ting-Yao Hsu | Ryan Rossi | Ani Nenkova | Sungchul Kim | Gromit Yeuk-Yin Chan | Eunyee Koh | Clyde Lee Giles | Ting-Hao 'Kenneth' Huang
Proceedings of the 16th International Natural Language Generation Conference

Good figure captions help paper readers understand complex scientific figures. Unfortunately, even published papers often have poorly written captions. Automatic caption generation could aid paper writers by providing good starting captions that can be refined for better quality. Prior work often treated figure caption generation as a vision-to-language task. In this paper, we show that it can be more effectively tackled as a text summarization task in scientific documents. We fine-tuned PEGASUS, a pre-trained abstractive summarization model, to specifically summarize figure-referencing paragraphs (e.g., “Figure 3 shows...”) into figure captions. Experiments on large-scale arXiv figures show that our method outperforms prior vision methods in both automatic and human evaluations. We further conducted an in-depth investigation focused on two key challenges: (i) the common presence of low-quality author-written captions and (ii) the lack of clear standards for good captions. Our code and data are available at: https://github.com/Crowd-AI-Lab/Generating-Figure-Captions-as-a-Text-Summarization-Task.

GPT-4 as an Effective Zero-Shot Evaluator for Scientific Figure Captions
Ting-Yao Hsu | Chieh-Yang Huang | Ryan Rossi | Sungchul Kim | Clyde Lee Giles | Ting-Hao 'Kenneth' Huang
Findings of the Association for Computational Linguistics: EMNLP 2023

There is growing interest in systems that generate captions for scientific figures. However, assessing these systems’ output poses a significant challenge. Human evaluation requires academic expertise and is costly, while automatic evaluation depends on often low-quality author-written captions. This paper investigates using large language models (LLMs) as a cost-effective, reference-free method for evaluating figure captions. We first constructed SCICAP-EVAL, a human evaluation dataset that contains human judgments for 3,600 scientific figure captions, both original and machine-made, for 600 arXiv figures. We then prompted LLMs like GPT-4 and GPT-3 to score (1-6) each caption based on its potential to aid reader understanding, given relevant context such as figure-mentioning paragraphs. Results show that GPT-4, used as a zero-shot evaluator, outperformed all other models and even surpassed assessments made by computer science undergraduates, achieving a Kendall correlation score of 0.401 with Ph.D. students’ rankings.

Good Data, Large Data, or No Data? Comparing Three Approaches in Developing Research Aspect Classifiers for Biomedical Papers
Shreya Chandrasekhar | Chieh-Yang Huang | Ting-Hao Huang
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

The rapid growth of scientific publications, particularly during the COVID-19 pandemic, emphasizes the need for tools to help researchers efficiently comprehend the latest advancements. One essential part of understanding scientific literature is research aspect classification, which categorizes sentences in abstracts into Background, Purpose, Method, and Finding. In this study, we investigate the impact of different datasets on model performance for the crowd-annotated CODA-19 research aspect classification task. Specifically, we explore the potential benefits of using the large, automatically curated PubMed 200K RCT dataset and evaluate the effectiveness of large language models (LLMs), such as LLaMA, GPT-3, ChatGPT, and GPT-4. Our results indicate that using the PubMed 200K RCT dataset does not improve performance for the CODA-19 task. We also observe that while GPT-4 performs well, it does not outperform the SciBERT model fine-tuned on the CODA-19 dataset, emphasizing the importance of a dedicated, task-aligned dataset for the target task.

Location-Aware Visual Question Generation with Lightweight Models
Nicholas Suwono | Justin Chen | Tun Hung | Ting-Hao Huang | I-Bin Liao | Yung-Hui Li | Lun-Wei Ku | Shao-Hua Sun
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This work introduces a novel task, location-aware visual question generation (LocaVQG), which aims to generate engaging questions from data relevant to a particular geographical location. Specifically, we represent such location-aware information with surrounding images and a GPS coordinate. To tackle this task, we present a dataset generation pipeline that leverages GPT-4 to produce diverse and sophisticated questions. Then, we aim to learn a lightweight model that can address the LocaVQG task and fit on an edge device, such as a mobile phone. To this end, we propose a method which can reliably generate engaging questions from location-aware information. Our proposed method outperforms baselines regarding human evaluation (e.g., engagement, grounding, coherence) and automatic evaluation metrics (e.g., BERTScore, ROUGE-2). Moreover, we conduct extensive ablation studies to justify our proposed techniques for both generating the dataset and solving the task.

2022

Are Shortest Rationales the Best Explanations for Human Understanding?
Hua Shen | Tongshuang Wu | Wenbo Guo | Ting-Hao Huang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Existing self-explaining models typically favor extracting the shortest possible rationales — snippets of an input text “responsible for” corresponding output — to explain the model prediction, with the assumption that shorter rationales are more intuitive to humans. However, this assumption has yet to be validated. Is the shortest rationale indeed the most human-understandable? To answer this question, we design a self-explaining model, LimitedInk, which allows users to extract rationales at any target length. Compared to existing baselines, LimitedInk achieves compatible end-task performance and human-annotated rationale agreement, making it a suitable representation of the recent class of self-explaining models. We use LimitedInk to conduct a user study on the impact of rationale length, where we ask human judges to predict the sentiment label of documents based only on LimitedInk-generated rationales with different lengths. We show rationales that are too short do not help humans predict labels better than randomly masked text, suggesting the need for more careful design of the best human rationales.

Multi-VQG: Generating Engaging Questions for Multiple Images
Min-Hsuan Yeh | Vincent Chen | Ting-Hao Huang | Lun-Wei Ku
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Generating engaging content has drawn much recent attention in the NLP community. Asking questions is a natural way to respond to photos and promote awareness. However, most answers to questions in traditional question-answering (QA) datasets are factoids, which reduce individuals’ willingness to answer. Furthermore, traditional visual question generation (VQG) confines the source data for question generation to single images, resulting in a limited ability to comprehend time-series information of the underlying event. In this paper, we propose generating engaging questions from multiple images. We present MVQG, a new dataset, and establish a series of baselines, including both end-to-end and dual-stage architectures. Results show that building stories behind the image sequence enables models to generate engaging questions, which confirms our assumption that people typically construct a picture of the event in their minds before asking questions. These results open up an exciting challenge for visual-and-language models to implicitly construct a story behind a series of photos to allow for creativity and experience sharing and hence draw attention to downstream applications.

Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)
Ting-Hao 'Kenneth' Huang | Vipul Raheja | Dongyeop Kang | John Joon Young Chung | Daniel Gissin | Mina Lee | Katy Ilonka Gero
Proceedings of the First Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2022)

Learning to Rank Visual Stories From Human Ranking Data
Chi-Yang Hsu | Yun-Wei Chu | Vincent Chen | Kuan-Chieh Lo | Chacha Chen | Ting-Hao Huang | Lun-Wei Ku
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Visual storytelling (VIST) is a typical vision and language task that has seen extensive development in the natural language generation research domain. However, it remains unclear whether conventional automatic evaluation metrics for text generation are applicable on VIST. In this paper, we present the VHED (VIST Human Evaluation Data) dataset, which first re-purposes human evaluation results for automatic evaluation; hence we develop Vrank (VIST Ranker), a novel reference-free VIST metric for story evaluation. We first show that the results from commonly adopted automatic metrics for text generation have little correlation with those obtained from human evaluation, which motivates us to directly utilize human evaluation results to learn the automatic evaluation model. In the experiments, we evaluate the generated texts to predict story ranks using our model as well as other reference-based and reference-free metrics. Results show that Vrank prediction is significantly more aligned to human evaluation than other metrics with almost 30% higher accuracy when ranking story pairs. Moreover, we demonstrate that only Vrank shows human-like behavior in its strong ability to find better stories when the quality gap between two stories is high. Finally, we show the superiority of Vrank by its generalizability to pure textual stories, and conclude that this reuse of human evaluation results puts Vrank in a strong position for continued future advances.

2021

Plot and Rework: Modeling Storylines for Visual Storytelling
Chi-yang Hsu | Yun-Wei Chu | Ting-Hao Huang | Lun-Wei Ku
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Stretch-VST: Getting Flexible With Visual Stories
Chi-yang Hsu | Yun-Wei Chu | Tsai-Lun Yang | Ting-Hao Huang | Lun-Wei Ku
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

In visual storytelling, a short story is generated based on a given image sequence. Despite years of work, most visual storytelling models remain limited in terms of the generated stories’ fixed length: most models produce stories with exactly five sentences because five-sentence stories dominate the training data. The fixed-length stories carry limited details and provide ambiguous textual information to the readers. Therefore, we propose to “stretch” the stories, which creates the potential to present in-depth visual details. This paper presents Stretch-VST, a visual storytelling framework that enables the generation of prolonged stories by adding appropriate knowledge, which is selected by the proposed scoring function. We propose a length-controlled Transformer to generate long stories. This model introduces novel positional encoding methods to maintain story quality with lengthy inputs. Experiments confirm that long stories are generated without deteriorating the quality. The human evaluation further shows that Stretch-VST can provide better focus and detail when stories are prolonged compared to the state of the art. We create a webpage to demonstrate our prolonged capability.

ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Yanjun Gao | Ting-Hao Huang | Rebecca J. Passonneau
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves performance comparable to two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.

Semantic Frame Forecast
Chieh-Yang Huang | Ting-Hao Huang
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper introduces Semantic Frame Forecast, a task that predicts the semantic frames that will occur in the next 10, 100, or even 1,000 sentences in a running story. Prior work focused on predicting the immediate future of a story, such as one to a few sentences ahead. However, when novelists write long stories, generating a few sentences is not enough to help them gain high-level insight to develop the follow-up story. In this paper, we formulate a long story as a sequence of “story blocks,” where each block contains a fixed number of sentences (e.g., 10, 100, or 200). This formulation allows us to predict the follow-up story arc beyond the scope of a few sentences. We represent a story block using the term frequencies (TF) of semantic frames in it, normalized by each frame’s inverse document frequency (IDF). We conduct semantic frame forecast experiments on 4,794 books from the Bookcorpus and 7,962 scientific abstracts from CODA-19, with block sizes ranging from 5 to 1,000 sentences. The results show that automated models can forecast the follow-up story blocks better than the random, prior, and replay baselines, indicating the feasibility of the task. We also learn that the models using the frame representation as features outperform all the existing approaches when the block size is over 150 sentences. The human evaluation also shows that the proposed frame representation, when visualized as word clouds, is comprehensible, representative, and specific to humans.
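The story-block representation described above can be illustrated with a minimal Python sketch. It assumes each sentence has already been reduced to a list of semantic frame names, treats each block as one "document" when computing IDF, and uses the plain tf * log(N/df) weighting; the exact weighting and IDF corpus used in the paper are not stated in the abstract. The function names split_into_blocks and tfidf_blocks are hypothetical.

```python
import math
from collections import Counter


def split_into_blocks(sentence_frames, block_size):
    """Group per-sentence frame lists into consecutive blocks of block_size sentences."""
    return [
        [frame for sent in sentence_frames[i:i + block_size] for frame in sent]
        for i in range(0, len(sentence_frames), block_size)
    ]


def tfidf_blocks(blocks):
    """Return one {frame: weight} dict per block.

    Assumption: IDF is log(N / df) with each block counted as one document.
    """
    n_blocks = len(blocks)
    doc_freq = Counter()
    for block in blocks:
        doc_freq.update(set(block))  # count each frame once per block
    idf = {frame: math.log(n_blocks / df) for frame, df in doc_freq.items()}

    vectors = []
    for block in blocks:
        counts = Counter(block)
        total = sum(counts.values()) or 1  # guard against an empty block
        vectors.append({f: (c / total) * idf[f] for f, c in counts.items()})
    return vectors


# Toy input: four sentences, each tagged with hypothetical frame names.
sentence_frames = [["Motion"], ["Motion", "Arriving"], ["Statement"], ["Motion"]]
blocks = split_into_blocks(sentence_frames, block_size=2)
vectors = tfidf_blocks(blocks)
```

In this toy input, a frame that appears in every block gets weight zero, so the representation emphasizes frames that distinguish one block from another.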

SciCap: Generating Captions for Scientific Figures
Ting-Yao Hsu | C Lee Giles | Ting-Hao Huang
Findings of the Association for Computational Linguistics: EMNLP 2021

Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing – including figure-type classification, sub-figure identification, text normalization, and caption text selection – SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.

Learning Clause Representation from Dependency-Anchor Graph for Connective Prediction
Yanjun Gao | Ting-Hao Huang | Rebecca J. Passonneau
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

Semantic representation that supports the choice of an appropriate connective between pairs of clauses inherently addresses discourse coherence, which is important for tasks such as narrative understanding, argumentation, and discourse parsing. We propose a novel clause embedding method that applies graph learning to a data structure we refer to as a dependency-anchor graph. The dependency-anchor graph incorporates two kinds of syntactic information, constituency structure and dependency relations, to highlight the subject and verb phrase relation. This enhances coherence-related aspects of representation. We design a neural model to learn a semantic representation for clauses from graph convolution over latent representations of the subject and verb phrase. We evaluate our method on two new datasets: a subset of a large corpus where the source texts are published novels, and a new dataset collected from students’ essays. The results demonstrate a significant improvement over tree-based models, confirming the importance of emphasizing the subject and verb phrase. The performance gap between the two datasets illustrates the challenges of analyzing students’ written text, plus a potential evaluation task for coherence modeling and an application for suggesting revisions to students.

FinQA: A Dataset of Numerical Reasoning over Financial Data
Zhiyu Chen | Wenhu Chen | Charese Smiley | Sameena Shah | Iana Borova | Dylan Langdon | Reema Moussa | Matt Beane | Ting-Hao Huang | Bryan Routledge | William Yang Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The sheer volume of financial statements makes it difficult for humans to access and analyze a business’s financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks on general domain, the finance domain includes complex numerical reasoning and understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments in our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset – the first of its kind – should therefore enable significant, new community research into complex application domains. The dataset and code are publicly available at https://github.com/czyssrs/FinQA.

2020

CODA-19: Using a Non-Expert Crowd to Annotate Research Aspects on 10,000+ Abstracts in the COVID-19 Open Research Dataset
Ting-Hao Kenneth Huang | Chieh-Yang Huang | Chien-Kuang Cornelia Ding | Yen-Chia Hsu | C. Lee Giles
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

This paper introduces CODA-19, a human-annotated dataset that codes the Background, Purpose, Method, Finding/Contribution, and Other sections of 10,966 English abstracts in the COVID-19 Open Research Dataset. CODA-19 was created by 248 crowd workers from Amazon Mechanical Turk within 10 days, and achieved labeling quality comparable to that of experts. Each abstract was annotated by nine different workers, and the final labels were acquired by majority vote. The inter-annotator agreement (Cohen’s kappa) between the crowd and the biomedical expert (0.741) is comparable to inter-expert agreement (0.788). CODA-19’s labels have an accuracy of 82.2% when compared to the biomedical expert’s labels, while the accuracy between experts was 85.0%. Reliable human annotations help scientists access and integrate the rapidly accelerating coronavirus literature, and also serve as the battery of AI/NLP research, but obtaining expert annotations can be slow. We demonstrated that a non-expert crowd can be rapidly employed at scale to join the fight against COVID-19.
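The two aggregation and agreement steps mentioned above, majority voting and Cohen’s kappa, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the tie-breaking rule, and the toy labels are assumptions.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label.

    Assumption: ties go to the label seen first, which is how
    Counter.most_common orders equal counts.
    """
    return Counter(labels).most_common(1)[0][0]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy data: nine hypothetical worker labels for one item.
worker_labels = ["Method"] * 5 + ["Finding"] * 4
final_label = majority_vote(worker_labels)

# Toy data: aggregated crowd labels vs. one expert on four items.
crowd = ["Background", "Method", "Finding", "Method"]
expert = ["Background", "Method", "Finding", "Finding"]
kappa = cohens_kappa(crowd, expert)
```

Kappa discounts the agreement that two annotators would reach by chance, which is why it is reported alongside raw accuracy.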

Assessing the Helpfulness of Learning Materials with Inference-Based Learner-Like Agent
Yun-Hsuan Jen | Chieh-Yang Huang | MeiHua Chen | Ting-Hao Huang | Lun-Wei Ku
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Many English-as-a-second language learners have trouble using near-synonym words (e.g., small vs. little; briefly vs. shortly) correctly, and often look for example sentences to learn how two nearly synonymous terms differ. Prior work uses hand-crafted scores to recommend sentences but has difficulty in adopting such scores to all the near-synonyms as near-synonyms differ in various ways. We notice that the helpfulness of the learning material would reflect on the learners’ performance. Thus, we propose the inference-based learner-like agent to mimic learner behavior and identify good learning materials by examining the agent’s performance. To enable the agent to behave like a learner, we leverage entailment modeling’s capability of inferring answers from the provided materials. Experimental results show that the proposed agent is equipped with good learner-like behavior to achieve the best performance in both fill-in-the-blank (FITB) and good example sentence selection tasks. We further conduct a classroom user study with college ESL learners. The results of the user study show that the proposed agent can find out example sentences that help students learn more easily and efficiently. Compared to other models, the proposed agent improves the score of more than 17% of students after learning.

2019

Visual Story Post-Editing
Ting-Yao Hsu | Chieh-Yang Huang | Yen-Chia Hsu | Ting-Hao Huang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We introduce the first dataset for human edits of machine-generated visual stories and explore how these collected edits may be used for the visual story post-editing task. The dataset, VIST-Edit, includes 14,905 human-edited versions of 2,981 machine-generated visual stories. The stories were generated by two state-of-the-art visual storytelling models, each aligned to 5 human-edited versions. We establish baselines for the task, showing how a relatively small set of human edits can be leveraged to boost the performance of large visual storytelling models. We also discuss the weak correlation between automatic evaluation scores and human ratings, motivating the need for new automatic metrics.

Proceedings of the Second Workshop on Storytelling
Francis Ferraro | Ting-Hao ‘Kenneth’ Huang | Stephanie M. Lukin | Margaret Mitchell
Proceedings of the Second Workshop on Storytelling

2018

Proceedings of the First Workshop on Storytelling
Margaret Mitchell | Ting-Hao ‘Kenneth’ Huang | Francis Ferraro | Ishan Misra
Proceedings of the First Workshop on Storytelling

EmotionLines: An Emotion Corpus of Multi-Party Conversations
Chao-Chun Hsu | Sheng-Yeh Chen | Chuan-Chun Kuo | Ting-Hao Huang | Lun-Wei Ku
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

MoodSwipe: A Soft Keyboard that Suggests Messages Based on User-Specified Emotions
Chieh-Yang Huang | Tristan Labetoulle | Ting-Hao Huang | Yi-Pei Chen | Hung-Chen Chen | Vallari Srivastava | Lun-Wei Ku
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present MoodSwipe, a soft keyboard that suggests text messages based on user-specified emotions, utilizing real dialog data. The aim of MoodSwipe is to create a convenient user interface for enjoying emotion classification and text suggestion technology, while at the same time automatically collecting labeled data for developing more advanced technologies. When users select the MoodSwipe keyboard, they can type as usual while sensing the emotion conveyed by their text and receiving suggestions for their message as a benefit. In MoodSwipe, the detected emotions serve as the medium for suggested texts, where viewing the latter is the incentive for correcting the former. We conduct several experiments to show the superiority of the emotion classification models trained on the dialog data, and further to verify that good emotion cues are important context for text suggestion.

2016

Sensing Emotions in Text Messages: An Application and Deployment Study of EmotionPush
Shih-Ming Wang | Chun-Hui Scott Lee | Yu-Chun Lo | Ting-Hao Huang | Lun-Wei Ku
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

Instant messaging and push notifications play important roles in modern digital life. To enable robust sense-making and rich context awareness in computer mediated communications, we introduce EmotionPush, a system that automatically conveys the emotion of received text with a colored push notification on mobile devices. EmotionPush is powered by state-of-the-art emotion classifiers and is deployed for Facebook Messenger clients on Android. The study showed that the system is able to help users prioritize interactions.

Visual Storytelling
Ting-Hao Kenneth Huang | Francis Ferraro | Nasrin Mostafazadeh | Ishan Misra | Aishwarya Agrawal | Jacob Devlin | Ross Girshick | Xiaodong He | Pushmeet Kohli | Dhruv Batra | C. Lawrence Zitnick | Devi Parikh | Lucy Vanderwende | Michel Galley | Margaret Mitchell
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

A Survey of Current Datasets for Vision and Language Research
Francis Ferraro | Nasrin Mostafazadeh | Ting-Hao Huang | Lucy Vanderwende | Jacob Devlin | Michel Galley | Margaret Mitchell
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

ACBiMA: Advanced Chinese Bi-Character Word Morphological Analyzer
Ting-Hao Huang | Yun-Nung Chen | Lingpeng Kong
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

2014

Social Metaphor Detection via Topical Analysis
Ting-Hao Kenneth Huang
International Journal of Computational Linguistics & Chinese Language Processing, Volume 19, Number 2, June 2014

2013

Social Metaphor Detection via Topical Analysis
Ting-Hao Huang
Proceedings of the IJCNLP 2013 Workshop on Natural Language Processing for Social Media (SocialNLP)

2012

Modeling Pollyanna Phenomena in Chinese Sentiment Analysis
Ting-Hao Huang | Ho-Cheng Yu | Hsin-Hsi Chen
Proceedings of COLING 2012: Demonstration Papers

領域相關詞彙極性分析及文件情緒分類之研究 (Domain Dependent Word Polarity Analysis for Sentiment Classification) [In Chinese]
Ho-Cheng Yu | Ting-Hao Kenneth Huang | Hsin-Hsi Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 17, Number 4, December 2012-Special Issue on Selected Papers from ROCLING XXIV

領域相關詞彙極性分析及文件情緒分類之研究 (Domain Dependent Word Polarity Analysis for Sentiment Classification) [In Chinese]
Ho-Cheng Yu | Ting-Hao Huang | Hsin-Hsi Chen
Proceedings of the 24th Conference on Computational Linguistics and Speech Processing (ROCLING 2012)

2011

Predicting Opinion Dependency Relations for Opinion Analysis
Lun-Wei Ku | Ting-Hao Huang | Hsin-Hsi Chen
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

Predicting Morphological Types of Chinese Bi-Character Words by Machine Learning Approaches
Ting-Hao Huang | Lun-Wei Ku | Hsin-Hsi Chen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents an overview of the morphological types of Chinese bi-character words and proposes a set of features for machine learning approaches to predict these types based on information about the composite characters. First, eight morphological types were defined, and 6,500 Chinese bi-character words were annotated with these types. After pre-processing, 6,178 words were selected to construct a corpus named Reduced Set. We analyzed Reduced Set and conducted an inter-annotator agreement test. The average kappa value of 0.67 indicates substantial agreement. Second, because the morphological types of bi-character words are considered strongly related to the parts of speech of their composite characters, we proposed a set of features that can be extracted simply from dictionaries to indicate each character’s part-of-speech tendency. Finally, we used these features with three machine learning algorithms, SVM, CRF, and Naïve Bayes, to predict the morphological types. On average, the best algorithm, CRF, achieved 75% of the annotators’ performance.

Construction of a Chinese Opinion Treebank
Lun-Wei Ku | Ting-Hao Huang | Hsin-Hsi Chen
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we construct the Chinese Opinion Treebank for opinion analysis research, based on the syntactically structured Chinese Treebank corpus. We introduce the tagging scheme and develop a tagging tool for constructing this corpus. Annotated samples are described. Information including opinions (yes or no), their polarities (positive, neutral, or negative), and their types (expression, status, or action) is defined and annotated. In addition, five structure trios are introduced according to the linguistic relations between two Chinese words. The four that are possibly related to opinions are also annotated in the constructed corpus to provide linguistic cues. The number of opinion sentences, together with the numbers of their polarities, opinion types, and trio types, is calculated. These statistics are compared and discussed. To assess the quality of the annotations in this corpus, the kappa values of the annotations are calculated. The substantial agreement between annotations ensures the applicability and reliability of the constructed corpus.

2009

Using Morphological and Syntactic Structures for Chinese Opinion Analysis
Lun-Wei Ku | Ting-Hao Huang | Hsin-Hsi Chen
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Co-authors

Francis Ferraro 4

Margaret Mitchell 4

Ho Yin Sam Ng 3

Ryan A. Rossi 3

John Joon Young Chung 2

Franck Dernoncourt 2

Michel Galley 2

Branislav Kveton 2

Nasrin Mostafazadeh 2

Rebecca J. Passonneau 2

Lucy Vanderwende 2

Min-Hsuan Yeh 2

Aishwarya Agrawal 1

Nesreen K. Ahmed 1

Aashish Anantha Ramakrishnan 1

Gromit Yeuk-Yin Chan 1

Shreya Chandrasekhar 1

Yun-Nung Chen 1

Hung-Chen Chen 1

Sheng-Yeh Chen 1

Hanieh Deilamsalehy 1

Chien-Kuang Cornelia Ding 1

Sanjana Gautam 1

Katy Ilonka Gero 1

Ross Girshick 1

Daniel Gissin 1

Chao-Chun Hsu 1

Hen-Hsen Huang 1

Yun-Hsuan Jen 1

Dongyeop Kang 1

Pushmeet Kohli 1

Lingpeng Kong 1

Chuan-Chun Kuo 1

Tristan Labetoulle 1

Dylan Langdon 1

Chun-Hui Scott Lee 1

Kuan-Chieh Lo 1

Stephanie Lukin 1

Puneet Mathur 1

Julian McAuley 1

Subhojyoti Mukherjee 1

Koyel Mukherjee 1

Pranav Narayanan Venkit 1

Thien Huu Nguyen 1

Vishakh Padmakumar 1

Soumyabrata Pal 1

Ruchi Panchanadikar 1

Bryan R. Routledge 1

Charese Smiley 1

Vallari Srivastava 1

Sarah Sterman 1

Nicholas Suwono 1

Thiemo Wambsganss 1

Shih-Ming Wang 1

William Yang Wang 1

Shomir Wilson 1

Tongshuang Wu 1

Tsai-Lun Yang 1

Seunghyun Yoon 1

C. Lawrence Zitnick 1

Venues