Po-Yu Huang


2024

pdf bib
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
Jing Yu Koh | Robert Lo | Lawrence Jang | Vikram Duvvur | Ming Lim | Po-Yu Huang | Graham Neubig | Shuyan Zhou | Russ Salakhutdinov | Daniel Fried
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on *realistic visually grounded tasks*. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web.

2017

pdf bib
NTU-1 at SemEval-2017 Task 12: Detection and classification of temporal events in clinical data with domain adaptation
Po-Yu Huang | Hen-Hsen Huang | Yu-Wun Wang | Ching Huang | Hsin-Hsi Chen
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This study proposes a system to participate in the Clinical TempEval 2017 shared task, a part of the SemEval 2017 Tasks. Domain adaptation was the main challenge this year. We took part in the supervised domain adaption where data of 591 records of colon cancer patients and 30 records of brain cancer patients from Mayo clinic were given and we are asked to analyze the records from brain cancer patients. Based on the THYME corpus released by the organizer of Clinical TempEval, we propose a framework that automatically analyzes clinical temporal events in a fine-grained level. Support vector machine (SVM) and conditional random field (CRF) were implemented in our system for different subtasks, including detecting clinical relevant events and time expression, determining their attributes, and identifying their relations with each other within the document. The results showed the capability of domain adaptation of our system.