Yi Cao


2024

pdf bib
EpiGEN: An Efficient Multi-Api Code GENeration Framework under Enterprise Scenario
Sijie Li | Sha Li | Hao Zhang | Shuyang Li | Kai Chen | Jianyong Yuan | Yi Cao | Lvqing Yang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In recent years, Large Language Models (LLMs) have demonstrated exceptional performance in code-generation tasks. However, under enterprise scenarios where private APIs are pre-built, general LLMs often fail to meet expectations. Existing approaches are confronted with drawbacks of high resource consumption and inadequate handling of multi-API tasks. To address these challenges, we propose EpiGEN, an Efficient multi-Api code GENeration framework under enterprise scenario. It consists of three core modules: Task Decomposition Module (TDM), API Retrieval Module (ARM), and Code Generation Module (CGM), in which Langchain played an important role. Through a series of experiments, EpiGEN shows good acceptability and readability, compared to fully fine-tuned LLM with a larger number of parameters. Particularly, in medium and hard level tasks, the performance of EpiGEN on a single-GPU machine even surpasses that of a fully fine-tuned LLM that requires multi-GPU configuration. Generally, EpiGEN is model-size agnostic, facilitating a balance between the performance of code generation and computational requirements.

2023

pdf bib
PaniniQA: Enhancing Patient Education Through Interactive Question Answering
Pengshan Cai | Zonghai Yao | Fei Liu | Dakuo Wang | Meghan Reilly | Huixue Zhou | Lingxi Li | Yi Cao | Alok Kapoor | Adarsha Bajracharya | Dan Berlowitz | Hong Yu
Transactions of the Association for Computational Linguistics, Volume 11

A patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions (Zhao et al., 2017). In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients’ discharge instructions and then formulates patient-specific educational questions. In addition, PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients’ misunderstandings. Our comprehensive automatic & human evaluation results demonstrate our PaniniQA is capable of improving patients’ mastery of their medical instructions through effective interactions.1

pdf bib
Neural Topic Modeling based on Cycle Adversarial Training and Contrastive Learning
Boyu Wang | Linhai Zhang | Deyu Zhou | Yi Cao | Jiandong Ding
Findings of the Association for Computational Linguistics: ACL 2023

Neural topic models have been widely used to extract common topics across documents. Recently, contrastive learning has been applied to variational autoencoder-based neural topic models, achieving promising results. However, due to the limitation of the unidirectional structure of the variational autoencoder, the encoder is enhanced with the contrastive loss instead of the decoder, leading to a gap between model training and evaluation. To address the limitation, we propose a novel neural topic modeling framework based on cycle adversarial training and contrastive learning to apply contrastive learning on the generator directly. Specifically, a self-supervised contrastive loss is proposed to make the generator capture similar topic information, which leads to better topic-word distributions. Meanwhile, a discriminative contrastive loss is proposed to cooperate with the self-supervised contrastive loss to balance the generation and discrimination. Moreover, based on the reconstruction ability of the cycle generative adversarial network, a novel data augmentation strategy is designed and applied to the topic distribution directly. Experiments have been conducted on four benchmark datasets and results show that the proposed approach outperforms competitive baselines.