Yi Sun

2025

OccuTriage: An AI Agent Orchestration Framework for Occupational Health Triage Prediction
Alok Kumar Sahu | Yi Sun | Eamonn Swanton | Farshid Amirabdollahian | Abi Wren
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Occupational Health (OH) triage is a systematic process for evaluating and prioritising workplace health concerns to determine appropriate care and interventions. This research addresses critical triage challenges through our novel AI agent orchestration framework, OccuTriage, developed in collaboration with Healthcare Provider. Our framework simulates healthcare professionals’ reasoning using specialized LLM agents, retrieval augmentation with domain-specific knowledge, and a bidirectional decision architecture. Experimental evaluation on 2,589 OH cases demonstrates OccuTriage outperforms single-agent approaches with a 20.16% average discordance rate compared to baseline rates of 43.05%, while matching or exceeding human expert performance (25.11%). The system excels in reducing under-triage rates, achieving 9.84% and 3.1% for appointment and assessor type decisions respectively. These results establish OccuTriage’s efficacy in performing complex OH triage while maintaining safety and optimizing resource allocation.

pdf bib abs

Recent work has demonstrated the remarkable potential of Large Language Models (LLMs) in test-time scaling. By making models think before answering, they are able to achieve much higher accuracy with extra inference computation.However, in many real-world scenarios, models are used under time constraints, where an answer should be given within a certain output length. It is unclear whether and how the reasoning ability of different LLMs remain effective under strict constraints.We take a first look at this problem by conducting an in-depth empirical study. Specifically, we test 30 LLMs on common reasoning datasets under a wide range of output length budgets, and we analyze the correlation between the inference accuracy and various properties including model type, model size, prompt style, etc. We also consider the mappings between token budgets and actual on-device latency budgets.The results have demonstrated several interesting findings regarding the budget-aware LLM reasoning ability that differ from the unconstrained situation, e.g. the optimal choices of either model size or prompt style change under different budgets. These findings offer timely evaluation to this area and practical guidance for users to deploy LLMs under real-world latency constraints.

2022

pdf bib abs

NSP-BERT: A Prompt-based Few-Shot Learner through an Original Pre-training Task —— Next Sentence Prediction
Yi Sun | Yu Zheng | Chao Hao | Hangping Qiu
Proceedings of the 29th International Conference on Computational Linguistics

Using prompts to utilize language models to perform various downstream tasks, also known as prompt-based learning or prompt-learning, has lately gained significant success in comparison to the pre-train and fine-tune paradigm. Nonetheless, virtually most prompt-based methods are token-level such as PET based on mask language model (MLM). In this paper, we attempt to accomplish several NLP tasks in the zero-shot and few-shot scenarios using a BERT original pre-training task abandoned by RoBERTa and other models——Next Sentence Prediction (NSP). Unlike token-level techniques, our sentence-level prompt-based method NSP-BERT does not need to fix the length of the prompt or the position to be predicted, allowing it to handle tasks such as entity linking with ease. NSP-BERT can be applied to a variety of tasks based on its properties. We present an NSP-tuning approach with binary cross-entropy loss for single-sentence classification tasks that is competitive compared to PET and EFL. By continuing to train BERT on RoBERTa’s corpus, the model’s performance improved significantly, which indicates that the pre-training corpus is another important determinant of few-shot besides model size and prompt method.

pdf bib abs

Ensuring relevance quality in product search is a critical task as it impacts the customer’s ability to find intended products in the short-term as well as the general perception and trust of the e-commerce system in the long term. In this work we leverage a high-precision cross-encoder BERT model for semantic similarity between customer query and products and survey its effectiveness for three ranking applications where offline-generated scores could be used: (1) as an offline metric for estimating relevance quality impact, (2) as a re-ranking feature covering head/torso queries, and (3) as a training objective for optimization. We present results on effectiveness of this strategy for the large e-commerce setting, which has general applicability for choice of other high-precision models and tasks in ranking.

pdf bib abs

In E-commerce search, spelling correction plays an important role to find desired products for customers in processing user-typed search queries. However, resolving phonetic errors is a critical but much overlooked area. The query with phonetic spelling errors tends to appear correct based on pronunciation but is nonetheless inaccurate in spelling (e.g., “bluetooth sound system” vs. “blutut sant sistam”) with numerous noisy forms and sparse occurrences. In this work, we propose a generalized spelling correction system integrating phonetics to address phonetic errors in E-commerce search without additional latency cost. Using India (IN) E-commerce market for illustration, the experiment shows that our proposed phonetic solution significantly improves the F1 score by 9%+ and recall of phonetic errors by 8%+. This phonetic spelling correction system has been deployed to production, currently serving hundreds of millions of customers.

pdf bib abs

Using Natural Sentence Prompts for Understanding Biases in Language Models
Sarah Alnegheimish | Alicia Guo | Yi Sun
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Evaluation of biases in language models is often limited to synthetically generated datasets. This dependence traces back to the need of prompt-style dataset to trigger specific behaviors of language models. In this paper, we address this gap by creating a prompt dataset with respect to occupations collected from real-world natural sentences present in Wikipedia.We aim to understand the differences between using template-based prompts and natural sentence prompts when studying gender-occupation biases in language models. We find bias evaluations are very sensitiveto the design choices of template prompts, and we propose using natural sentence prompts as a way of more systematically using real-world sentences to move away from design decisions that may bias the results.