Xiaoyan Qu


2024

TM-TREK at SemEval-2024 Task 8: Towards LLM-Based Automatic Boundary Detection for Human-Machine Mixed Text
Xiaoyan Qu | Xiangfeng Meng
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

With the increasing prevalence of text generated by large language models (LLMs), there is a growing concern about distinguishing between LLM-generated and human-written texts in order to prevent the misuse of LLMs, such as the dissemination of misleading information and academic dishonesty. Previous research has primarily focused on classifying text as either entirely human-written or LLM-generated, neglecting the detection of mixed texts that contain both types of content. This paper explores LLMs’ ability to identify boundaries in human-written and machine-generated mixed texts. We approach this task by transforming it into a token classification problem and regard the label turning point as the boundary. Notably, our ensemble model of LLMs achieved first place in the ‘Human-Machine Mixed Text Detection’ sub-task of the SemEval’24 Competition Task 8. Additionally, we investigate factors that influence the capability of LLMs in detecting boundaries within mixed texts, including the incorporation of extra layers on top of LLMs, combination of segmentation loss, and the impact of pretraining. Our findings aim to provide valuable insights for future research in this area.
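
The idea of reading a boundary off per-token labels can be illustrated with a minimal sketch. It is not the authors' implementation: the label convention (0 = human-written, 1 = machine-generated, human text first) and both function names are assumptions made for illustration. The first function takes the literal first label change; the second is a noise-tolerant variant that picks the split disagreeing least with a clean "zeros then ones" pattern.

```python
from typing import List, Optional


def first_turning_point(labels: List[int]) -> Optional[int]:
    """Return the index of the first token whose label differs from the previous one."""
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            return i
    return None  # no label change: the text is predicted to be of one type


def best_split(labels: List[int]) -> int:
    """Return the split k whose pattern 'k zeros, then ones' disagrees least with labels.

    Assumes 0 = human and 1 = machine, with the human part first.
    """
    # For k = 0 the pattern is all ones, so every 0 label is a disagreement.
    errors = labels.count(0)
    best_k, best_err = 0, errors
    for k in range(1, len(labels) + 1):
        # Moving the split past token k-1 turns its expected label from 1 into 0.
        errors += 1 if labels[k - 1] == 1 else -1
        if errors < best_err:
            best_k, best_err = k, errors
    return best_k
```

With a noisy prediction such as `[0, 1, 0, 0, 1, 1]`, the first turning point is index 1, while the least-disagreement split is index 4, which shows why a single misclassified token can matter when the raw turning point is used.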

2023

Samsung Research China - Beijing at SemEval-2023 Task 2: An AL-R Model for Multilingual Complex Named Entity Recognition
Haojie Zhang | Xiao Li | Renhua Gu | Xiaoyan Qu | Xiangfeng Meng | Shuo Hu | Song Liu
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes our system for SemEval-2023 Task 2 Multilingual Complex Named Entity Recognition (MultiCoNER II). Our team Samsung Research China - Beijing proposes an AL-R (Adjustable Loss RoBERTa) model to boost the performance of recognizing short and complex entities with the challenges of long-tail data distribution, out of knowledge base and noise scenarios. We first employ an adjustable dice loss optimization objective to overcome the issue of long-tail data distribution, which is also proved to be noise-robust, especially in combatting the issue of fine-grained label confusion. Besides, we develop our own knowledge enhancement tool to provide related contexts for the short context setting and address the issue of out of knowledge base. Experiments have verified the validity of our approaches.
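
For readers unfamiliar with dice-style objectives, the following is a minimal sketch of the standard soft Dice loss for one class. It is not the paper's adjustable variant, whose exact form is not given in the abstract; the function name and the `smooth` parameter are assumptions made for illustration.

```python
from typing import Sequence


def soft_dice_loss(probs: Sequence[float], targets: Sequence[int], smooth: float = 1.0) -> float:
    """Standard soft Dice loss for a single class.

    probs   -- predicted probabilities that each token belongs to the class
    targets -- gold labels, 1 if the token belongs to the class, else 0
    smooth  -- smoothing constant (hypothetical default) that avoids division by zero
    """
    if len(probs) != len(targets):
        raise ValueError("probs and targets must have the same length")
    intersection = sum(p * t for p, t in zip(probs, targets))
    total = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)
```

Tokens that are correctly predicted as negative (probability 0, label 0) add nothing to either sum, so the loss is driven by overlap with the positive class rather than by the abundant negatives. This is the usual motivation for dice-style losses on imbalanced label distributions.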