2024
QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism
Bo Wang | Heyan Huang | Yixin Cao | Jiahao Ying | Wei Tang | Chong Feng
Findings of the Association for Computational Linguistics: EMNLP 2024
While LLMs have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanisms offer a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques rely on static knowledge integration, which adapts insufficiently to task-specific needs and misses relationships across segments, hindering the dynamic reorganization and logical combination of relevant segments during the response process. To address these issues, we introduce a novel strategy, the Question then Reflection Memory Mechanism (QRMeM), which incorporates a dual-structured memory pool. This pool synergizes static textual content with structured graph guidance, fostering a reflective trial-and-error approach for navigating and identifying relevant segments. Our evaluation on multiple-choice question (MCQ) and multi-document question answering (Multi-doc QA) benchmarks showcases QRMeM’s enhanced performance compared to existing approaches.
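As a rough illustration of the mechanism the abstract describes, the sketch below pairs a static chunk store with an entity graph and retries retrieval when a reflection step rejects an answer. All names, the scoring rule, and the trial loop are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a dual-structured memory pool: static text chunks
# plus an entity graph that guides which segment to try next when a
# reflection step rejects an answer.
from collections import defaultdict

class DualMemoryPool:
    def __init__(self):
        self.chunks = {}                           # static textual memory
        self.entity_to_chunks = defaultdict(set)   # structured graph guidance

    def add(self, chunk_id, text, entities):
        self.chunks[chunk_id] = text
        for entity in entities:
            self.entity_to_chunks[entity].add(chunk_id)

    def retrieve(self, question_entities, exclude=frozenset()):
        # rank chunks by how many question entities point at them
        scores = defaultdict(int)
        for entity in question_entities:
            for cid in self.entity_to_chunks[entity]:
                if cid not in exclude:
                    scores[cid] += 1
        return sorted(scores, key=scores.get, reverse=True)

def answer_with_reflection(pool, question_entities, answer_fn, reflect_fn,
                           max_trials=3):
    tried = set()
    for _ in range(max_trials):
        candidates = pool.retrieve(question_entities, exclude=tried)
        if not candidates:
            break
        best = candidates[0]
        answer = answer_fn(pool.chunks[best])
        if reflect_fn(answer):      # reflection accepts the answer
            return answer
        tried.add(best)             # trial-and-error: discard and retry
    return None
```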
Fundamental Capabilities of Large Language Models and their Applications in Domain Scenarios: A Survey
Jiawei Li | Yizhe Yang | Yu Bai | Xiaofeng Zhou | Yinghao Li | Huashan Sun | Yuhang Liu | Xingpeng Si | Yuhao Ye | Yixiao Wu | Yiguan Lin (林一冠) | Bin Xu | Bowen Ren | Chong Feng | Yang Gao | Heyan Huang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) demonstrate significant value in domain-specific applications, benefiting from their fundamental capabilities. Nevertheless, it is still unclear which fundamental capabilities contribute to success in specific domains. Moreover, existing benchmark-based evaluation cannot effectively reflect performance in real-world applications. In this survey, we review recent advances of LLMs in domain applications, aiming to summarize the fundamental capabilities and how they work in combination. Furthermore, we establish connections between fundamental capabilities and specific domains, and evaluate the varying importance of different capabilities. Based on our findings, we propose a reliable strategy for choosing more robust backbone LLMs for real-world, domain-specific applications.
RAAMove: A Corpus for Analyzing Moves in Research Article Abstracts
Hongzheng Li | Ruojin Wang | Ge Shi | Xing Lv | Lei Lei | Chong Feng | Fang Liu | Jinkun Lin | Yangguang Mei | Linnan Xu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Move structures have been studied in English for Specific Purposes (ESP) and English for Academic Purposes (EAP) for decades. However, there are few move-annotated corpora for Research Article (RA) abstracts. In this paper, we introduce RAAMove, a comprehensive multi-domain corpus dedicated to the annotation of move structures in RA abstracts. The primary objective of RAAMove is to facilitate move analysis and automatic move identification. This paper provides a thorough discussion of the corpus construction process, including the annotation scheme, data collection, annotation guidelines, and annotation procedures. The corpus is constructed in two stages: first, expert annotators manually annotate high-quality data; then, based on the human-annotated data, a BERT-based model is employed for automatic annotation, with experts correcting its output. The result is a large-scale, high-quality corpus comprising 33,988 annotated instances. We also conduct preliminary move identification experiments with the BERT-based model to verify the effectiveness of the proposed corpus and model. The annotated corpus is available for academic research purposes and can serve as an essential resource for move analysis, English language teaching and writing, and move- and discourse-related tasks in Natural Language Processing (NLP).
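For a sense of what automatic move identification involves, here is a minimal sketch of a BERT-based sentence classifier of the kind the abstract mentions. The move label set and the bert-base-uncased checkpoint are assumptions for illustration; the paper defines its own annotation scheme and model.

```python
# Minimal sketch of BERT-based move classification; the label set below is
# hypothetical, not RAAMove's actual scheme.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MOVES = ["background", "purpose", "method", "result", "conclusion"]  # illustrative

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(MOVES)
)

def predict_move(sentence: str) -> str:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return MOVES[logits.argmax(dim=-1).item()]
```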
2023
Boosting Event Extraction with Denoised Structure-to-Text Augmentation
Bo Wang | Heyan Huang | Xiaochi Wei | Ge Shi | Xiao Liu | Chong Feng | Tong Zhou | Shuaiqiang Wang | Dawei Yin
Findings of the Association for Computational Linguistics: ACL 2023
Event extraction aims to recognize pre-defined event triggers and arguments from texts, a task that suffers from a lack of high-quality annotations. In most NLP applications, incorporating large-scale synthetic training data is a practical and effective way to alleviate data scarcity. However, when applied to event extraction, recent data augmentation methods often neglect grammatical incorrectness, structure misalignment, and semantic drift, leading to unsatisfactory performance. To solve these problems, we propose a denoised structure-to-text augmentation framework for event extraction (DAEE), which generates additional training data through a knowledge-based structure-to-text generation model and iteratively selects an effective subset of the generated data with a deep reinforcement learning agent. Experimental results on several datasets demonstrate that the proposed method generates more diverse text representations for event extraction and achieves results comparable to the state-of-the-art.
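A schematic sketch of the generate-then-select loop the abstract outlines, under stated assumptions: generate_fn stands in for the knowledge-based structure-to-text model, policy for the deep reinforcement learning agent, and train_fn/eval_fn for extractor training and dev-set evaluation. None of these names come from the paper.

```python
# Schematic sketch of the denoised augmentation loop; every component is a
# placeholder for the paper's generation model and RL selection agent.
def augment_iteratively(event_structures, generate_fn, train_fn, eval_fn,
                        policy, n_rounds=5):
    selected = []
    for _ in range(n_rounds):
        candidates = [generate_fn(s) for s in event_structures]  # structure-to-text
        picked = [c for c in candidates if policy.keep(c)]       # agent selects a subset
        extractor = train_fn(selected + picked)                  # retrain event extractor
        reward = eval_fn(extractor)                              # dev-set feedback
        policy.update(picked, reward)                            # reinforce good selections
        selected.extend(picked)
    return selected
```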
2021
Enlivening Redundant Heads in Multi-head Self-attention for Machine Translation
Tianfu Zhang | Heyan Huang | Chong Feng | Longbing Cao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Multi-head self-attention has recently attracted enormous interest owing to its specialized functions, significant parallelizable computation, and flexible extensibility. However, recent empirical studies show that some self-attention heads contribute little and can be pruned as redundant. This work takes the novel perspective of identifying and then vitalizing redundant heads. We propose a redundant head enlivening (RHE) method that precisely identifies redundant heads and then realizes their potential by learning syntactic relations and prior knowledge in the text, without sacrificing the roles of important heads. Two novel syntax-enhanced attention (SEA) mechanisms, a dependency mask bias and a relative local-phrasal position bias, are introduced to revise self-attention distributions for syntactic enhancement in machine translation. The importance of individual heads is evaluated dynamically during redundant head identification, after which we apply SEA to vitalize redundant heads while maintaining the strength of important heads. Experimental results on the widely adopted WMT14 and WMT16 English-to-German and English-to-Czech machine translation tasks validate the effectiveness of RHE.
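A minimal sketch of how the two SEA biases could enter scaled dot-product attention: a large negative additive bias where no dependency arc exists, plus an additive relative-position bias. Shapes and the exact form of each bias are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical sketch: attention scores revised by a dependency mask bias
# and a relative local-phrasal position bias (both additive).
import torch
import torch.nn.functional as F

def sea_attention(q, k, v, dep_bias, rel_pos_bias):
    # q, k, v: (batch, heads, seq, dim)
    # dep_bias: (seq, seq), e.g. 0.0 on dependency arcs and a large negative
    # value elsewhere; rel_pos_bias: (seq, seq) derived from relative offsets
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    scores = scores + dep_bias + rel_pos_bias   # syntax-enhanced revision
    return F.softmax(scores, dim=-1) @ v
```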
2020
面向司法领域的高质量开源藏汉平行语料库构建(A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain)
Jiu Sha (沙九) | Luqin Zhou (周鹭琴) | Chong Feng (冯冲) | Hongzheng Li (李洪政) | Tianfu Zhang (张天夫) | Hui Hui (慧慧)
Proceedings of the 19th Chinese National Conference on Computational Linguistics
Tibetan-Chinese machine translation for the judicial domain faces a severe data-sparsity problem. This paper studies the problem from two angles. First, compared with the general domain, judicial Tibetan demands more rigorous logical expression and far more technical terminology, yet existing Tibetan resources lack the corresponding judicial corpora, specialized terms, and syntactic structures. Second, Tibetan's distinctive lexical expressions and syntactic structures make general-purpose corpus-construction methods poorly suited to building Tibetan-Chinese parallel corpora. We therefore propose a lightweight construction method for judicial-domain Tibetan-Chinese parallel corpora. First, we manually annotate a medium-scale table of judicial Tibetan-Chinese technical terms as a prior knowledge base, avoiding the logical-expression errors and missing domain terms that arise when crossing domain boundaries. Second, we collect instance data, such as judgment documents, from the official websites of local courts across China, prioritizing Tibetan-language instances over Chinese ones so that distinctive lexical expressions and sentence structures are not lost when Tibetan sentences are later constructed. Following these principles, we build a high-quality Tibetan-Chinese parallel corpus through web crawling, rule-based passage-alignment detection, sentence-boundary identification, and automatic corpus cleaning. We ultimately construct a judicial-domain Tibetan-Chinese corpus on the order of 160,000 sentence pairs, and experiments with multiple translation models and cross-validation confirm its quality and robustness. The corpus will be open-sourced for use in related research.
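As one concrete example of the pipeline's automatic cleaning stage, the filter below keeps only pairs whose sides contain the expected scripts and whose length ratio is plausible. The Unicode ranges and ratio thresholds are illustrative heuristics, not the paper's exact rules.

```python
# Hypothetical cleaning filter for crawled Tibetan-Chinese sentence pairs.
import re

TIBETAN = re.compile(r"[\u0F00-\u0FFF]")  # Tibetan Unicode block
CHINESE = re.compile(r"[\u4E00-\u9FFF]")  # CJK Unified Ideographs

def keep_pair(bo_sent: str, zh_sent: str) -> bool:
    # drop pairs with a wrong-language or empty side
    if not TIBETAN.search(bo_sent) or not CHINESE.search(zh_sent):
        return False
    # drop pairs whose length ratio suggests misalignment
    ratio = len(bo_sent) / max(len(zh_sent), 1)
    return 0.5 <= ratio <= 3.0
```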
2019
A Context-based Framework for Modeling the Role and Function of On-line Resource Citations in Scientific Literature
He Zhao | Zhunchen Luo | Chong Feng | Anqing Zheng | Xiaopeng Liu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
We introduce a new task of modeling the role and function of on-line resource citations in scientific literature. Categorizing on-line resources and analyzing the purpose of resource citations in scientific texts can greatly help resource search and recommendation systems to better understand and manage scientific resources. For this novel task, we are the first to create an annotation scheme, which models different granularities of information from a hierarchical perspective. We also construct SciRes, a dataset of 3,088 manually annotated resource contexts. We propose a multi-task framework that builds a scientific resource classifier (SciResCLF) to jointly recognize role and function types, and then use the classification results to support a scientific resource recommendation (SciResREC) task. Experiments show that our model achieves the best results on both the classification and the recommendation task. The SciRes dataset is released for future research.
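A minimal sketch of a multi-task classifier with a shared encoder and separate heads for citation role and function, in the spirit of SciResCLF. The encoder choice, dimensions, and label counts are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical shared-encoder multi-task classifier for citation contexts.
import torch
import torch.nn as nn

class MultiTaskCitationClassifier(nn.Module):
    def __init__(self, vocab_size=30000, dim=256, n_roles=5, n_functions=7):
        super().__init__()
        self.encoder = nn.EmbeddingBag(vocab_size, dim)  # shared context encoder
        self.role_head = nn.Linear(dim, n_roles)         # task 1: resource role
        self.func_head = nn.Linear(dim, n_functions)     # task 2: citation function

    def forward(self, token_ids, offsets):
        h = self.encoder(token_ids, offsets)
        return self.role_head(h), self.func_head(h)

model = MultiTaskCitationClassifier()
role_logits, func_logits = model(torch.tensor([1, 2, 3]), torch.tensor([0]))
```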
2018
Genre Separation Network with Adversarial Training for Cross-genre Relation Extraction
Ge Shi | Chong Feng | Lifu Huang | Boliang Zhang | Heng Ji | Lejian Liao | Heyan Huang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Relation extraction suffers from a dramatic performance decrease when a model trained on one genre is applied directly to a new genre, owing to their distinct feature distributions. Previous studies address this problem by discovering a shared space across genres using manually crafted features, which requires great human effort. To automate this process, we design a genre-separation network that applies two encoders, one genre-independent and one genre-shared, to explicitly extract genre-specific and genre-agnostic features. We then train a relation classifier on the genre-agnostic features of the source genre and apply it directly to the target genre. Experimental results on three distinct genres of the ACE dataset show that our approach achieves up to a 6.1% absolute F1-score gain over previous methods. By incorporating a set of external linguistic features, our approach outperforms the state-of-the-art by 1.7% absolute F1. We make all programs of our model publicly available for research purposes.
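A minimal sketch of the two-encoder idea, using a gradient-reversal layer as the adversarial component pushing the shared encoder toward genre-agnostic features. The encoder architectures and sizes are illustrative assumptions; the paper's actual model differs.

```python
# Hypothetical sketch: genre separation via a shared and a private encoder,
# with gradient reversal driving the adversarial genre objective.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad   # flipped gradient makes the encoder fool the genre critic

class GenreSeparationNet(nn.Module):
    def __init__(self, dim=256, n_relations=10, n_genres=3):
        super().__init__()
        self.shared = nn.Linear(dim, dim)    # genre-agnostic encoder
        self.private = nn.Linear(dim, dim)   # genre-specific encoder
        self.rel_clf = nn.Linear(dim, n_relations)
        self.genre_clf = nn.Linear(dim, n_genres)

    def forward(self, x):
        h_shared = torch.relu(self.shared(x))
        h_private = torch.relu(self.private(x))   # kept for auxiliary losses
        rel_logits = self.rel_clf(h_shared)        # relation classifier on shared features
        genre_logits = self.genre_clf(GradReverse.apply(h_shared))  # adversarial branch
        return rel_logits, genre_logits, h_private
```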
2016
CSE: Conceptual Sentence Embeddings based on Attention Model
Yashen Wang | Heyan Huang | Chong Feng | Qiang Zhou | Jiahui Gu | Xiong Gao
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
2015
Topic-Based Chinese Message Polarity Classification System at SIGHAN8-Task2
Chun Liao | Chong Feng | Sen Yang | Heyan Huang
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing
2012
Emotional Tendency Identification for Micro-blog Topics Based on Multiple Characteristics
Quanchao Liu | Chong Feng | Heyan Huang
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation