Xiaocheng Feng - ACL Anthology

Xiaocheng Feng

Also published as: 骁骋冯

2025

Large language models (LLMs) are known to suffer from severe hallucination issues. One of the main causes lies in the knowledge misalignment between the pre-training stage and the supervised fine-tuning stage. The unfamiliar knowledge encountered during fine-tuning may encourage LLMs to generate facts that are not grounded in parametric knowledge. To address this, we propose Seal, a novel training objective with an abstention mechanism, in which the model learns to selectively reject tokens that misalign with the desired knowledge distribution via a special [REJ] token. This allows the model the option of acknowledging the insufficiency of knowledge rather than blindly assigning high probability to all ground-truth answers. We further propose a regularized decoding objective that penalizes uncertain predictions during inference by using the [REJ] probability learned during training. Extensive experiments on six short-form and long-form QA datasets with three LLMs of different sizes demonstrate that our method effectively alleviates hallucinations caused by knowledge misalignment. Further analysis highlights the adaptations of our method in answer refusal scenarios and its ability to effectively maintain the model’s instruction-following capabilities.

CC-Tuning: A Cross-Lingual Connection Mechanism for Improving Joint Multilingual Supervised Fine-Tuning
Yangfan Ye | Xiaocheng Feng | Zekun Yuan | Xiachong Feng | Libo Qin | Lei Huang | Weitao Ma | Yichong Huang | Zhirui Zhang | Yunfei Lu | Xiaohui Yan | Duyu Tang | Dandan Tu | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Current large language models (LLMs) often exhibit imbalanced multilingual capabilities due to their English-centric training corpora. To address this, existing fine-tuning approaches operating at the data-level (e.g., through data augmentation or distillation) typically introduce implicit cross-lingual alignment, overlooking the potential for more profound, latent-level cross-lingual interactions. In this work, we propose CC-Tuning, a novel multilingual fine-tuning paradigm that explicitly establishes a cross-lingual connection mechanism at the latent level. During training, CC-Tuning fuses the feed forward activations from both English and non-English inputs, enabling the model to benefit from both linguistic resources. This process is facilitated with a trainable Decision Maker that identifies beneficial activations. Furthermore, during inference, a Transform Matrix is utilized to simulate the cross-lingual connection under monolingual setting through representation transformation. Our experiments on six benchmarks covering 22 languages show that CC-Tuning outperforms vanilla SFT and offers a strong latent-level alternative to data-level augmentation methods. Further analysis also highlights the practicality of CC-Tuning and the potential of latent-level cross-lingual interactions in advancing the multilingual performance of LLMs.

SLAM: Towards Efficient Multilingual Reasoning via Selective Language Alignment
Yuchun Fan | Yongyu Mu | YiLin Wang | Lei Huang | Junhao Ruan | Bei Li | Tong Xiao | Shujian Huang | Xiaocheng Feng | Jingbo Zhu
Proceedings of the 31st International Conference on Computational Linguistics

Despite the significant improvements achieved by large language models (LLMs) in English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter and two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational resource computing and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our findings that the representation learning of languages is merely conducted in lower-level layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, only tunes 6 layers’ feed-forward sub-layers including 6.5-8% of all parameters within 7B and 13B LLMs, achieving superior average performance than all strong baselines across 10 languages. Meanwhile, SLAM only involves one training stage, reducing training time by 4.1-11.9× compared to the two-stage method.

Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM-Guided Multi-Aspect Clustering
Kun Zhu | Lizi Liao | Yuxuan Gu | Lei Huang | Xiaocheng Feng | Bing Qin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

The rapid growth of scientific literature demands efficient methods to organize and synthesize research findings. Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models (LLMs), often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering. Our method leverages LLMs to identify key aspects of each paper (e.g., methodology, dataset, evaluation) and generates aspect-specific paper summaries, which are then encoded and clustered along each aspect to form a coherent hierarchy. In addition, we introduce a new evaluation benchmark of 156 expert-crafted taxonomies encompassing 11.6k papers, providing the first naturally annotated dataset for this task. Experimental results demonstrate that our method significantly outperforms prior approaches, achieving state-of-the-art performance in taxonomy coherence, granularity, and interpretability.

Unveiling Entity-Level Unlearning for Large Language Models: A Comprehensive Analysis
Weitao Ma | Xiaocheng Feng | Weihong Zhong | Lei Huang | Yangfan Ye | Xiachong Feng | Bing Qin
Proceedings of the 31st International Conference on Computational Linguistics

Large language model unlearning has garnered increasing attention due to its potential to address security and privacy concerns, leading to extensive research in the field. However, existing studies have predominantly focused on instance-level unlearning, specifically targeting the removal of predefined instances containing sensitive content. This focus has left a gap in the exploration of removing an entire entity, which is critical in real-world scenarios such as copyright protection. To close this gap, we propose a novel task named Entity-level unlearning, which aims to erase entity-related knowledge from the target model completely. To investigate this task, we systematically evaluate popular unlearning algorithms, revealing that current methods struggle to achieve effective entity-level unlearning. Then, we further explore the factors that influence the performance of unlearning algorithms, identifying that the knowledge coverage of the forget set and its size play pivotal roles. Notably, our analysis also uncovers that entities introduced through fine-tuning are more vulnerable than pre-trained entities during unlearning. We hope these findings can inspire future improvements in entity-level unlearning for LLMs.

One for All: Update Parameterized Knowledge Across Multiple Models with Once Edit
Weitao Ma | Xiyuan Du | Xiaocheng Feng | Lei Huang | Yichong Huang | Huiyi Zhang | Xiaoliang Yang | Baohang Li | Xiachong Feng | Ting Liu | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) encode vast world knowledge but struggle to stay up-to-date, often leading to errors and hallucinations. Knowledge editing offers an efficient alternative to retraining, enabling targeted modifications by updating specific model parameters. However, existing methods primarily focus on individual models, posing challenges in efficiently updating multiple models and adapting to new models. To address this, we propose OnceEdit, a novel ensemble-based approach that employs a plug-in model as the editing module, enabling stable knowledge updates across multiple models. Building on the model ensemble, OnceEdit introduces two key mechanisms to enhance its effectiveness. First, we introduce a dynamic weight mechanism through a weight token for distinguishing between edit-related and non-edit-related instances, ensuring the appropriate utilization of knowledge from integrated models. Second, we incorporate an ensemble enhancement mechanism to mitigate the excessive reliance on the central model inherent in the model ensemble technique, making it more suitable for knowledge editing. Extensive experiments on diverse LLMs demonstrate that OnceEdit consistently outperforms existing methods while achieving superior editing efficiency. Further analysis confirms its adaptability and stability in multi-model editing scenarios.

Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research.

FroM: Frobenius Norm-Based Data-Free Adaptive Model Merging
Zijian Li | Xiaocheng Feng | Huixin Liu | Yichong Huang | Ting Liu | Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2025

With the development of large language models, fine-tuning has emerged as an effective method to enhance performance in specific scenarios by injecting domain-specific knowledge. In this context, model merging techniques provide a solution for fusing knowledge from multiple fine-tuning models by combining their parameters. However, traditional methods often encounter task interference when merging full fine-tuning models, and this problem becomes even more evident in parameter-efficient fine-tuning scenarios. In this paper, we introduce an improvement to the RegMean method, which indirectly leverages the training data to approximate the outputs of the linear layers before and after merging. We propose an adaptive merging method called FroM, which directly measures the model parameters using the Frobenius norm, without any training data. By introducing an additional hyperparameter for control, FroM outperforms baseline methods across various fine-tuning scenarios, alleviating the task interference problem.

Length Controlled Generation for Black-box LLMs
Yuxuan Gu | Wenjie Wang | Xiaocheng Feng | Weihong Zhong | Kun Zhu | Lei Huang | Ting Liu | Bing Qin | Tat-Seng Chua
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large language models (LLMs) have demonstrated impressive instruction following capabilities, while still struggling to accurately manage the length of the generated text, which is a fundamental requirement in many real-world applications. Existing length control methods involve fine-tuning the parameters of LLMs, which is inefficient and suboptimal for practical use. In this paper, we propose a novel iterative sampling framework for text length control, integrating the Metropolis-Hastings algorithm with an importance sampling acceleration strategy. This framework efficiently and reliably regulates LLMs to generate length-constrained text without modifying the underlying parameters, thereby preserving the original capabilities of LLMs. Experimental results demonstrate that our framework achieves almost 100% success rates of length control on Llama3.1 for tasks such as length-controlled abstractive summarization and length-constrained instruction following, with minimal additional computational overhead. This also highlights the significant potential of our method for precise length control across a broader range of applications, without compromising the versatility of LLMs.

Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models
Yuchun Fan | Yilin Wang | Yongyu Mu | Lei Huang | Bei Li | Xiaocheng Feng | Tong Xiao | JingBo Zhu
Findings of the Association for Computational Linguistics: EMNLP 2025

Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MMBench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST facilitates the language-specific visual information engagement in shallow layers.

Improving Contextual Faithfulness of Large Language Models via Retrieval Heads-Induced Optimization
Lei Huang | Xiaocheng Feng | Weitao Ma | Yuchun Fan | Xiachong Feng | Yangfan Ye | Weihong Zhong | Yuxuan Gu | Baoxin Wang | Dayong Wu | Guoping Hu | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Ensuring contextual faithfulness in retrieval-augmented large language models (LLMs) is crucial for building trustworthy information-seeking systems, particularly in long-form question-answering (LFQA) scenarios. In this work, we identify a salient correlation between LFQA faithfulness and retrieval heads, a set of attention heads responsible for retrieving contextual information. Leveraging this insight, we propose RHIO, a framework designed to teach LLMs to explicitly discriminate between faithful and unfaithful generations. RHIO first augments unfaithful samples that simulate realistic model-intrinsic errors by selectively masking retrieval heads. Then, these samples are incorporated into joint training, enabling the model to distinguish unfaithful outputs from faithful ones conditioned on control tokens. Furthermore, these control tokens are leveraged to self-induce contrastive outputs, amplifying their difference through contrastive decoding. Additionally, to facilitate the evaluation of contextual faithfulness, we also introduce GroundBench, a comprehensive benchmark compiled from five existing LFQA datasets. Extensive experimental results on GroundBench demonstrate that RHIO significantly improves faithfulness, even outperforming GPT-4o.

CLAIM: Mitigating Multilingual Object Hallucination in Large Vision-Language Models with Cross-Lingual Attention Intervention
Zekai Ye | Qiming Li | Xiaocheng Feng | Libo Qin | Yichong Huang | Baohang Li | Kui Jiang | Yang Xiang | Zhirui Zhang | Yunfei Lu | Duyu Tang | Dandan Tu | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Vision-Language Models (LVLMs) have demonstrated impressive multimodal abilities but remain prone to multilingual object hallucination, with a higher likelihood of generating responses inconsistent with the visual input when utilizing queries in non-English languages compared to English. Most existing approaches to address these rely on pretraining or fine-tuning, which are resource-intensive. In this paper, inspired by observing the disparities in cross-modal attention patterns across languages, we propose Cross-Lingual Attention Intervention for Mitigating multilingual object hallucination (CLAIM) in LVLMs, a novel near training-free method by aligning attention patterns. CLAIM first identifies language-specific cross-modal attention heads, then estimates language shift vectors from English to the target language, and finally intervenes in the attention outputs during inference to facilitate cross-lingual visual perception capability alignment. Extensive experiments demonstrate that CLAIM achieves an average improvement of 13.56% (up to 30% in Spanish) on the POPE and 21.75% on the hallucination subsets of the MME benchmark across various languages. Further analysis reveals that multilingual attention divergence is most prominent in intermediate layers, highlighting their critical role in multilingual scenarios.

Probing and Boosting Large Language Models Capabilities via Attention Heads
Dezhi Zhao | Xin Liu | Xiaocheng Feng | Hui Wang | Bing Qin
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Understanding the internal origins of capabilities in large language models (LLMs) is crucial for interpretability and efficient adaptation. However, the emergence of specific capabilities remains poorly understood, as most existing approaches rely on external signals (e.g., performance shifts or gradient similarities) with limited structural grounding. To address these issues, this paper proposes a lightweight and highly interpretable approach that links LLM capabilities to internal components by identifying correspondences at the level of attention heads. Specifically, we first define five fundamental capabilities, namely Mathematical Reasoning, Reading Comprehension, Commonsense Reasoning, Scientific Reasoning, and Professional Expertise, and employ probing techniques to detect the attention heads most predictive of each, thereby establishing capability–head mappings. For targeted instruction tuning, complex tasks are decomposed into these fundamental capabilities, and training data are selected accordingly. Experiments on LLaMA3.1-8B and Qwen2.5-7B show over 70% discrimination accuracy in identifying capabilities. On MMLU and BBH, our method improves accuracy by 1 to 1.5 points over the gradient-based method LESS and by 5 to 6 points over other intermediate-state baselines.

2024

GlobeSumm: A Challenging Benchmark Towards Unifying Multi-lingual, Cross-lingual and Multi-document News Summarization
Yangfan Ye | Xiachong Feng | Xiaocheng Feng | Weitao Ma | Libo Qin | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

News summarization in today’s global scene can be daunting with its flood of multilingual content and varied viewpoints from different sources. However, current studies often neglect such real-world scenarios as they tend to focus solely on either single-language or single-document tasks. To bridge this gap, we aim to unify Multi-lingual, Cross-lingual and Multi-document Summarization into a novel task, i.e., MCMS, which encapsulates the real-world requirements all-in-one. Nevertheless, the lack of a benchmark inhibits researchers from adequately studying this invaluable problem. To tackle this, we have meticulously constructed the GLOBESUMM dataset by first collecting a wealth of multilingual news reports and restructuring them into event-centric format. Additionally, we introduce the method of protocol-guided prompting for high-quality and cost-effective reference annotation. In MCMS, we also highlight the challenge of conflicts between news reports, in addition to the issues of redundancies and omissions, further enhancing the complexity of GLOBESUMM. Through extensive experimental analysis, we validate the quality of our dataset and elucidate the inherent challenges of the task. We firmly believe that GLOBESUMM, given its challenging nature, will greatly contribute to the multilingual communities and the evaluation of LLMs.

Advancing Large Language Model Attribution through Self-Improving
Lei Huang | Xiaocheng Feng | Weitao Ma | Liang Zhao | Yuchun Fan | Weihong Zhong | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Teaching large language models (LLMs) to generate text with citations to evidence sources can mitigate hallucinations and enhance verifiability in information-seeking systems. However, improving this capability requires high-quality attribution data, which is costly and labor-intensive. Inspired by recent advances in self-improvement that enhance LLMs without manual annotation, we present START, a Self-Taught AttRibuTion framework for iteratively improving the attribution capability of LLMs. First, to prevent models from stagnating due to initially insufficient supervision signals, START leverages the model to self-construct synthetic training data for warming up. To further self-improve the model’s attribution ability, START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average without relying on human annotations and more advanced models. Further analysis reveals that START excels in aggregating information across multiple sources.

Learning Fine-Grained Grounded Citations for Attributed Large Language Models
Lei Huang | Xiaocheng Feng | Weitao Ma | Yuxuan Gu | Weihong Zhong | Xiachong Feng | Weijiang Yu | Weihua Peng | Duyu Tang | Dandan Tu | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2024

Despite the impressive performance on information-seeking tasks, large language models (LLMs) still struggle with hallucinations. Attributed LLMs, which augment generated text with in-line citations, demonstrate potential in mitigating hallucinations and improving verifiability. However, current approaches suffer from suboptimal citation quality due to their reliance on in-context learning. Furthermore, the practice of merely citing document identifiers complicates the process for users to pinpoint specific supporting evidence. In this work, we introduce FRONT, a training framework that teaches LLMs to generate Fine-grained grounded citations. By initially grounding fine-grained supporting quotes, which then guide the generation process, these quotes not only provide supervision signals to improve citation quality but also serve as fine-grained attributions. Experiments on the ALCE benchmark demonstrate the efficacy of FRONT in generating superior grounded responses and highly supportive citations. With LLaMA-2-7B, the framework significantly outperforms all the baselines, achieving an average of 14.21% improvement in citation quality across all datasets, even surpassing ChatGPT.

Aligning Translation-Specific Understanding to General Understanding in Large Language Models
Yichong Huang | Baohang Li | Xiaocheng Feng | Wenshuai Huo | Chengpeng Fu | Ting Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Large Language models (LLMs) have exhibited remarkable abilities in understanding complex texts, offering a promising path towards human-like translation performance. However, this study reveals the misalignment between the translation-specific understanding and the general understanding inside LLMs. This understanding misalignment leads to LLMs mistakenly or literally translating some complicated concepts that they accurately comprehend in the general scenarios (e.g., QA). To align the translation-specific understanding to the general one, we propose a novel translation process, DUAT (Difficult words Understanding Aligned Translation), explicitly incorporating the general understanding on the complicated content incurring inconsistent understandings to guide the translation. Specifically, DUAT performs cross-lingual interpretation for the difficult-to-translate words and enhances the translation with the generated interpretations. Furthermore, we reframe the external tools to improve DUAT in detecting difficult words and generating helpful interpretations. We conduct experiments on the self-constructed benchmark Challenge-WMT, consisting of samples that are prone to mistranslation. Human evaluation results on high-resource and low-resource language pairs indicate that DUAT significantly facilitates the understanding alignment, which improves the translation quality (up to +3.85 COMET) and reduces translation literalness by -25% ∼ -51%.

Gradient Consistency-based Parameter Allocation for Multilingual Neural Machine Translation
Wenshuai Huo | Xiaocheng Feng | Yichong Huang | Chengpeng Fu | Hui Wang | Bing Qin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Multilingual neural machine translation handles the translation of multiple languages with one unified model. However, this joint-training paradigm incurs the notorious issue of parameter interference, where the model compromises with the language diversity to find a common solution. Recent research has explored avoiding this problem by selecting certain parameters for each language direction from the original model to form language-specific sub-networks. However, determining how many parameters to choose and which parameters to select is still a serious challenge. In this work, we propose an approach called CaPA (Consistency-based Parameter Allocation), which dynamically allocates parameters of appropriate scale to each language direction based on the consistency between the gradient of the individual language and the average gradient. Specifically, CaPA allocates more parameters to languages with higher gradient consistency as these languages tend to have a more positive impact on other languages. Furthermore, considering the varying levels of interference across different parts of the model, we propose an adaptive parameter allocation based on module-level gradient consistency. Experimental results show the correlation between gradient consistency and parameter interference, as well as the effectiveness of our proposed method.

SCIR-MT’s Submission for WMT24 General Machine Translation Task
Baohang Li | Zekai Ye | Yichong Huang | Xiaocheng Feng | Bing Qin
Proceedings of the Ninth Conference on Machine Translation

This paper introduces the submission of SCIR research center of Harbin Institute of Technology participating in the WMT24 machine translation evaluation task of constrained track for English to Czech. Our approach involved a rigorous process of cleaning and deduplicating both monolingual and bilingual data, followed by a three-stage model training recipe. During the testing phase, we used the beam serach decoding method to generate a large number of candidate translations. Furthermore, we employed COMET-MBR decoding to identify optimal translations.

浅谈大模型时代下的检索增强:发展趋势、挑战与展望(Enhancing Large Language Models with Retrieval-Augmented Techniques: Trends, Challenges, and Prospects)
Zhangyin Feng (冯掌印) | Kun Zhu (朱坤) | Weitao Ma (马伟涛) | Lei Huang (黄磊) | Bing Qin (秦兵) | Ting Liu (刘挺) | Xiaocheng Feng (冯骁骋)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum)

“大型语言模型(LLM) 在各种自然语言任务上表现出了卓越的性能,但它们很容易受到过时数据和特定领域限制的影响。为了应对这些挑战,研究人员整合不同来源的外部信息来增强大语言模型,具体方法如检索增强等。在本文中,我们综合讨论了检索增强技术的发展趋势,包括检索时机规划、检索技术、以及检索结果的利用。此外,我们介绍了当前可用于检索增强任务的数据集和评价方法,并指出了应用和潜在研究方向。我们希望这项综述能够为社区提供对该研究领域的快速了解和全面概述,以启发未来的研究工作。”

Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
Weihong Zhong | Xiaocheng Feng | Liang Zhao | Qiming Li | Lei Huang | Yuxuan Gu | Weitao Ma | Yuan Xu | Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Though advanced in understanding visual information with human languages, Large Vision-Language Models (LVLMs) still suffer from multimodal hallucinations. A natural concern is that during multimodal interaction, the generated hallucinations could influence the LVLMs’ subsequent generation. Thus, we raise a question: When presented with a query relevant to the previously generated hallucination, will LVLMs be misled and respond incorrectly, even though the ground visual information exists? To answer this, we propose a framework called \\textitMMHalSnowball to evaluate LVLMs’ behaviors when encountering generated hallucinations, where LVLMs are required to answer specific visual questions within a curated hallucinatory conversation. Crucially, our experiment shows that the performance of open-source LVLMs drops by at least 31\\%, indicating that LVLMs are prone to accept the generated hallucinations and make false claims that they would not have supported without distractions. We term this Multimodal Hallucination Snowballing. To mitigate this issue, we further propose a training-free method called Residual Visual Decoding, where we revise the output distribution of LVLMs with the one derived from the residual visual input, providing models with direct access to the visual information. Experiments show that our method can mitigate more than 24\\% of the snowballed multimodal hallucination while maintaining capabilities.

Extending Context Window of Large Language Models from a Distributional Perspective
Yingsheng Wu | Yuxuan Gu | Xiaocheng Feng | Weihong Zhong | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large language models (LLMs). However, existing scaling methods often rely on empirical approaches and lack a profound understanding of the internal distribution within RoPE, resulting in suboptimal performance in extending the context window length. In this paper, we propose to optimize the context window extending task from the view of rotary angle distribution. Specifically, we first estimate the distribution of the rotary angles within the model and analyze the extent to which length extension perturbs this distribution. Then, we present a novel extension strategy that minimizes the disturbance between rotary angle distributions to maintain consistency with the pre-training phase, enhancing the model’s capability to generalize to longer sequences. Experimental results compared to the strong baseline methods demonstrate that our approach reduces by up to 72% of the distributional disturbance when extending LLaMA2’s context window to 8k, and reduces by up to 32% when extending to 16k. On the LongBench-E benchmark, our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods. Furthermore, Our method maintains the model’s performance on the Hugging Face Open LLM benchmark after context window extension, with only an average performance fluctuation ranging from -0.12 to +0.22.

Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding
Liang Zhao | Xiachong Feng | Xiaocheng Feng | Weihong Zhong | Dongliang Xu | Qing Yang | Hongtao Liu | Bing Qin | Ting Liu
Findings of the Association for Computational Linguistics: EMNLP 2024

Built upon the Transformer, large language models (LLMs) have captured worldwide attention due to their remarkable abilities. Nevertheless, all Transformer-based models including LLMs suffer from a preset length limit and can hardly generalize from short training sequences to longer inference ones, namely, they can not perform **length extrapolation** to handle long sequences. Thus, numerous methods have emerged to enhance the length extrapolation of Transformers. Despite the great research efforts, a systematic survey is still lacking. To fill this gap, we delve into these advances in a unified notation from the perspective of positional encoding (PE), as it has been considered the primary factor on length extrapolation. Specifically, we begin with extrapolatable PEs that have dominated this research field. Then, we dive into extrapolation methods based on them, covering position interpolation and randomized position methods. Finally, several challenges and future directions in this area are highlighted. Through this survey, We aim to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.

An Information Bottleneck Perspective for Effective Noise Filtering on Retrieval-Augmented Generation
Kun Zhu | Xiaocheng Feng | Xiyuan Du | Yuxuan Gu | Weijiang Yu | Haotian Wang | Qianglong Chen | Zheng Chu | Jingchang Chen | Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Retrieval-augmented generation integrates the capabilities of large language models with relevant information retrieved from an extensive corpus, yet encounters challenges when confronted with real-world noisy data. One recent solution is to train a filter module to find relevant content but only achieve suboptimal noise compression. In this paper, we propose to introduce the information bottleneck theory into retrieval-augmented generation. Our approach involves the filtration of noise by simultaneously maximizing the mutual information between compression and ground output, while minimizing the mutual information between compression and retrieved passage. In addition, we derive the formula of information bottleneck to facilitate its application in novel comprehensive evaluations, the selection of supervised fine-tuning data, and the construction of reinforcement learning rewards. Experimental results demonstrate that our approach achieves significant improvements across various question answering datasets, not only in terms of the correctness of answer generation but also in the conciseness with 2.5% compression rate.

2023

Towards Higher Pareto Frontier in Multilingual Machine Translation
Yichong Huang | Xiaocheng Feng | Xinwei Geng | Baohang Li | Bing Qin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multilingual neural machine translation has witnessed remarkable progress in recent years. However, the long-tailed distribution of multilingual corpora poses a challenge of Pareto optimization, i.e., optimizing for some languages may come at the cost of degrading the performance of others. Existing balancing training strategies are equivalent to a series of Pareto optimal solutions, which trade off on a Pareto frontierIn Pareto optimization, Pareto optimal solutions refer to solutions in which none of the objectives can be improved without sacrificing at least one of the other objectives. The set of all Pareto optimal solutions forms a Pareto frontier..In this work, we propose a new training framework, Pareto Mutual Distillation (Pareto-MD), towards pushing the Pareto frontier outwards rather than making trade-offs. Specifically, Pareto-MD collaboratively trains two Pareto optimal solutions that favor different languages and allows them to learn from the strengths of each other via knowledge distillation. Furthermore, we introduce a novel strategy to enable stronger communication between Pareto optimal solutions and broaden the applicability of our approach. Experimental results on the widely-used WMT and TED datasets show that our method significantly pushes the Pareto frontier and outperforms baselines by up to +2.46 BLEUOur code will be released upon acceptance..

Enabling Unsupervised Neural Machine Translation with Word-level Visual Representations
Chengpeng Fu | Xiaocheng Feng | Yichong Huang | Wenshuai Huo | Hui Wang | Bing Qin | Ting Liu
Findings of the Association for Computational Linguistics: EMNLP 2023

Unsupervised neural machine translation has recently made remarkable strides, achieving impressive results with the exclusive use of monolingual corpora. Nonetheless, these methods still exhibit fundamental flaws, such as confusing similar words. A straightforward remedy to rectify this drawback is to employ bilingual dictionaries, however, high-quality bilingual dictionaries can be costly to obtain. To overcome this limitation, we propose a method that incorporates images at the word level to augment the lexical mappings. Specifically, our method inserts visual representations into the model, modifying the corresponding embedding layer information. Besides, a visible matrix is adopted to isolate the impact of images on other unrelated words. Experiments on the Multi30k dataset with over 300,000 self-collected images validate the effectiveness in generating more accurate word translation, achieving an improvement of up to +2.81 BLEU score, which is comparable or even superior to using bilingual dictionaries.

Controllable Text Generation via Probability Density Estimation in the Latent Space
Yuxuan Gu | Xiaocheng Feng | Sicheng Ma | Lingyuan Zhang | Heng Gong | Weihong Zhong | Bing Qin
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Previous work on controllable text generation has explored the idea of control from the latent space, such as optimizing a representation with attribute-specific classifiers or sampling one from relevant discrete samples. However, they cannot effectively model a complex space with diverse attributes, high dimensionality, and asymmetric structure, leaving subsequent controls unsatisfying. In this work, we propose a novel control framework using probability density estimation in the latent space. Our method utilizes an invertible transformation function, the Normalizing Flow, that maps the complex distributions in the latent space to simple Gaussian distributions in the prior space. Thus, we can perform sophisticated and flexible controls in the prior space and feed the control effects back into the latent space owing to the bijection property of invertible transformations. Experiments on single-attribute and multi-attribute control reveal that our method outperforms several strong baselines on attribute relevance and text quality, achieving a new SOTA. Further analysis of control strength adjustment demonstrates the flexibility of our control strategy.

Hierarchical Catalogue Generation for Literature Review: A Benchmark
Kun Zhu | Xiaocheng Feng | Xiachong Feng | Yingsheng Wu | Bing Qin
Findings of the Association for Computational Linguistics: EMNLP 2023

Scientific literature review generation aims to extract and organize important information from an abundant collection of reference papers and produces corresponding reviews while lacking a clear and logical hierarchy. We observe that a high-quality catalogue-guided generation process can effectively alleviate this problem. Therefore, we present an atomic and challenging task named Hierarchical Catalogue Generation for Literature Review as the first step for review generation, which aims to produce a hierarchical catalogue of a review paper given various references. We construct a novel English Hierarchical Catalogues of Literature Reviews Dataset with 7.6k literature review catalogues and 389k reference papers. To accurately assess the model performance, we design two evaluation metrics for informativeness and similarity to ground truth from semantics and structure. Our extensive analyses verify the high quality of our dataset and the effectiveness of our evaluation metrics. We further benchmark diverse experiments on state-of-the-art summarization models like BART and large language models like ChatGPT to evaluate their capabilities. We further discuss potential directions for this task to motivate future research.

Improved Visual Story Generation with Adaptive Context Modeling
Zhangyin Feng | Yuchen Ren | Xinmiao Yu | Xiaocheng Feng | Duyu Tang | Shuming Shi | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2023

Diffusion models developed on top of powerful text-to-image generation models like Stable Diffusion achieve remarkable success in visual story generation. However, the best-performing approach considers historically generated results as flattened memory cells, ignoring the fact that not all preceding images contribute equally to the generation of the characters and scenes at the current stage. To address this, we present a simple method that improves the leading system with adaptive context modeling, which is not only incorporated in the encoder but also adopted as additional guidance in the sampling stage to boost the global consistency of the generated story. We evaluate our model on PororoSV and FlintstonesSV datasets and show that our approach achieves state-of-the-art FID scores on both story visualization and continuation scenarios. We conduct detailed model analysis and show that our model excels at generating semantically consistent images for stories.

2022

MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization
Xiachong Feng | Xiaocheng Feng | Bing Qin
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

Dialogue summarization helps users capture salient information from various types of dialogues has received much attention recently. However, current works mainly focus on English dialogue summarization, leaving other languages less well explored. Therefore, we present a multi-lingual dialogue summarization dataset, namely MSAMSum, which covers dialogue-summary pairs in six languages. Specifically, we derive MSAMSum from the standard SAMSum using sophisticated translation techniques and further employ two methods to ensure the integral translation quality and summary factual consistency. Given the proposed MSAMum, we systematically set up five multi-lingual settings for this task, including a novel mix-lingual dialogue summarization setting. To illustrate the utility of our dataset, we benchmark various experiments with pre-trained models under different settings and report results in both supervised and zero-shot manners. We also discuss some future works towards this task to motivate future researches.

Improving Controllable Text Generation with Position-Aware Weighted Decoding
Yuxuan Gu | Xiaocheng Feng | Sicheng Ma | Jiaming Wu | Heng Gong | Bing Qin
Findings of the Association for Computational Linguistics: ACL 2022

Weighted decoding methods composed of the pretrained language model (LM) and the controller have achieved promising results for controllable text generation. However, these models often suffer from a control strength/fluency trade-off problem as higher control strength is more likely to generate incoherent and repetitive text. In this paper, we illustrate this trade-off is arisen by the controller imposing the target attribute on the LM at improper positions. And we propose a novel framework based on existing weighted decoding methods called CAT-PAW, which introduces a lightweight regulator to adjust bias signals from the controller at different decoding positions. Experiments on positive sentiment control, topic control, and language detoxification show the effectiveness of our CAT-PAW upon 4 SOTA models.

Unifying the Convergences in Multilingual Neural Machine Translation
Yichong Huang | Xiaocheng Feng | Xinwei Geng | Bing Qin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Although all-in-one-model multilingual neural machine translation (MNMT) has achieved remarkable progress, the convergence inconsistency in the joint training is ignored, i.e., different language pairs reaching convergence in different epochs. This leads to the trained MNMT model over-fitting low-resource language translations while under-fitting high-resource ones. In this paper, we propose a novel training strategy named LSSD (LanguageSpecific Self-Distillation), which can alleviate the convergence inconsistency and help MNMT models achieve the best performance on each language pair simultaneously. Specifically, LSSD picks up language-specific best checkpoints for each language pair to teach the current model on the fly. Furthermore, we systematically explore three sample-level manipulations of knowledge transferring. Experimental results on three datasets show that LSSD obtains consistent improvements towards all language pairs and achieves the state-of-the-art.

A Distributional Lens for Multi-Aspect Controllable Text Generation
Yuxuan Gu | Xiaocheng Feng | Sicheng Ma | Lingyuan Zhang | Heng Gong | Bing Qin
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Multi-aspect controllable text generation is a more challenging and practical task than single-aspect control. Existing methods achieve complex multi-aspect control by fusing multiple controllers learned from single-aspect, but suffer from attribute degeneration caused by the mutual interference of these controllers. To address this, we provide observations on attribute fusion from a distributional perspective and propose to directly search for the intersection areas of multiple attribute distributions as their combination for generation. Our method first estimates the attribute space with an autoencoder structure. Afterward, we iteratively approach the intersections by jointly minimizing distances to points representing different attributes. Finally, we map them to attribute-relevant sentences with a prefix-tuning-based decoder. Experiments on the three-aspect control task, including sentiment, topic, and detoxification aspects, reveal that our method outperforms several strong baselines on attribute relevance and text quality and achieves the SOTA. Further analysis also supplies some explanatory support for the effectiveness of our approach.

2021

Learning to Rewrite for Non-Autoregressive Neural Machine Translation
Xinwei Geng | Xiaocheng Feng | Bing Qin
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Non-autoregressive neural machine translation, which decomposes the dependence on previous target tokens from the inputs of the decoder, has achieved impressive inference speedup but at the cost of inferior accuracy. Previous works employ iterative decoding to improve the translation by applying multiple refinement iterations. However, a serious drawback is that these approaches expose the serious weakness in recognizing the erroneous translation pieces. In this paper, we propose an architecture named RewriteNAT to explicitly learn to rewrite the erroneous translation pieces. Specifically, RewriteNAT utilizes a locator module to locate the erroneous ones, which are then revised into the correct ones by a revisor module. Towards keeping the consistency of data distribution with iterative decoding, an iterative training strategy is employed to further improve the capacity of rewriting. Extensive experiments conducted on several widely-used benchmarks show that RewriteNAT can achieve better performance while significantly reducing decoding time, compared with previous iterative decoding strategies. In particular, RewriteNAT can obtain competitive results with autoregressive translation on WMT14 En-De, En-Fr and WMT16 Ro-En translation benchmarks.

Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization
Xiachong Feng | Xiaocheng Feng | Libo Qin | Bing Qin | Ting Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Current dialogue summarization systems usually encode the text with a number of general semantic features (e.g., keywords and topics) to gain more powerful dialogue modeling capabilities. However, these features are obtained via open-domain toolkits that are dialog-agnostic or heavily relied on human annotations. In this paper, we show how DialoGPT, a pre-trained model for conversational response generation, can be developed as an unsupervised dialogue annotator, which takes advantage of dialogue background knowledge encoded in DialoGPT. We apply DialoGPT to label three types of features on two dialogue summarization datasets, SAMSum and AMI, and employ pre-trained and non pre-trained models as our summarizers. Experimental results show that our proposed method can obtain remarkable improvements on both datasets and achieves new state-of-the-art performance on the SAMSum dataset.

2020

CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Zhangyin Feng | Daya Guo | Duyu Tang | Nan Duan | Xiaocheng Feng | Ming Gong | Linjun Shou | Bing Qin | Ting Liu | Daxin Jiang | Ming Zhou
Findings of the Association for Computational Linguistics: EMNLP 2020

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search, code documentation generation, etc. We develop CodeBERT with Transformer-based neural architecture, and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators. This enables us to utilize both “bimodal” data of NL-PL pairs and “unimodal data, where the former provides input tokens for model training while the latter helps to learn better generators. We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters. Results show that CodeBERT achieves state-of-the-art performance on both natural language code search and code documentation generation. Furthermore, to investigate what type of knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and evaluate in a zero-shot setting where parameters of pre-trained models are fixed. Results show that CodeBERT performs better than previous pre-trained models on NLPL probing.

Enhancing Content Planning for Table-to-Text Generation with Data Understanding and Verification
Heng Gong | Wei Bi | Xiaocheng Feng | Bing Qin | Xiaojiang Liu | Ting Liu
Findings of the Association for Computational Linguistics: EMNLP 2020

Neural table-to-text models, which select and order salient data, as well as verbalizing them fluently via surface realization, have achieved promising progress. Based on results from previous work, the performance bottleneck of current models lies in the stage of content planing (selecting and ordering salient content from the input). That is, performance drops drastically when an oracle content plan is replaced by a model-inferred one during surface realization. In this paper, we propose to enhance neural content planning by (1) understanding data values with contextual numerical value representations that bring the sense of value comparison into content planning; (2) verifying the importance and ordering of the selected sequence of records with policy gradient. We evaluated our model on ROTOWIRE and MLB, two datasets on this task, and results show that our model outperforms existing systems with respect to content planning metrics.

TableGPT: Few-shot Table-to-Text Generation with Table Structure Reconstruction and Content Matching
Heng Gong | Yawei Sun | Xiaocheng Feng | Bing Qin | Wei Bi | Xiaojiang Liu | Ting Liu
Proceedings of the 28th International Conference on Computational Linguistics

Although neural table-to-text models have achieved remarkable progress with the help of large-scale datasets, they suffer insufficient learning problem with limited training data. Recently, pre-trained language models show potential in few-shot learning with linguistic knowledge learnt from pretraining on large-scale corpus. However, benefiting table-to-text generation in few-shot setting with the powerful pretrained language model faces three challenges, including (1) the gap between the task’s structured input and the natural language input for pretraining language model. (2) The lack of modeling for table structure and (3) improving text fidelity with less incorrect expressions that are contradicting to the table. To address aforementioned problems, we propose TableGPT for table-to-text generation. At first, we utilize table transformation module with template to rewrite structured table in natural language as input for GPT-2. In addition, we exploit multi-task learning with two auxiliary tasks that preserve table’s structural information by reconstructing the structure from GPT-2’s representation and improving the text’s fidelity with content matching task aligning the table and information in the generated text. By experimenting on Humans, Songs and Books, three few-shot table-to-text datasets in different domains, our model outperforms existing systems on most few-shot settings.

2019

Enhancing Neural Data-To-Text Generation Models with External Background Knowledge
Shuang Chen | Jinpeng Wang | Xiaocheng Feng | Feng Jiang | Bing Qin | Chin-Yew Lin
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recent neural models for data-to-text generation rely on massive parallel pairs of data and text to learn the writing knowledge. They often assume that writing knowledge can be acquired from the training data alone. However, when people are writing, they not only rely on the data but also consider related knowledge. In this paper, we enhance neural data-to-text models with external knowledge in a simple but effective way to improve the fidelity of generated text. Besides relying on parallel data and text as in previous work, our model attends to relevant external knowledge, encoded as a temporary memory, and combines this knowledge with the context representation of data before generating words. This allows the model to infer relevant facts which are not explicitly stated in the data table from an external knowledge source. Experimental results on twenty-one Wikipedia infobox-to-text datasets show our model, KBAtt, consistently improves a state-of-the-art model on most of the datasets. In addition, to quantify when and why external knowledge is effective, we design a metric, KBGain, which shows a strong correlation with the observed performance boost. This result demonstrates the relevance of external knowledge and sparseness of original data are the main factors affecting system performance.

Table-to-Text Generation with Effective Hierarchical Encoder on Three Dimensions (Row, Column and Time)
Heng Gong | Xiaocheng Feng | Bing Qin | Ting Liu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Although Seq2Seq models for table-to-text generation have achieved remarkable progress, modeling table representation in one dimension is inadequate. This is because (1) the table consists of multiple rows and columns, which means that encoding a table should not depend only on one dimensional sequence or set of records and (2) most of the tables are time series data (e.g. NBA game data, stock market data), which means that the description of the current table may be affected by its historical data. To address aforementioned problems, not only do we model each table cell considering other records in the same row, we also enrich table’s representation by modeling each table cell in context of other cells in the same column or with historical (time dimension) data respectively. In addition, we develop a table cell fusion gate to combine representations from row, column and time dimension into one dense vector according to the saliency of each dimension’s representation. We evaluated our methods on ROTOWIRE, a benchmark dataset of NBA basketball games. Both automatic and human evaluation results demonstrate the effectiveness of our model with improvement of 2.66 in BLEU over the strong baseline and outperformance of state-of-the-art model.

2018

Semantic Parsing with Syntax- and Table-Aware SQL Generation
Yibo Sun | Duyu Tang | Nan Duan | Jianshu Ji | Guihong Cao | Xiaocheng Feng | Bing Qin | Ting Liu | Ming Zhou
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a generative model to map natural language questions into SQL queries. Existing neural network based approaches typically generate a SQL query word-by-word, however, a large portion of the generated results is incorrect or not executable due to the mismatch between question words and table contents. Our approach addresses this problem by considering the structure of table and the syntax of SQL language. The quality of the generated SQL query is significantly improved through (1) learning to replicate content from column names, cells or SQL keywords; and (2) improving the generation of WHERE clause by leveraging the column-cell relation. Experiments are conducted on WikiSQL, a recently released dataset with the largest question- SQL pairs. Our approach significantly improves the state-of-the-art execution accuracy from 69.0% to 74.4%.

Adaptive Multi-pass Decoder for Neural Machine Translation
Xinwei Geng | Xiaocheng Feng | Bing Qin | Ting Liu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Although end-to-end neural machine translation (NMT) has achieved remarkable progress in the recent years, the idea of adopting multi-pass decoding mechanism into conventional NMT is not well explored. In this paper, we propose a novel architecture called adaptive multi-pass decoder, which introduces a flexible multi-pass polishing mechanism to extend the capacity of NMT via reinforcement learning. More specifically, we adopt an extra policy network to automatically choose a suitable and effective number of decoding passes, according to the complexity of source sentences and the quality of the generated translations. Extensive experiments on Chinese-English translation demonstrate the effectiveness of our proposed adaptive multi-pass decoder upon the conventional NMT with a significant improvement about 1.55 BLEU.

2016

Bitext Name Tagging for Cross-lingual Entity Annotation Projection
Dongxu Zhang | Boliang Zhang | Xiaoman Pan | Xiaocheng Feng | Heng Ji | Weiran Xu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Annotation projection is a practical method to deal with the low resource problem in incident languages (IL) processing. Previous methods on annotation projection mainly relied on word alignment results without any training process, which led to noise propagation caused by word alignment errors. In this paper, we focus on the named entity recognition (NER) task and propose a weakly-supervised framework to project entity annotations from English to IL through bitexts. Instead of directly relying on word alignment results, this framework combines advantages of rule-based methods and deep learning methods by implementing two steps: First, generates a high-confidence entity annotation set on IL side with strict searching methods; Second, uses this high-confidence set to weakly supervise the model training. The model is finally used to accomplish the projecting process. Experimental results on two low-resource ILs show that the proposed method can generate better annotations projected from English-IL parallel corpora. The performance of IL name tagger can also be improved significantly by training on the newly projected IL annotation set.

English-Chinese Knowledge Base Translation with Neural Network
Xiaocheng Feng | Duyu Tang | Bing Qin | Ting Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Knowledge base (KB) such as Freebase plays an important role for many natural language processing tasks. English knowledge base is obviously larger and of higher quality than low resource language like Chinese. To expand Chinese KB by leveraging English KB resources, an effective way is to translate English KB (source) into Chinese (target). In this direction, two major challenges are to model triple semantics and to build a robust KB translator. We address these challenges by presenting a neural network approach, which learns continuous triple representation with a gated neural network. Accordingly, source triples and target triples are mapped in the same semantic vector space. We build a new dataset for English-Chinese KB translation from Freebase, and compare with several baselines on it. Experimental results show that the proposed method improves translation accuracy compared with baseline methods. We show that adaptive composition model improves standard solution such as neural tensor network in terms of translation accuracy.

A Language-Independent Neural Network for Event Detection
Xiaocheng Feng | Lifu Huang | Duyu Tang | Heng Ji | Bing Qin | Ting Liu
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Effective LSTMs for Target-Dependent Sentiment Classification
Duyu Tang | Bing Qin | Xiaocheng Feng | Ting Liu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Target-dependent sentiment classification remains a challenge: modeling the semantic relatedness of a target with its context words in a sentence. Different context words have different influences on determining the sentiment polarity of a sentence towards the target. Therefore, it is desirable to integrate the connections between target word and context words when building a learning system. In this paper, we develop two target dependent long short-term memory (LSTM) models, where target information is automatically taken into account. We evaluate our methods on a benchmark dataset from Twitter. Empirical results show that modeling sentence representation with standard LSTM does not perform well. Incorporating target information into LSTM can significantly boost the classification accuracy. The target-dependent LSTM models achieve state-of-the-art performances without using syntactic parser or external sentiment lexicons.

Liberal Event Extraction and Event Schema Induction
Lifu Huang | Taylor Cassidy | Xiaocheng Feng | Heng Ji | Clare R. Voss | Jiawei Han | Avirup Sil
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Co-authors

Yichong Huang 10

Weihong Zhong 10

Liang Zhao (赵亮) 5

Kun Zhu (朱坤) 5

Zhangyin Feng 3

Tong Xiao (肖桐) 3

Xiaojiang Liu 2

Lingyuan Zhang 2

Taylor Cassidy 1

Qianglong Chen 1

Jingchang Chen 1

Tat-Seng Chua 1

Shujian Huang (书剑黄) 1

Feng Jiang (蒋峰) 1

Lingpeng Kong 1

Xiaoliang Yang 1

Boliang Zhang 1

Venues