Tianfu Zhang


2021

pdf bib
Enlivening Redundant Heads in Multi-head Self-attention for Machine Translation
Tianfu Zhang | Heyan Huang | Chong Feng | Longbing Cao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multi-head self-attention recently attracts enormous interest owing to its specialized functions, significant parallelizable computation, and flexible extensibility. However, very recent empirical studies show that some self-attention heads make little contribution and can be pruned as redundant heads. This work takes a novel perspective of identifying and then vitalizing redundant heads. We propose a redundant head enlivening (RHE) method to precisely identify redundant heads, and then vitalize their potential by learning syntactic relations and prior knowledge in the text without sacrificing the roles of important heads. Two novel syntax-enhanced attention (SEA) mechanisms: a dependency mask bias and a relative local-phrasal position bias, are introduced to revise self-attention distributions for syntactic enhancement in machine translation. The importance of individual heads is dynamically evaluated during the redundant heads identification, on which we apply SEA to vitalize redundant heads while maintaining the strength of important heads. Experimental results on widely adopted WMT14 and WMT16 English to German and English to Czech language machine translation validate the RHE effectiveness.

2020

pdf bib
面向司法领域的高质量开源藏汉平行语料库构建(A High-quality Open Source Tibetan-Chinese Parallel Corpus Construction of Judicial Domain)
Jiu Sha (沙九) | Luqin Zhou (周鹭琴) | Chong Feng (冯冲) | Hongzheng Li (李洪政) | Tianfu Zhang (张天夫) | Hui Hui (慧慧)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

面向司法领域的藏汉机器翻译面临严重的数据稀疏问题。本文将从两个方面展录研究:第一,相比于通用领域,司法领域的藏语要有更严谨的逻辑表达和更多的专业术语。然而,目前藏语资源在司法领域内缺乏对应的语料,稀缺专业术语词以及句法结构。第二,藏语的特殊词汇表达方式和特定句法结构使得通用语料构建方法难以构建藏汉平行语料库。为此,本文提出仺种针对司法领域藏汉平行语料的轻量级构建方法。首先,我们采取人工标注获取一个中等规模的司法领域藏汉专业术语表作为先验知识库,以避免领域越界而产生的语料逻辑表达问题和领域术语缺失问题;其次,我们从全国的地方法庭官网采集实例语料数据,例如裁判文书。我们优先寻找藏文实例数据,其次是汉语,以避免后续构造藏语句子而丢失特殊的词汇表达和句式结构。我们基于以上原则采集藏汉语料构建高质量的藏汉平行语料库,具体方法包括:爬虫获取语料,规则断章对齐检测,语句边界识别,语料库自动清洗。朂终,我们构建了16万级规模的藏汉司法领域语料库,并通过多种翻译模型和交叉实验验证了构建的语料库的高质量特点和鲁棒性。另外,此语料库会弚源以便于相关研究人员用于科研工作。