Tiejun Zhao

Also published as: Tie-Jun Zhao, Tie-jun Zhao, TieJun Zhao


2021

pdf bib
Discriminative Reasoning for Document-level Relation Extraction
Wang Xu | Kehai Chen | Tiejun Zhao
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf bib
HITMI&T at SemEval-2021 Task 5: Integrating Transformer and CRF for Toxic Spans Detection
Chenyi Wang | Tianshu Liu | Tiejun Zhao
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper introduces our system at SemEval-2021 Task 5: Toxic Spans Detection. The task aims to accurately locate toxic spans within a text. Using BIO tagging scheme, we model the task as a token-level sequence labeling task. Our system uses a single model built on the model of multi-layer bidirectional transformer encoder. And we introduce conditional random field (CRF) to make the model learn the constraints between tags. We use ERNIE as pre-trained model, which is more suitable for the task accroding to our experiments. In addition, we use adversarial training with the fast gradient method (FGM) to improve the robustness of the system. Our system obtains 69.85% F1 score, ranking 3rd for the official evaluation.

pdf bib
Self-Training for Unsupervised Neural Machine Translation in Unbalanced Training Data Scenarios
Haipeng Sun | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Unsupervised neural machine translation (UNMT) that relies solely on massive monolingual corpora has achieved remarkable results in several translation tasks. However, in real-world scenarios, massive monolingual corpora do not exist for some extremely low-resource languages such as Estonian, and UNMT systems usually perform poorly when there is not adequate training corpus for one language. In this paper, we first define and analyze the unbalanced training data scenario for UNMT. Based on this scenario, we propose UNMT self-training mechanisms to train a robust UNMT system and improve its performance in this case. Experimental results on several language pairs show that the proposed methods substantially outperform conventional UNMT systems.

pdf bib
Issues with Entailment-based Zero-shot Text Classification
Tingting Ma | Jin-Ge Yao | Chin-Yew Lin | Tiejun Zhao
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The general format of natural language inference (NLI) makes it tempting to be used for zero-shot text classification by casting any target label into a sentence of hypothesis and verifying whether or not it could be entailed by the input, aiming at generic classification applicable on any specified label space. In this opinion piece, we point out a few overlooked issues that are yet to be discussed in this line of work. We observe huge variance across different classification datasets amongst standard BERT-based NLI models and surprisingly find that pre-trained BERT without any fine-tuning can yield competitive performance against BERT fine-tuned for NLI. With the concern that these models heavily rely on spurious lexical patterns for prediction, we also experiment with preliminary approaches for more robust NLI, but the results are in general negative. Our observations reveal implicit but challenging difficulties in entailment-based zero-shot text classification.

2020

pdf bib
CN-HIT-MI.T at SemEval-2020 Task 8: Memotion Analysis Based on BERT
Zhen Li | Yaojie Zhang | Bing Xu | Tiejun Zhao
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Internet memes emotion recognition is focused by many researchers. In this paper, we adopt BERT and ResNet for evaluation of detecting the emotions of Internet memes. We focus on solving the problem of data imbalance and data contains noise. We use RandAugment to enhance the data of the picture, and use Training Signal Annealing (TSA) to solve the impact of the imbalance of the label. At the same time, a new loss function is designed to ensure that the model is not affected by input noise which will improve the robustness of the model. We participated in sub-task a and our model based on BERT obtains 34.58% macro F1 score, ranking 10/32.

pdf bib
Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation
Haipeng Sun | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs. However, it can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time. That is, research on multilingual UNMT has been limited. In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder, making use of multilingual data to improve UNMT for all language pairs. On the basis of the empirical findings, we propose two knowledge distillation methods to further enhance multilingual UNMT performance. Our experiments on a dataset with English translated to and from twelve other languages (including three language families and six language branches) show remarkable results, surpassing strong unsupervised individual baselines while achieving promising performance between non-English language pairs in zero-shot translation scenarios and alleviating poor performance in low-resource language pairs.

pdf bib
Demographics Should Not Be the Reason of Toxicity: Mitigating Discrimination in Text Classifications with Instance Weighting
Guanhua Zhang | Bing Bai | Junqi Zhang | Kun Bai | Conghui Zhu | Tiejun Zhao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

With the recent proliferation of the use of text classifications, researchers have found that there are certain unintended biases in text classification datasets. For example, texts containing some demographic identity-terms (e.g., “gay”, “black”) are more likely to be abusive in existing abusive language detection datasets. As a result, models trained with these datasets may consider sentences like “She makes me happy to be gay” as abusive simply because of the word “gay.” In this paper, we formalize the unintended biases in text classification datasets as a kind of selection bias from the non-discrimination distribution to the discrimination distribution. Based on this formalization, we further propose a model-agnostic debiasing training framework by recovering the non-discrimination distribution using instance weighting, which does not require any extra resources or annotations apart from a pre-defined set of demographic identity-terms. Experiments demonstrate that our method can effectively alleviate the impacts of the unintended biases without significantly hurting models’ generalization ability.

pdf bib
CAN-GRU: a Hierarchical Model for Emotion Recognition in Dialogue
Ting Jiang | Bing Xu | Tiejun Zhao | Sheng Li
Proceedings of the 19th Chinese National Conference on Computational Linguistics

Emotion recognition in dialogue systems has gained attention in the field of natural language processing recent years, because it can be applied in opinion mining from public conversational data on social media. In this paper, we propose a hierarchical model to recognize emotions in the dialogue. In the first layer, in order to extract textual features of utterances, we propose a convolutional self-attention network(CAN). Convolution is used to capture n-gram information and attention mechanism is used to obtain the relevant semantic information among words in the utterance. In the second layer, a GRU-based network helps to capture contextual information in the conversation. Furthermore, we discuss the effects of unidirectional and bidirectional networks. We conduct experiments on Friends dataset and EmotionPush dataset. The results show that our proposed model(CAN-GRU) and its variants achieve better performance than baselines.

pdf bib
Robust Machine Reading Comprehension by Learning Soft labels
Zhenyu Zhao | Shuangzhi Wu | Muyun Yang | Kehai Chen | Tiejun Zhao
Proceedings of the 28th International Conference on Computational Linguistics

Neural models have achieved great success on the task of machine reading comprehension (MRC), which are typically trained on hard labels. We argue that hard labels limit the model capability on generalization due to the label sparseness problem. In this paper, we propose a robust training method for MRC models to address this problem. Our method consists of three strategies, 1) label smoothing, 2) word overlapping, 3) distribution prediction. All of them help to train models on soft labels. We validate our approach on the representative architecture - ALBERT. Experimental results show that our method can greatly boost the baseline with 1% improvement in average, and achieve state-of-the-art performance on NewsQA and QUOREF.

pdf bib
Robust Unsupervised Neural Machine Translation with Adversarial Denoising Training
Haipeng Sun | Rui Wang | Kehai Chen | Xugang Lu | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 28th International Conference on Computational Linguistics

Unsupervised neural machine translation (UNMT) has recently attracted great interest in the machine translation community. The main advantage of the UNMT lies in its easy collection of required large training text sentences while with only a slightly worse performance than supervised neural machine translation which requires expensive annotated translation pairs on some translation tasks. In most studies, the UMNT is trained with clean data without considering its robustness to the noisy data. However, in real-world scenarios, there usually exists noise in the collected input sentences which degrades the performance of the translation system since the UNMT is sensitive to the small perturbations of the input sentences. In this paper, we first time explicitly take the noisy data into consideration to improve the robustness of the UNMT based systems. First of all, we clearly defined two types of noises in training sentences, i.e., word noise and word order noise, and empirically investigate its effect in the UNMT, then we propose adversarial training methods with denoising process in the UNMT. Experimental results on several language pairs show that our proposed methods substantially improved the robustness of the conventional UNMT systems in noisy scenarios.

pdf bib
Learning to Decouple Relations: Few-Shot Relation Classification with Entity-Guided Attention and Confusion-Aware Training
Yingyao Wang | Junwei Bao | Guangyi Liu | Youzheng Wu | Xiaodong He | Bowen Zhou | Tiejun Zhao
Proceedings of the 28th International Conference on Computational Linguistics

This paper aims to enhance the few-shot relation classification especially for sentences that jointly describe multiple relations. Due to the fact that some relations usually keep high co-occurrence in the same context, previous few-shot relation classifiers struggle to distinguish them with few annotated instances. To alleviate the above relation confusion problem, we propose CTEG, a model equipped with two novel mechanisms to learn to decouple these easily-confused relations. On the one hand, an Entity -Guided Attention (EGA) mechanism, which leverages the syntactic relations and relative positions between each word and the specified entity pair, is introduced to guide the attention to filter out information causing confusion. On the other hand, a Confusion-Aware Training (CAT) method is proposed to explicitly learn to distinguish relations by playing a pushing-away game between classifying a sentence into a true relation and its confusing relation. Extensive experiments are conducted on the FewRel dataset, and the results show that our proposed model achieves comparable and even much better results to strong baselines in terms of accuracy. Furthermore, the ablation test and case study verify the effectiveness of our proposed EGA and CAT, especially in addressing the relation confusion problem.

pdf bib
Cross Copy Network for Dialogue Generation
Changzhen Ji | Xin Zhou | Yating Zhang | Xiaozhong Liu | Changlong Sun | Conghui Zhu | Tiejun Zhao
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In the past few years, audiences from different fields witness the achievements of sequence-to-sequence models (e.g., LSTM+attention, Pointer Generator Networks and Transformer) to enhance dialogue content generation. While content fluency and accuracy often serve as the major indicators for model training, dialogue logics, carrying critical information for some particular domains, are often ignored. Take customer service and court debate dialogue as examples, compatible logics can be observed across different dialogue instances, and this information can provide vital evidence for utterance generation. In this paper, we propose a novel network architecture - Cross Copy Networks (CCN) to explore the current dialog context and similar dialogue instances’ logical structure simultaneously. Experiments with two tasks, court debate and customer service content generation, proved that the proposed algorithm is superior to existing state-of-art content generation models.

pdf bib
End-to-End Speech Translation with Adversarial Training
Xuancai Li | Chen Kehai | Tiejun Zhao | Muyun Yang
Proceedings of the First Workshop on Automatic Simultaneous Translation

End-to-End speech translation usually leverages audio-to-text parallel data to train an available speech translation model which has shown impressive results on various speech translation tasks. Due to the artificial cost of collecting audio-to-text parallel data, the speech translation is a natural low-resource translation scenario, which greatly hinders its improvement. In this paper, we proposed a new adversarial training method to leverage target monolingual data to relieve the low-resource shortcoming of speech translation. In our method, the existing speech translation model is considered as a Generator to gain a target language output, and another neural Discriminator is used to guide the distinction between outputs of speech translation model and true target monolingual sentences. Experimental results on the CCMT 2019-BSTC dataset speech translation task demonstrate that the proposed methods can significantly improve the performance of the End-to-End speech translation system.

2019

pdf bib
CN-HIT-MI.T at SemEval-2019 Task 6: Offensive Language Identification Based on BiLSTM with Double Attention
Yaojie Zhang | Bing Xu | Tiejun Zhao
Proceedings of the 13th International Workshop on Semantic Evaluation

Offensive language has become pervasive in social media. In Offensive Language Identification tasks, it may be difficult to predict accurately only according to the surface words. So we try to dig deeper semantic information of text. This paper presents use an attention-based two layers bidirectional longshort memory neural network (BiLSTM) for semantic feature extraction. Additionally, a residual connection mechanism is used to synthesize two different deep features, and an emoji attention mechanism is used to extract semantic information of emojis in text. We participated in three sub-tasks of SemEval 2019 Task 6 as CN-HIT-MI.T team. Our macro-averaged F1-score in sub-task A is 0.768, ranking 28/103. We got 0.638 in sub-task B, ranking 30/75. In sub-task C, we got 0.549, ranking 22/65. We also tried some other methods of not submitting results.

pdf bib
Understanding Data Augmentation in Neural Machine Translation: Two Perspectives towards Generalization
Guanlin Li | Lemao Liu | Guoping Huang | Conghui Zhu | Tiejun Zhao
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many Data Augmentation (DA) methods have been proposed for neural machine translation. Existing works measure the superiority of DA methods in terms of their performance on a specific test set, but we find that some DA methods do not exhibit consistent improvements across translation tasks. Based on the observation, this paper makes an initial attempt to answer a fundamental question: what benefits, which are consistent across different methods and tasks, does DA in general obtain? Inspired by recent theoretic advances in deep learning, the paper understands DA from two perspectives towards the generalization ability of a model: input sensitivity and prediction margin, which are defined independent of specific test set thereby may lead to findings with relatively low variance. Extensive experiments show that relatively consistent benefits across five DA methods and four translation tasks are achieved regarding both perspectives.

pdf bib
Unsupervised Bilingual Word Embedding Agreement for Unsupervised Neural Machine Translation
Haipeng Sun | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Unsupervised bilingual word embedding (UBWE), together with other technologies such as back-translation and denoising, has helped unsupervised neural machine translation (UNMT) achieve remarkable results in several language pairs. In previous methods, UBWE is first trained using non-parallel monolingual corpora and then this pre-trained UBWE is used to initialize the word embedding in the encoder and decoder of UNMT. That is, the training of UBWE and UNMT are separate. In this paper, we first empirically investigate the relationship between UBWE and UNMT. The empirical findings show that the performance of UNMT is significantly affected by the performance of UBWE. Thus, we propose two methods that train UNMT with UBWE agreement. Empirical results on several language pairs show that the proposed methods significantly outperform conventional UNMT.

pdf bib
Sentence-Level Agreement for Neural Machine Translation
Mingming Yang | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Min Zhang | Tiejun Zhao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

The training objective of neural machine translation (NMT) is to minimize the loss between the words in the translated sentences and those in the references. In NMT, there is a natural correspondence between the source sentence and the target sentence. However, this relationship has only been represented using the entire neural network and the training objective is computed in word-level. In this paper, we propose a sentence-level agreement module to directly minimize the difference between the representation of source and target sentence. The proposed agreement module can be integrated into NMT as an additional training objective function and can also be used to enhance the representation of the source sentences. Empirical results on the NIST Chinese-to-English and WMT English-to-German tasks show the proposed agreement module can significantly improve the NMT performance.

pdf bib
Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets
Guanhua Zhang | Bing Bai | Jian Liang | Kun Bai | Shiyu Chang | Mo Yu | Conghui Zhu | Tiejun Zhao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Natural Language Sentence Matching (NLSM) has gained substantial attention from both academics and the industry, and rich public datasets contribute a lot to this process. However, biased datasets can also hurt the generalization performance of trained models and give untrustworthy evaluation results. For many NLSM datasets, the providers select some pairs of sentences into the datasets, and this sampling procedure can easily bring unintended pattern, i.e., selection bias. One example is the QuoraQP dataset, where some content-independent naive features are unreasonably predictive. Such features are the reflection of the selection bias and termed as the “leakage features.” In this paper, we investigate the problem of selection bias on six NLSM datasets and find that four out of them are significantly biased. We further propose a training and evaluation framework to alleviate the bias. Experimental results on QuoraQP suggest that the proposed framework can improve the generalization ability of trained models, and give more trustworthy evaluation results for real-world adoptions.

pdf bib
Understanding and Improving Hidden Representations for Neural Machine Translation
Guanlin Li | Lemao Liu | Xintong Li | Conghui Zhu | Tiejun Zhao | Shuming Shi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Multilayer architectures are currently the gold standard for large-scale neural machine translation. Existing works have explored some methods for understanding the hidden representations, however, they have not sought to improve the translation quality rationally according to their understanding. Towards understanding for performance improvement, we first artificially construct a sequence of nested relative tasks and measure the feature generalization ability of the learned hidden representation over these tasks. Based on our understanding, we then propose to regularize the layer-wise representations with all tree-induced tasks. To overcome the computational bottleneck resulting from the large number of regularization terms, we design efficient approximation methods by selecting a few coarse-to-fine tasks for regularization. Extensive experiments on two widely-used datasets demonstrate the proposed methods only lead to small extra overheads in training but no additional overheads in testing, and achieve consistent improvements (up to +1.3 BLEU) compared to the state-of-the-art translation model.

pdf bib
Improving Neural Machine Translation with Neural Syntactic Distance
Chunpeng Ma | Akihiro Tamura | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The explicit use of syntactic information has been proved useful for neural machine translation (NMT). However, previous methods resort to either tree-structured neural networks or long linearized sequences, both of which are inefficient. Neural syntactic distance (NSD) enables us to represent a constituent tree using a sequence whose length is identical to the number of words in the sentence. NSD has been used for constituent parsing, but not in machine translation. We propose five strategies to improve NMT with NSD. Experiments show that it is not trivial to improve NMT with NSD; however, the proposed strategies are shown to improve translation performance of the baseline model (+2.1 (En–Ja), +1.3 (Ja–En), +1.2 (En–Ch), and +1.0 (Ch–En) BLEU).

2018

pdf bib
Neural Document Summarization by Jointly Learning to Score and Select Sentences
Qingyu Zhou | Nan Yang | Furu Wei | Shaohan Huang | Ming Zhou | Tiejun Zhao
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Sentence scoring and sentence selection are two main steps in extractive document summarization systems. However, previous works treat them as two separated subtasks. In this paper, we present a novel end-to-end neural network framework for extractive document summarization by jointly learning to score and select sentences. It first reads the document sentences with a hierarchical encoder to obtain the representation of sentences. Then it builds the output summary by extracting sentences one by one. Different from previous methods, our approach integrates the selection strategy into the scoring model, which directly predicts the relative importance given previously selected sentences. Experiments on the CNN/Daily Mail dataset show that the proposed framework significantly outperforms the state-of-the-art extractive summarization models.

pdf bib
Forest-Based Neural Machine Translation
Chunpeng Ma | Akihiro Tamura | Masao Utiyama | Tiejun Zhao | Eiichiro Sumita
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Tree-based neural machine translation (NMT) approaches, although achieved impressive performance, suffer from a major drawback: they only use the 1-best parse tree to direct the translation, which potentially introduces translation mistakes due to parsing errors. For statistical machine translation (SMT), forest-based methods have been proven to be effective for solving this problem, while for NMT this kind of approach has not been attempted. This paper proposes a forest-based NMT method that translates a linearized packed forest under a simple sequence-to-sequence framework (i.e., a forest-to-sequence NMT model). The BLEU score of the proposed method is higher than that of the sequence-to-sequence NMT, tree-based NMT, and forest-based SMT systems.

2017

pdf bib
Context-Aware Smoothing for Neural Machine Translation
Kehai Chen | Rui Wang | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In Neural Machine Translation (NMT), each word is represented as a low-dimension, real-value vector for encoding its syntax and semantic information. This means that even if the word is in a different sentence context, it is represented as the fixed vector to learn source representation. Moreover, a large number of Out-Of-Vocabulary (OOV) words, which have different syntax and semantic information, are represented as the same vector representation of “unk”. To alleviate this problem, we propose a novel context-aware smoothing method to dynamically learn a sentence-specific vector for each word (including OOV words) depending on its local context words in a sentence. The learned context-aware representation is integrated into the NMT to improve the translation performance. Empirical results on NIST Chinese-to-English translation task show that the proposed approach achieves 1.78 BLEU improvements on average over a strong attentional NMT, and outperforms some existing systems.

pdf bib
Neural Machine Translation with Source Dependency Representation
Kehai Chen | Rui Wang | Masao Utiyama | Lemao Liu | Akihiro Tamura | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Source dependency information has been successfully introduced into statistical machine translation. However, there are only a few preliminary attempts for Neural Machine Translation (NMT), such as concatenating representations of source word and its dependency label together. In this paper, we propose a novel NMT with source dependency representation to improve translation performance of NMT, especially long sentences. Empirical results on NIST Chinese-to-English translation task show that our method achieves 1.6 BLEU improvements on average over a strong NMT system.

2016

pdf bib
A Distribution-based Model to Learn Bilingual Word Embeddings
Hailong Cao | Tiejun Zhao | Shu Zhang | Yao Meng
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We introduce a distribution based model to learn bilingual word embeddings from monolingual data. It is simple, effective and does not require any parallel data or any seed lexicon. We take advantage of the fact that word embeddings are usually in form of dense real-valued low-dimensional vector and therefore the distribution of them can be accurately estimated. A novel cross-lingual learning objective is proposed which directly matches the distributions of word embeddings in one language with that in the other language. During the joint learning process, we dynamically estimate the distributions of word embeddings in two languages respectively and minimize the dissimilarity between them through standard back propagation algorithm. Our learned bilingual word embeddings allow to group each word and its translations together in the shared vector space. We demonstrate the utility of the learned embeddings on the task of finding word-to-word translations from monolingual corpora. Our model achieved encouraging performance on data in both related languages and substantially different languages.

pdf bib
Constraint-Based Question Answering with Knowledge Graph
Junwei Bao | Nan Duan | Zhao Yan | Ming Zhou | Tiejun Zhao
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

WebQuestions and SimpleQuestions are two benchmark data-sets commonly used in recent knowledge-based question answering (KBQA) work. Most questions in them are ‘simple’ questions which can be answered based on a single relation in the knowledge base. Such data-sets lack the capability of evaluating KBQA systems on complicated questions. Motivated by this issue, we release a new data-set, namely ComplexQuestions, aiming to measure the quality of KBQA systems on ‘multi-constraint’ questions which require multiple knowledge base relations to get the answer. Beside, we propose a novel systematic KBQA approach to solve multi-constraint questions. Compared to state-of-the-art methods, our approach not only obtains comparable results on the two existing benchmark data-sets, but also achieves significant improvements on the ComplexQuestions.

pdf bib
Building A Case-based Semantic English-Chinese Parallel Treebank
Huaxing Shi | Tiejun Zhao | Keh-Yih Su
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We construct a case-based English-to-Chinese semantic constituent parallel Treebank for a Statistical Machine Translation (SMT) task by labelling each node of the Deep Syntactic Tree (DST) with our refined semantic cases. Since subtree span-crossing is harmful in tree-based SMT, DST is adopted to alleviate this problem. At the same time, we tailor an existing case set to represent bilingual shallow semantic relations more precisely. This Treebank is a part of a semantic corpus building project, which aims to build a semantic bilingual corpus annotated with syntactic, semantic cases and word senses. Data in our Treebank is from the news domain of Datum corpus. 4,000 sentence pairs are selected to cover various lexicons and part-of-speech (POS) n-gram patterns as much as possible. This paper presents the construction of this case Treebank. Also, we have tested the effect of adopting DST structure in alleviating subtree span-crossing. Our preliminary analysis shows that the compatibility between Chinese and English trees can be significantly increased by transforming the parse-tree into the DST. Furthermore, the human agreement rate in annotation is found to be acceptable (90% in English nodes, 75% in Chinese nodes).

2015

pdf bib
Efficient Disfluency Detection with Transition-based Parsing
Shuangzhi Wu | Dongdong Zhang | Ming Zhou | Tiejun Zhao
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib
Improving Pivot-Based Statistical Machine Translation by Pivoting the Co-occurrence Count of Phrase Pairs
Xiaoning Zhu | Zhongjun He | Hua Wu | Conghui Zhu | Haifeng Wang | Tiejun Zhao
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
A Lexicalized Reordering Model for Hierarchical Phrase-based Translation
Hailong Cao | Dongdong Zhang | Mu Li | Ming Zhou | Tiejun Zhao
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Soft Dependency Matching for Hierarchical Phrase-based Machine Translation
Hailong Cao | Dongdong Zhang | Ming Zhou | Tiejun Zhao
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Knowledge-Based Question Answering as Machine Translation
Junwei Bao | Nan Duan | Ming Zhou | Tiejun Zhao
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2013

pdf bib
Tuning SMT with a Large Number of Features via Online Feature Grouping
Lemao Liu | Tiejun Zhao | Taro Watanabe | Eiichiro Sumita
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Repairing Incorrect Translation with Examples
Junguo Zhu | Muyun Yang | Sheng Li | Tiejun Zhao
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Improving Pivot-Based Statistical Machine Translation Using Random Walk
Xiaoning Zhu | Zhongjun He | Hua Wu | Haifeng Wang | Conghui Zhu | Tiejun Zhao
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Additive Neural Networks for Statistical Machine Translation
Lemao Liu | Taro Watanabe | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Hierarchical Phrase Table Combination for Machine Translation
Conghui Zhu | Taro Watanabe | Eiichiro Sumita | Tiejun Zhao
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Cross-lingual Projections between Languages from Different Families
Mo Yu | Tiejun Zhao | Yalong Bai | Hao Tian | Dianhai Yu
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
A Tightly-coupled Unsupervised Clustering and Bilingual Alignment Model for Transliteration
Tingting Li | Tiejun Zhao | Andrew Finch | Chunyue Zhang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Compound Embedding Features for Semi-supervised Learning
Mo Yu | Tiejun Zhao | Daxiang Dong | Hao Tian | Dianhai Yu
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

pdf bib
Locally Training the Log-Linear Model for SMT
Lemao Liu | Hailong Cao | Taro Watanabe | Tiejun Zhao | Mo Yu | Conghui Zhu
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Expected Error Minimization with Ultraconservative Update for SMT
Lemao Liu | Tiejun Zhao | Taro Watanabe | Hailong Cao | Conghui Zhu
Proceedings of COLING 2012: Posters

pdf bib
The HIT-LTRC machine translation system for IWSLT 2012
Xiaoning Zhu | Yiming Cui | Conghui Zhu | Tiejun Zhao | Hailong Cao
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

In this paper, we describe HIT-LTRC's participation in the IWSLT 2012 evaluation campaign. In this year, we took part in the Olympics Task which required the participants to translate Chinese to English with limited data. Our system is based on Moses[1], which is an open source machine translation system. We mainly used the phrase-based models to carry out our experiments, and factored-based models were also performed in comparison. All the involved tools are freely available. In the evaluation campaign, we focus on data selection, phrase extraction method comparison and phrase table combination.

pdf bib
Syllable-based Machine Transliteration with Extra Phrase Features
Chunyue Zhang | Tingting Li | Tiejun Zhao
Proceedings of the 4th Named Entity Workshop (NEWS) 2012

2011

pdf bib
A Unified and Discriminative Soft Syntactic Constraint Model for Hierarchical Phrase-based Translation
Lemao Liu | Tiejun Zhao | Chao Wang | Hailong Cao
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Hypergraph Training and Decoding of System Combination in SMT
Yupeng Liu | Tiejun Zhao | Sheng Li
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Target-dependent Twitter Sentiment Classification
Long Jiang | Mo Yu | Ming Zhou | Xiaohua Liu | Tiejun Zhao
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

pdf bib
A Joint Rule Selection Model for Hierarchical Phrase-Based Translation
Lei Cui | Dongdong Zhang | Mu Li | Ming Zhou | Tiejun Zhao
Proceedings of the ACL 2010 Conference Short Papers

pdf bib
PKU_HIT: An Event Detection System Based on Instances Expansion and Rich Syntactic Features
Shiqi Li | Pengyuan Liu | Tiejun Zhao | Qin Lu | Hanjing Li
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
PengYuan@PKU: Extracting Infrequent Sense Instance with the Same N-Gram Pattern for the SemEval-2010 Task 15
Peng-Yuan Liu | Shi-Wen Yu | Shui Liu | Tie-Jun Zhao
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Hybrid Decoding: Decoding with Partial Hypotheses Combination over Multiple SMT Systems
Lei Cui | Dongdong Zhang | Mu Li | Ming Zhou | Tiejun Zhao
Coling 2010: Posters

pdf bib
Combining Constituent and Dependency Syntactic Views for Chinese Semantic Role Labeling
Shiqi Li | Qin Lu | Tiejun Zhao | Pengyuan Liu | Hanjing Li
Coling 2010: Posters

pdf bib
Reexamination on Potential for Personalization in Web Search
Daren Li | Muyun Yang | HaoLiang Qi | Sheng Li | Tiejun Zhao
Coling 2010: Posters

pdf bib
Head-modifier Relation based Non-lexical Reordering Model for Phrase-Based Translation
Shui Liu | Sheng Li | Tiejun Zhao | Min Zhang | Pengyuan Liu
Coling 2010: Posters

pdf bib
Utilizing Variability of Time and Term Content, within and across Users in Session Detection
Shuqi Sun | Sheng Li | Muyun Yang | Haoliang Qi | Tiejun Zhao
Coling 2010: Posters

pdf bib
All in Strings: a Powerful String-based Automatic MT Evaluation Metric with Multiple Granularities
Junguo Zhu | Muyun Yang | Bo Wang | Sheng Li | Tiejun Zhao
Coling 2010: Posters

pdf bib
Using Deep Belief Nets for Chinese Named Entity Categorization
Yu Chen | You Ouyang | Wenjie Li | Dequan Zheng | Tiejun Zhao
Proceedings of the 2010 Named Entities Workshop

pdf bib
Exploring Deep Belief Network for Chinese Relation Extraction
Yu Chen | Wenjie Li | Yan Liu | Dequan Zheng | Tiejun Zhao
CIPS-SIGHAN Joint Conference on Chinese Language Processing

2009

pdf bib
References Extension for the Automatic Evaluation of MT by Syntactic Hybridization
Bo Wang | Tiejun Zhao | Muyun Yang | Sheng Li
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

pdf bib
A Study of Translation Rule Classification for Syntax-based Statistical Machine Translation
Hongfei Jiang | Sheng Li | Muyun Yang | Tiejun Zhao
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

pdf bib
Train the Machine with What It Can Learn—Corpus Selection for SMT
Xiwu Han | Hanzhang Li | Tiejun Zhao
Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora (BUCC)

pdf bib
A Statistical Machine Translation Model Based on a Synthetic Synchronous Grammar
Hongfei Jiang | Muyun Yang | Tiejun Zhao | Sheng Li | Bo Wang
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf bib
Chinese Term Extraction Using Different Types of Relevance
Yuhang Yang | Tiejun Zhao | Qin Lu | Dequan Zheng | Hao Yu
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

2008

pdf bib
Chinese Term Extraction Based on Delimiters
Yuhang Yang | Qin Lu | Tiejun Zhao
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Existing techniques extract term candidates by looking for internal and contextual information associated with domain specific terms. The algorithms always face the dilemma that fewer features are not enough to distinguish terms from non-terms whereas more features lead to more conflicts among selected features. This paper presents a novel approach for term extraction based on delimiters which are much more stable and domain independent. The proposed approach is not as sensitive to term frequency as that of previous works. This approach has no strict limit or hard rules and thus they can deal with all kinds of terms. It also requires no prior domain knowledge and no additional training to adapt to new domains. Consequently, the proposed approach can be applied to different domains easily and it is especially useful for resource-limited domains. Evaluations conducted on two different domains for Chinese term extraction show significant improvements over existing techniques which verifies its efficiency and domain independent nature. Experiments on new term extraction indicate that the proposed approach can also serve as an effective tool for domain lexicon expansion.

pdf bib
Chinese Term Extraction Using Minimal Resources
Yuhang Yang | Qin Lu | Tiejun Zhao
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Diagnostic Evaluation of Machine Translation Systems Using Automatically Constructed Linguistic Check-Points
Ming Zhou | Bo Wang | Shujie Liu | Mu Li | Dongdong Zhang | Tiejun Zhao
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

2007

pdf bib
Meta-Structure Transformation Model for Statistical Machine Translation
Jiadong Sun | Tiejun Zhao | Huashen Liang
Proceedings of the Second Workshop on Statistical Machine Translation

pdf bib
The Extraction of Trajectories from Real Texts Based on Linear Classification
Hanjing Li | Tiejun Zhao | Sheng Li | Jiyuan Zhao
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

pdf bib
A Unified Tagging Approach to Text Normalization
Conghui Zhu | Jie Tang | Hang Li | Hwee Tou Ng | Tiejun Zhao
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf bib
HIT-WSD: Using Search Engine for Multilingual Chinese-English Lexical Sample Task
PengYuan Liu | TieJun Zhao | MuYun Yang
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib
Two-Fold Filtering for Chinese Subcategorization Acquisition with Diathesis Alternations Used as Heuristic Information
Xiwu Han | Tiejun Zhao
International Journal of Computational Linguistics & Chinese Language Processing, Volume 11, Number 2, June 2006

pdf bib
Improving English Subcategorization Acquisition with Diathesis Alternations as Heuristic Information
Xiwu Han | Tiejun Zhao | Xingshang Fu
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf bib
A Hybrid Chinese Language Model based on a Combination of Ontology with Statistical Method
Dequan Zheng | Tiejun Zhao | Sheng Li | Hao Yu
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

2004

pdf bib
Subcategorization Acquisition and Evaluation for Chinese Verbs
Xiwu Han | Tiejun Zhao | Haoliang Qi | Hao Yu
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

2002

pdf bib
Learning Chinese Bracketing Knowledge Based on a Bilingual Language Model
Yajuan Lü | Sheng Li | Tiejun Zhao | Muyun Yang
COLING 2002: The 19th International Conference on Computational Linguistics

pdf bib
An Automatic Evaluation Method for Localization Oriented Lexicalised EBMT System
Jianmin Yao | Ming Zhou | Tiejun Zhao | Hao Yu | Sheng Li
COLING 2002: The 19th International Conference on Computational Linguistics

pdf bib
Automatic Information Transfer between English and Chinese
Jianmin Yao | Hao Yu | Tiejun Zhao | Xiaohong Li
COLING-02: Machine Translation in Asia

2001

pdf bib
Automatic Detection of Prosody Phrase Boundaries for Text-to-Speech System
Xin Lv | Tie-jun Zhao | Zhan-yi Liu | Mu-yun Yang
Proceedings of the Seventh International Workshop on Parsing Technologies

pdf bib
Automatic Translation Template Acquisition Based on Bilingual Structure Alignment
Yajuan Lu | Ming Zhou | Sheng Li | Changning Huang | Tiejun Zhao
International Journal of Computational Linguistics & Chinese Language Processing, Volume 6, Number 1, February 2001: Special Issue on Natural Language Processing Researches in MSRA

2000

pdf bib
Statistics Based Hybrid Approach to Chinese Base Phrase Identification
Tie-jun Zhao | Mu-yun Yang | Fang Liu | Jian-min Yao | Hao Yu
Second Chinese Language Processing Workshop

Search