Yingtao Tian


Learning Bilingual Word Embeddings Using Lexical Definitions
Weijia Shi | Muhao Chen | Yingtao Tian | Kai-Wei Chang
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

Bilingual word embeddings, which represent lexicons of different languages in a shared embedding space, are essential for supporting semantic and knowledge transfers in a variety of cross-lingual NLP tasks. Existing approaches to training bilingual word embeddings require either large collections of predefined seed lexicons that are expensive to obtain, or parallel sentences that comprise coarse and noisy alignment. In contrast, we propose BiLex, which leverages publicly available lexical definitions for bilingual word embedding learning. Without the need for predefined seed lexicons, BiLex comprises a novel word pairing strategy to automatically identify and propagate the precise fine-grained word alignment from lexical definitions. We evaluate BiLex in word-level and sentence-level translation tasks, which seek to find the cross-lingual counterparts of words and sentences respectively. BiLex significantly outperforms previous embedding methods on both tasks.

Learning to Represent Bilingual Dictionaries
Muhao Chen | Yingtao Tian | Haochen Chen | Kai-Wei Chang | Steven Skiena | Carlo Zaniolo
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Bilingual word embeddings have been widely used to capture the correspondence of lexical semantics in different human languages. However, the cross-lingual correspondence between sentences and words is less studied, even though this correspondence can significantly benefit many applications such as cross-lingual semantic search and textual inference. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries. The proposed model is trained to map the lexical definitions to the cross-lingual target words, for which we explore different sentence encoding techniques. To enhance the learning process on limited resources, our model adopts several critical learning strategies, including multi-task learning on different bridges of languages, and joint learning of the dictionary model with a bilingual word embedding model. We conduct experiments on two new tasks. In the cross-lingual reverse dictionary retrieval task, we demonstrate that our model is capable of comprehending bilingual concepts based on descriptions, and the proposed learning strategies are effective. In the bilingual paraphrase identification task, we show that our model effectively associates sentences in different languages via a shared embedding space, and outperforms existing approaches in identifying bilingual paraphrases.


Simple Neologism Based Domain Independent Models to Predict Year of Authorship
Vivek Kulkarni | Yingtao Tian | Parth Dandiwala | Steve Skiena
Proceedings of the 27th International Conference on Computational Linguistics

We present domain-independent models to date documents based only on neologism usage patterns. Our models capture patterns of neologism usage over time to date texts, provide insights into temporal locality of word usage over a span of 150 years, and generalize to various domains such as News, Fiction, and Non-Fiction with competitive performance. Quite intriguingly, we show that by modeling only the distribution of usage counts over neologisms (the model being agnostic of the particular words themselves), we achieve competitive performance using several orders of magnitude fewer features (only 200 input features) compared to state-of-the-art models, some of which use 200K features.