Alex Smola


pdf bib
A Cheaper and Better Diffusion Language Model with Soft-Masked Noise
Jiaao Chen | Aston Zhang | Mu Li | Alex Smola | Diyi Yang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Diffusion models that are based on iterative denoising have been recently proposed and leveraged in various generation tasks like image generation. Whereas, as a way inherently built for continuous data, existing diffusion models still have some limitations in modeling discrete data, e.g., languages. For example, the generally used Gaussian noise can not handle the discrete corruption well, and the objectives in continuous spaces fail to be stable for textual data in the diffusion process especially when the dimension is high. To alleviate these issues, we introduce a novel diffusion model for language modeling, Masked-Diffuse LM, with lower training cost and better performances, inspired by linguistic features in languages. Specifically, we design a linguistic-informed forward process which adds corruptions to the text through strategically soft-masking to better noise the textual data. Also, we directly predict the categorical distribution with cross-entropy loss function in every diffusion step to connect the continuous space and discrete space in a more efficient and straightforward way. Through experiments on 5 controlled generation tasks, we demonstrate that our Masked-Diffuse LM can achieve better generation quality than the state-of-the-art diffusion models with better efficiency.


pdf bib
Neural Machine Translation with Recurrent Attention Modeling
Zichao Yang | Zhiting Hu | Yuntian Deng | Chris Dyer | Alex Smola
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Knowing which words have been attended to in previous time steps while generating a translation is a rich source of information for predicting what words will be attended to in the future. We improve upon the attention model of Bahdanau et al. (2014) by explicitly modeling the relationship between previous and subsequent attention levels for each word using one recurrent network per input word. This architecture easily captures informative features, such as fertility and regularities in relative distortion. In experiments, we show our parameterization of attention improves translation quality.


pdf bib
Hierarchical Attention Networks for Document Classification
Zichao Yang | Diyi Yang | Chris Dyer | Xiaodong He | Alex Smola | Eduard Hovy
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies


pdf bib
Semi-Markov Models for Sequence Segmentation
Qinfeng Shi | Yasemin Altun | Alex Smola | S.V.N. Vishwanathan
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)