What’s Hidden in a One-layer Randomly Weighted Transformer?
Sheng Shen | Zhewei Yao | Douwe Kiela | Kurt Keutzer | Michael Mahoney
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
We demonstrate that, hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance, without ever modifying the weight initializations, on machine translation tasks. To find subnetworks for one-layer randomly weighted neural networks, we apply different binary masks to the same weight matrix to generate different layers. Hidden within a one-layer randomly weighted Transformer, we find subnetworks that can achieve 29.45/17.29 BLEU on IWSLT14/WMT14. Using a fixed pre-trained embedding layer, the previously found subnetworks are smaller than, but can match 98%/92% (34.14/25.24 BLEU) of the performance of, a trained Transformer small/base on IWSLT14/WMT14. Furthermore, we demonstrate the effectiveness of larger and deeper transformers in this setting, as well as the impact of different initialization methods.
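The idea of applying different binary masks to one shared weight matrix can be illustrated with a minimal sketch. It is not the authors' implementation: the per-layer scores, the top-k selection rule, the `keep_ratio` parameter, and all function names are assumptions made for illustration, and the scores here are random rather than learned.

```python
import random


def make_mask(scores, keep_ratio):
    """Return a binary mask that keeps the highest-scoring entries.

    Assumption: masks come from per-entry scores via top-k selection.
    The abstract only states that different binary masks are applied.
    """
    flat = sorted((s for row in scores for s in row), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def masked_linear(x, weight, mask):
    """Compute (weight * mask) @ x without modifying `weight`."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weight, mask)
    ]


rng = random.Random(0)
dim = 4

# One randomly initialized weight matrix, shared by every layer and never updated.
shared_weight = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
weight_snapshot = [row[:] for row in shared_weight]

# Hypothetical per-layer scores; random here, learned in a real setup.
layer_scores = [
    [[rng.random() for _ in range(dim)] for _ in range(dim)] for _ in range(2)
]
layer_masks = [make_mask(scores, keep_ratio=0.5) for scores in layer_scores]

# Two "layers" built from the same weights, differing only in their masks.
hidden = [1.0, 2.0, 3.0, 4.0]
for mask in layer_masks:
    hidden = masked_linear(hidden, shared_weight, mask)
```

In this sketch, each layer adds only a binary mask on top of the shared weights, so the values in `shared_weight` stay exactly as initialized.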
MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding
Qinxin Wang | Hao Tan | Sheng Shen | Michael Mahoney | Zhewei Yao
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Phrase localization is a task that studies the mapping from textual phrases to regions of an image. Given difficulties in annotating phrase-to-object datasets at scale, we develop a Multimodal Alignment Framework (MAF) to leverage more widely-available caption-image datasets, which can then be used as a form of weak supervision. We first present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations. By adopting a contrastive objective, our method uses information in caption-image pairs to boost the performance in weakly-supervised scenarios. Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods. With the help of the visually-aware language representations, we can also improve the previous best unsupervised result by 5.56%. We conduct ablation studies to show that both our novel model and our weakly-supervised strategies significantly contribute to our strong results.
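A contrastive objective over caption-image pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the MAF implementation: the dot-product relevance, the max-over-regions aggregation, the softmax cross-entropy form of the loss, and all names and toy vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def caption_image_score(phrases, regions):
    """Average, over phrases, of the similarity to the best-matching region.

    Assumption: phrase-object relevance is a plain dot product of
    already-computed phrase and region feature vectors.
    """
    return sum(max(dot(p, r) for r in regions) for p in phrases) / len(phrases)


def contrastive_loss(captions, images):
    """Softmax cross-entropy treating captions[i] and images[i] as the true pair."""
    total = 0.0
    for i, phrases in enumerate(captions):
        scores = [caption_image_score(phrases, regions) for regions in images]
        top = max(scores)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_denominator - scores[i]
    return total / len(captions)


# Toy features: each caption is a list of phrase vectors,
# each image is a list of region vectors.
captions = [[[1.0, 0.0]], [[0.0, 1.0]]]
images = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
]

aligned_loss = contrastive_loss(captions, images)
mismatched_loss = contrastive_loss(captions, images[::-1])
```

On this toy data the correctly paired batch gets a lower loss than the batch with the images swapped. That difference is the only supervision the sketch uses: no phrase-to-region annotations are needed.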
- Sheng Shen 2
- Michael Mahoney 2
- Qinxin Wang 1
- Hao Tan 1
- Douwe Kiela 1