Hidetoshi Shimodaira


Improving word mover’s distance by leveraging self-attention matrix
Hiroaki Yamagiwa | Sho Yokoi | Hidetoshi Shimodaira
Findings of the Association for Computational Linguistics: EMNLP 2023

Measuring the semantic similarity between two sentences is still an important task. The word mover’s distance (WMD) computes the similarity via the optimal alignment between the sets of word embeddings. However, WMD does not utilize word order, making it challenging to distinguish sentences with significant overlaps of similar words, even if they are semantically very different. Here, we attempt to improve WMD by incorporating the sentence structure represented by BERT’s self-attention matrix (SAM). The proposed method is based on the Fused Gromov-Wasserstein distance, which simultaneously considers the similarity of the word embedding and the SAM for calculating the optimal transport between two sentences. Experiments demonstrate the proposed method enhances WMD and its variants in paraphrase identification with near-equivalent performance in semantic textual similarity.

Norm of Word Embedding Encodes Information Gain
Momose Oyama | Sho Yokoi | Hidetoshi Shimodaira
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Distributed representations of words encode lexical semantic information, but what type of information is encoded and how? Focusing on the skip-gram with negative-sampling method, we found that the squared norm of static word embedding encodes the information gain conveyed by the word; the information gain is defined by the Kullback-Leibler divergence of the co-occurrence distribution of the word to the unigram distribution. Our findings are explained by the theoretical framework of the exponential family of probability distributions and confirmed through precise experiments that remove spurious correlations arising from word frequency. This theory also extends to contextualized word embeddings in language models or any neural networks with the softmax output layer. We also demonstrate that both the KL divergence and the squared norm of embedding provide a useful metric of the informativeness of a word in tasks such as keyword extraction, proper-noun discrimination, and hypernym discrimination.
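
As a rough illustration of the information gain defined in this abstract, the sketch below computes the Kullback-Leibler divergence from a word's co-occurrence distribution to the unigram distribution using raw counts. The function name `information_gain` and the toy counts are assumptions made for illustration; they are not taken from the paper, and the sketch does not reproduce the paper's estimation or frequency-correction procedure.

```python
import math

def information_gain(cooc_counts, unigram_counts):
    """KL( p(.|w) || p(.) ) in nats, estimated from raw counts.

    cooc_counts:    {context_word: count of co-occurrences with the word w}
    unigram_counts: {context_word: corpus count}; assumed to be positive
                    for every key that appears in cooc_counts.
    """
    total_cooc = sum(cooc_counts.values())
    total_uni = sum(unigram_counts.values())
    kl = 0.0
    for ctx, count in cooc_counts.items():
        if count == 0:
            continue  # terms with p = 0 contribute nothing
        p = count / total_cooc               # p(ctx | w)
        q = unigram_counts[ctx] / total_uni  # p(ctx)
        kl += p * math.log(p / q)
    return kl

# Toy counts (hypothetical): a word whose contexts mirror the corpus
# carries no information gain; a word with concentrated contexts carries more.
unigram = {"a": 2, "b": 2}
generic_word = {"a": 1, "b": 1}
specific_word = {"a": 2}
print(information_gain(generic_word, unigram))
print(information_gain(specific_word, unigram))
```

Under the abstract's claim, the second word would be expected to have the larger squared embedding norm; the sketch only evaluates the KL side of that relationship.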

Discovering Universal Geometry in Embeddings with ICA
Hiroaki Yamagiwa | Momose Oyama | Hidetoshi Shimodaira
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

This study utilizes Independent Component Analysis (ICA) to unveil a consistent semantic structure within embeddings of words or images. Our approach extracts independent semantic components from the embeddings of a pre-trained model by leveraging anisotropic information that remains after the whitening process in Principal Component Analysis (PCA). We demonstrate that each embedding can be expressed as a composition of a few intrinsic interpretable axes and that these semantic axes remain consistent across different languages, algorithms, and modalities. The discovery of a universal semantic structure in the geometric patterns of embeddings enhances our understanding of the representations in embeddings.


Segmentation-free compositional n-gram embedding
Geewook Kim | Kazuki Fukui | Hidetoshi Shimodaira
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a new type of representation learning method that models words, phrases and sentences seamlessly. Our method does not depend on word segmentation or any human-annotated resources (e.g., word dictionaries), yet it is very effective for noisy corpora written in unsegmented languages such as Chinese and Japanese. The main idea of our method is to ignore word boundaries completely (i.e., segmentation-free), and construct representations for all character n-grams in a raw corpus with embeddings of compositional sub-n-grams. Although the idea is simple, our experiments on various benchmarks and real-world datasets show the efficacy of our proposal.
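
The segmentation-free idea in this abstract can be sketched as follows: enumerate every character n-gram of a raw string without consulting word boundaries, then build an n-gram's representation from the vectors of its sub-n-grams. The helper names `char_ngrams` and `compose`, the toy embedding table, and the choice of plain summation as the composition rule are all assumptions for illustration; the paper's actual composition and training objective may differ.

```python
def char_ngrams(text, n_max):
    """All character n-grams of length 1..n_max, ignoring word boundaries."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]

def compose(ngram, table, dim):
    """Sum the vectors of every sub-n-gram of `ngram` that is in `table`.

    Summation is an assumed composition rule; sub-n-grams missing from
    the table contribute a zero vector.
    """
    vec = [0.0] * dim
    for sub in char_ngrams(ngram, len(ngram)):
        for k, x in enumerate(table.get(sub, [0.0] * dim)):
            vec[k] += x
    return vec

# Hypothetical 2-dimensional embedding table for a few sub-n-grams.
toy_table = {"a": [1.0, 0.0], "b": [0.0, 1.0], "ab": [0.5, 0.5]}

print(char_ngrams("abc", 2))
print(compose("ab", toy_table, 2))
```

Because the enumeration never looks for spaces or dictionary entries, the same code applies unchanged to unsegmented text such as Chinese or Japanese.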


Word-like character n-gram embedding
Geewook Kim | Kazuki Fukui | Hidetoshi Shimodaira
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solved this problem by introducing the idea of expected word frequency. Compared to the previously proposed methods, our method can embed more words, along with the words that are not included in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the text in the corpus is in an unsegmented language and contains many neologisms and informal words (e.g., Chinese SNS dataset). Our experimental results on Sina Weibo (a Chinese microblog service) and Twitter show that the proposed method can embed more words and improve the performance of downstream tasks.


Spectral Graph-Based Method of Multimodal Word Embedding
Kazuki Fukui | Takamasa Oshikiri | Hidetoshi Shimodaira
Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing

In this paper, we propose a novel method for multimodal word embedding, which exploits a generalized framework of multi-view spectral graph embedding to take into account visual appearances or scenes denoted by words in a corpus. We evaluated our method through word similarity tasks and a concept-to-image search task, and found that it provides word representations that reflect visual information, while somewhat trading off performance on the word similarity tasks. Moreover, we demonstrate that our method captures multimodal linguistic regularities, which enable recovering relational similarities between words and images by vector arithmetic.


Cross-Lingual Word Representations via Spectral Graph Embeddings
Takamasa Oshikiri | Kazuki Fukui | Hidetoshi Shimodaira
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)