Most of neural language models use different kinds of embeddings for word prediction. While word embeddings can be associated to each word in the vocabulary or derived from characters as well as factored morphological decomposition, these word representations are mainly used to parametrize the input, i.e. the context of prediction. This work investigates the effect of using subword units (character and factored morphological decomposition) to build output representations for neural language modeling. We present a case study on Czech, a morphologically-rich language, experimenting with different input and output representations. When working with the full training vocabulary, despite unstable training, our experiments show that augmenting the output word representations with character-based embeddings can significantly improve the performance of the model. Moreover, reducing the size of the output look-up table, to let the character-based embeddings represent rare words, brings further improvement.
We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best performing basic units for this translation task. BPE identifies the most frequent character sequences as basic units, while orthographic syllables are linguistically motivated pseudo-syllables. We show that BPE units modestly outperform orthographic syllables as units of translation, showing up to 11% increase in BLEU score. While orthographic syllables can be used only for languages whose writing systems use vowel representations, BPE is writing system independent and we show that BPE outperforms other units for non-vowel writing systems too. Our results are supported by extensive experimentation spanning multiple language families and writing systems.
Detecting neologisms is essential in real-time natural language processing applications. Not only can it enable to follow the lexical evolution of languages, but it is also essential for updating linguistic resources and parsers. In this paper, neology detection is considered as a classification task where a system has to assess whether a given lexical item is an actual neologism or not. We propose a combination of an unsupervised data mining technique and a supervised machine learning approach. It is inspired by current researches in stylometry and on token-level and character-level patterns. We train and evaluate our system on a manually designed reference dataset in French and Russian. We show that this approach is able to largely outperform state-of-the-art neology detection systems. Furthermore, character-level patterns exhibit good properties for multilingual extensions of the system.
In this study we address the problem of automated word stress detection in Russian using character level models and no part-speech-taggers. We use a simple bidirectional RNN with LSTM nodes and achieve accuracy of 90% or higher. We experiment with two training datasets and show that using the data from an annotated corpus is much more efficient than using only a dictionary, since it allows to retain the context of the word and its morphological features.
Word embedding has become a fundamental component to many NLP tasks such as named entity recognition and machine translation. However, popular models that learn such embeddings are unaware of the morphology of words, so it is not directly applicable to highly agglutinative languages such as Korean. We propose a syllable-based learning model for Korean using a convolutional neural network, in which word representation is composed of trained syllable vectors. Our model successfully produces morphologically meaningful representation of Korean words compared to the original Skip-gram embeddings. The results also show that it is quite robust to the Out-of-Vocabulary problem.
Recently, there has been increased interest in utilizing characters or subwords for natural language processing (NLP) tasks. However, the effect of utilizing character, subword, and word-level information simultaneously has not been examined so far. In this paper, we propose a model to leverage various levels of input features to improve on the performance of an supersense tagging task. Detailed analysis of experimental results show that different levels of input representation offer distinct characteristics that explain performance discrepancy among different tasks.
Most NLP resources that offer annotations at the word segment level provide morphological annotation that includes features indicating tense, aspect, modality, gender, case, and other inflectional information. Such information is rarely aligned to the relevant parts of the words—i.e. the allomorphs, as such annotation would be very costly. These unaligned weak labelings are commonly provided by annotated NLP corpora such as treebanks in various languages. Although they lack alignment information, the presence/absence of labels at the word level is also consistent with the amount of supervision assumed to be provided to L1 and L2 learners. In this paper, we explore several methods to learn this latent alignment between parts of word forms and the grammatical information provided. All the methods under investigation favor hypotheses regarding allomorphs of morphemes that re-use a small inventory, i.e. implicitly minimize the number of allomorphs that a morpheme can be realized as. We show that the provided information offers a significant advantage for both word segmentation and the learning of allomorphy.
We present a model for predicting word forms based on morphological relational reasoning with analogies. While previous work has explored tasks such as morphological inflection and reinflection, these models rely on an explicit enumeration of morphological features, which may not be available in all cases. To address the task of predicting a word form given a demo relation (a pair of word forms) and a query word, we devise a character-based recurrent neural network architecture using three separate encoders and a decoder. We also investigate a multiclass learning setup, where the prediction of the relation type label is used as an auxiliary task. Our results show that the exact form can be predicted for English with an accuracy of 94.7%. For Swedish, which has a more complex morphology with more inflectional patterns for nouns and verbs, the accuracy is 89.3%. We also show that using the auxiliary task of learning the relation type speeds up convergence and improves the prediction accuracy for the word generation task.
Given the advantage and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character’s glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Being inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks of language modeling and word segmentation, the model learns to represent each character’s task-relevant semantic and syntactic information in the character-level embedding.
Multi-task training is an effective method to mitigate the data sparsity problem. It has recently been applied for cross-lingual transfer learning for paradigm completion—the task of producing inflected forms of lemmata—with sequence-to-sequence networks. However, it is still vague how the model transfers knowledge across languages, as well as if and which information is shared. To investigate this, we propose a set of data-dependent experiments using an existing encoder-decoder recurrent neural network for the task. Our results show that indeed the performance gains surpass a pure regularization effect and that knowledge about language and morphology can be transferred.
We present a semi-supervised way of training a character-based encoder-decoder recurrent neural network for morphological reinflection—the task of generating one inflected wordform from another. This is achieved by using unlabeled tokens or random strings as training data for an autoencoding task, adapting a network for morphological reinflection, and performing multi-task training. We thus use limited labeled data more effectively, obtaining up to 9.92% improvement over state-of-the-art baselines for 8 different languages.
We consider two related problems in this paper. Given an undeciphered alphabetic writing system or mono-alphabetic cipher, determine: (1) which of its letters are vowels and which are consonants; and (2) whether the writing system is a vocalic alphabet or an abjad. We are able to show that a very simple spectral decomposition based on character co-occurrences provides nearly perfect performance with respect to answering both question types.
We introduce a novel method to diminish the problem of out of vocabulary words by introducing an embedding method which leverages the agglutinative property of language. We propose additional embedding derived from syllables and morphemes for the words to improve the performance of language model. We apply the above method to input prediction tasks and achieve state of the art performance in terms of Key Stroke Saving (KSS) w.r.t. to existing device input prediction methods.
Recently, neural models have shown superior performance over conventional models in NER tasks. These models use CNN to extract sub-word information along with RNN to predict a tag for each word. However, these models have been tested almost entirely on English texts. It remains unclear whether they perform similarly in other languages. We worked on Japanese NER using neural models and discovered two obstacles of the state-of-the-art model. First, CNN is unsuitable for extracting Japanese sub-word information. Secondly, a model predicting a tag for each word cannot extract an entity when a part of a word composes an entity. The contributions of this work are (1) verifying the effectiveness of the state-of-the-art NER model for Japanese, (2) proposing a neural model for predicting a tag for each character using word and character information. Experimentally obtained results demonstrate that our model outperforms the state-of-the-art neural English NER model in Japanese.
Out-of-vocabulary words present a great challenge for Machine Translation. Recently various character-level compositional models were proposed to address this issue. In current research we incorporate two most popular neural architectures, namely LSTM and CNN, into hard- and soft-attentional models of translation for character-level representation of the source. We propose semantic and morphological intrinsic evaluation of encoder-level representations. Our analysis of the learned representations reveals that character-based LSTM seems to be better at capturing morphological aspects compared to character-based CNN. We also show that hard-attentional model provides better character-level representations compared to vanilla one.
There are several native languages in Peru which are mostly agglutinative. These languages are transmitted from generation to generation mainly in oral form, causing different forms of writing across different communities. For this reason, there are recent efforts to standardize the spelling in the written texts, and it would be beneficial to support these tasks with an automatic tool such as an spell-checker. In this way, this spelling corrector is being developed based on two steps: an automatic rule-based syllabification method and a character-level graph to detect the degree of error in a misspelled word. The experiments were realized on Shipibo-konibo, a highly agglutinative and amazonian language, and the results obtained have been promising in a dataset built for the purpose.
We propose a new type of subword embedding designed to provide more information about unknown compounds, a major source for OOV words in German. We present an extrinsic evaluation where we use the compound embeddings as input to a neural dependency parser and compare the results to the ones obtained with other types of embeddings. Our evaluation shows that adding compound embeddings yields a significant improvement of 2% LAS over using word embeddings when no POS information is available. When adding POS embeddings to the input, however, the effect levels out. This suggests that it is not the missing information about the semantics of the unknown words that causes problems for parsing German, but the lack of morphological information for unknown words. To augment our evaluation, we also test the new embeddings in a language modelling task that requires both syntactic and semantic information.
We present a general-purpose tagger based on convolutional neural networks (CNN), used for both composing word vectors and encoding context information. The CNN tagger is robust across different tagging tasks: without task-specific tuning of hyper-parameters, it achieves state-of-the-art results in part-of-speech tagging, morphological tagging and supertagging. The CNN tagger is also robust against the out-of-vocabulary problem; it performs well on artificially unnormalized texts.
Pre-trained word embeddings improve the performance of a neural model at the cost of increasing the model size. We propose to benefit from this resource without paying the cost by operating strictly at the sub-lexical level. Our approach is quite simple: before task-specific training, we first optimize sub-word parameters to reconstruct pre-trained word embeddings using various distance measures. We report interesting results on a variety of tasks: word similarity, word analogy, and part-of-speech tagging.
We present a novel supervised approach to inflection generation for verbs in Spanish. Our system takes as input the verb’s lemma form and the desired features such as person, number, tense, and is able to predict the appropriate grammatical conjugation. Even though our approach learns from fewer examples comparing to previous work, it is able to deal with all the Spanish moods (indicative, subjunctive and imperative) in contrast to previous work which only focuses on indicative and subjunctive moods. We show that in an intrinsic evaluation, our system achieves 99% accuracy, outperforming (although not significantly) two competitive state-of-art systems. The successful results obtained clearly indicate that our approach could be integrated into wider approaches related to text generation in Spanish.
We present a solution to the problem of paraphrase identification of questions. We focus on a recent dataset of question pairs annotated with binary paraphrase labels and show that a variant of the decomposable attention model (replacing the word embeddings of the decomposable attention model of Parikh et al. 2016 with character n-gram representations) results in accurate performance on this task, while being far simpler than many competing neural architectures. Furthermore, when the model is pretrained on a noisy dataset of automatically collected question paraphrases, it obtains the best reported performance on the dataset.
In East Asian languages such as Japanese and Chinese, the semantics of a character are (somewhat) reflected in its sub-character elements. This paper examines the effect of using sub-characters for language modeling in Japanese. This is achieved by decomposing characters according to a range of character decomposition datasets, and training a neural language model over variously decomposed character representations. Our results indicate that language modelling can be improved through the inclusion of sub-characters, though this result depends on a good choice of decomposition dataset and the appropriate granularity of decomposition.
This paper presents experiments comparing character-based and byte-based neural machine translation systems. The main motivation of the byte-based neural machine translation system is to build multi-lingual neural machine translation systems that can share the same vocabulary. We compare the performance of both systems in several language pairs and we see that the performance in test is similar for most language pairs while the training time is slightly reduced in the case of byte-based neural machine translation.
Fine-grained sentiment analysis is receiving increasing attention in recent years. Extracting opinion target expressions (OTE) in reviews is often an important step in fine-grained, aspect-based sentiment analysis. Retrieving this information from user-generated text, however, can be difficult. Customer reviews, for instance, are prone to contain misspelled words and are difficult to process due to their domain-specific language. In this work, we investigate whether character-level models can improve the performance for the identification of opinion target expressions. We integrate information about the character structure of a word into a sequence labeling system using character-level word embeddings and show their positive impact on the system’s performance. Specifically, we obtain an increase by 3.3 points F1-score with respect to our baseline model. In further experiments, we reveal encoded character patterns of the learned embeddings and give a nuanced view of the performance differences of both models.