Pretraining-based neural network models have demonstrated state-of-the-art (SOTA) performances on natural language processing (NLP) tasks. The most frequently used sentence representation for neural-based NLP methods is a sequence of subwords that is different from the sentence representation of non-neural methods that are created using basic NLP technologies, such as part-of-speech (POS) tagging, named entity (NE) recognition, and parsing. Most neural-based NLP models receive only vectors encoded from a sequence of subwords obtained from an input text. However, basic NLP information, such as POS tags, NEs, parsing results, etc, cannot be obtained explicitly from only the large unlabeled text used in pretraining-based models. This paper explores use of NEs on two Japanese tasks; document classification and headline generation using Transformer-based models, to reveal the effectiveness of basic NLP information. The experimental results with eight basic NEs and approximately 200 extended NEs show that NEs improve accuracy although a large pretraining-based model trained using 70 GB text data was used.
By predicting chemical compound structures from their names, we can better comprehend chemical compounds written in text and identify the same chemical compound given different notations for database creation. Previous methods have predicted the chemical compound structures from their names and represented them by Simplified Molecular Input Line Entry System (SMILES) strings. However, these methods mainly apply handcrafted rules, and cannot predict the structures of chemical compound names not covered by the rules. Instead of handcrafted rules, we propose Transformer-based models that predict SMILES strings from chemical compound names. We improve the conventional Transformer-based model by introducing two features: (1) a loss function that constrains the number of atoms of each element in the structure, and (2) a multi-task learning approach that predicts both SMILES strings and InChI strings (another string representation of chemical compound structures). In evaluation experiments, our methods achieved higher F-measures than previous rule-based approaches (Open Parser for Systematic IUPAC Nomenclature and two commercially used products), and the conventional Transformer-based model. We release the dataset used in this paper as a benchmark for the future research.