Veronika Ganeeva
2024
Lost in Translation: Chemical Language Models and the Misunderstanding of Molecule Structures
Veronika Ganeeva
|
Andrey Sakhovskiy
|
Kuzma Khrabrov
|
Andrey Savchenko
|
Artur Kadurin
|
Elena Tutubalina
Findings of the Association for Computational Linguistics: EMNLP 2024
The recent integration of chemistry with natural language processing (NLP) has advanced drug discovery. Molecule representation in language models (LMs) is crucial in enhancing chemical understanding. We propose Augmented Molecular Retrieval (AMORE), a flexible zero-shot framework for assessment of Chemistry LMs of different natures: trained solely on molecules for chemical tasks and on a combined corpus of natural language texts and string-based structures. The framework relies on molecule augmentations that preserve an underlying chemical, such as kekulization and cycle replacements. We evaluate encoder-only and generative LMs by calculating a metric based on the similarity score between distributed representations of molecules and their augmentations. Our experiments on ChEBI-20 and QM9 benchmarks show that these models exhibit significantly lower scores than graph-based molecular models trained without language modeling objectives. Additionally, our results on the molecule captioning task for cross-domain models, MolT5 and Text+Chem T5, demonstrate that the lower the representation-based evaluation metrics, the lower the classical text generation metrics like ROUGE and METEOR.
2022
Razmecheno: Named Entity Recognition from Digital Archive of Diaries “Prozhito”
Timofey Atnashev
|
Veronika Ganeeva
|
Roman Kazakov
|
Daria Matyash
|
Michael Sonkin
|
Ekaterina Voloshina
|
Oleg Serikov
|
Ekaterina Artemova
Proceedings of the Fifth International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia with a few exceptions, created from historical and literary texts. What is more, English is the main source for data for further labelling. This paper aims to fill in multiple gaps by creating a novel dataset “Razmecheno”, gathered from the diary texts of the project “Prozhito” in Russian. Our dataset is of interest for multiple research lines: literary studies of diary texts, transfer learning from other domains, low-resource or cross-lingual named entity recognition. Razmecheno comprises 1331 sentences and 14119 tokens, sampled from diaries, written during the Perestroika. The annotation schema consists of five commonly used entity tags: person, characteristics, location, organisation, and facility. The labelling is carried out on the crowdsourcing platfrom Yandex.Toloka in two stages. First, workers selected sentences, which contain an entity of particular type. Second, they marked up entity spans. As a result 1113 entities were obtained. Empirical evaluation of Razmecheno is carried out with off-the-shelf NER tools and by fine-tuning pre-trained contextualized encoders. We release the annotated dataset for open access.