Zhengkai Tu
2023
Predictive Chemistry Augmented with Text Retrieval
Yujie Qian | Zhening Li | Zhengkai Tu | Connor Coley | Regina Barzilay
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
This paper focuses on using natural language descriptions to enhance predictive models in the chemistry field. Conventionally, chemoinformatics models are trained with extensive structured data manually extracted from the literature. In this paper, we introduce TextReact, a novel method that directly augments predictive chemistry with texts retrieved from the literature. TextReact retrieves text descriptions relevant for a given chemical reaction, and then aligns them with the molecular representation of the reaction. This alignment is enhanced via an auxiliary masked LM objective incorporated in the predictor training. We empirically validate the framework on two chemistry tasks: reaction condition recommendation and one-step retrosynthesis. By leveraging text retrieval, TextReact significantly outperforms state-of-the-art chemoinformatics models trained solely on molecular data.
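The retrieve-then-combine idea in the abstract above can be illustrated with a toy sketch. This is not the TextReact implementation: a simple token-overlap ranker stands in for whatever retriever the paper uses, and the names `overlap_score`, `retrieve`, `build_input`, the `[SEP]` separator, and the example reaction string are all assumptions made for illustration.

```python
from typing import List


def overlap_score(query: str, text: str) -> int:
    """Count distinct lowercase tokens shared by query and text (toy relevance score)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k corpus texts with the highest overlap score.

    sorted() is stable, so ties keep their original corpus order.
    """
    ranked = sorted(corpus, key=lambda t: overlap_score(query, t), reverse=True)
    return ranked[:k]


def build_input(reaction: str, texts: List[str], sep: str = " [SEP] ") -> str:
    """Join the reaction string and the retrieved texts into one predictor input."""
    return sep.join([reaction] + texts)


# Hypothetical mini-corpus and query, purely illustrative.
corpus = [
    "Suzuki coupling with palladium catalyst in THF",
    "Amide formation using EDC and HOBt in DMF",
    "Hydrogenation over palladium on carbon",
]
neighbors = retrieve("palladium catalyst coupling", corpus, k=1)
model_input = build_input("CC(=O)O.CO>>CC(=O)OC", neighbors)
```

In this sketch the predictor would consume `model_input` as a single sequence; how the molecular and textual parts are actually encoded and aligned is described in the paper, not here.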
2021
Don’t Change Me! User-Controllable Selective Paraphrase Generation
Mohan Zhang | Luchen Tan | Zihang Fu | Kun Xiong | Jimmy Lin | Ming Li | Zhengkai Tu
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
In the paraphrase generation task, source sentences often contain phrases that should not be altered. Which phrases these are, however, can be context-dependent and can vary by application. Our solution to this challenge is to provide the user with explicit tags that can be placed around any arbitrary segment of text to mean “don’t change me!” when generating a paraphrase; the model learns to explicitly copy these phrases to the output. The contribution of this work is a novel data generation technique using distant supervision that allows us to start with a pretrained sequence-to-sequence model and fine-tune a paraphrase generator that exhibits this behavior, allowing user-controllable paraphrase generation. Additionally, we modify the loss during fine-tuning to explicitly encourage diversity in model output. Our technique is language-agnostic, and we report experiments in English and Chinese.
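The tagging interface described above can be sketched in a few lines. The tag strings `<keep>` and `</keep>` and the helper names are hypothetical, since the abstract does not name the actual markers; the sketch only shows how a user might mark protected spans and how one could check that a generated paraphrase copied them verbatim. It does not model the generator itself.

```python
import re
from typing import List

# Hypothetical tag strings; the abstract does not specify the real markers.
OPEN, CLOSE = "<keep>", "</keep>"
_SPAN = re.compile(re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE), re.DOTALL)


def protected_spans(tagged: str) -> List[str]:
    """Return the text of every span wrapped in the protection tags."""
    return _SPAN.findall(tagged)


def strip_tags(tagged: str) -> str:
    """Remove the tag markers, leaving the plain source sentence."""
    return tagged.replace(OPEN, "").replace(CLOSE, "")


def preserves_protected(tagged_source: str, paraphrase: str) -> bool:
    """Check that every protected span appears verbatim in the paraphrase."""
    return all(span in paraphrase for span in protected_spans(tagged_source))


source = "The <keep>Eiffel Tower</keep> is in Paris."
good = "Paris is home to the Eiffel Tower."
bad = "Paris is home to the famous tower."
```

A check like `preserves_protected(source, good)` could serve as a simple post-hoc test of whether a paraphrase respected the user's tags.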