Shaobai Jiang
2024
Structured Object Language Modeling (SO-LM): Native Structured Objects Generation Conforming to Complex Schemas with Self-Supervised Denoising
Amir Tavanaei
|
Kee Kiat Koo
|
Hayreddin Ceker
|
Shaobai Jiang
|
Qi Li
|
Julien Han
|
Karim Bouyarmane
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
In this paper, we study the problem of generating structured objects that conform to a complex schema, with intricate dependencies between the different components (facets) of the object. The facets of the object (attributes, fields, columns, properties) can be a mix of short, structured facts, or long natural-language descriptions. The object has to be self-consistent between the different facets in the redundant information it carries (relative consistency), while being grounded with respect to world knowledge (absolute consistency). We frame the problem as a Language Modeling problem (Structured Object Language Modeling) and train an LLM to perform the task natively, without requiring instructions or prompt-engineering. We propose a self-supervised denoising method to train the model from an existing dataset of such objects. The input query can be the existing object itself, in which case the system acts as a regenerator, completing, correcting, normalizing the input, or any unstructured blurb to be structured. We show that the self-supervised denoising training provides a strong baseline, and that additional supervised fine-tuning with small amount of human demonstrations leads to further improvement. Experimental results show that the proposed method matches or outperforms prompt-engineered general-purpose state-of-the-art LLMs (Claude 3, Mixtral-8x7B), while being order-of-magnitude more cost-efficient.
Search
Co-authors
- Amir Tavanaei 1
- Kee Kiat Koo 1
- Hayreddin Ceker 1
- Qi Li 1
- Julien Han 1
- show all...