LATEX-Numeric: Language Agnostic Text Attribute Extraction for Numeric Attributes

Kartik Mehta, Ioana Oprea, Nikhil Rasiwasia


Abstract
In this paper, we present LATEX-Numeric, a high-precision, fully automated, scalable framework for extracting E-commerce numeric attributes from unstructured product text such as product descriptions. Most past work on attribute extraction is not scalable because it relies on manually curated training data, with or without active learning. We rely on distant supervision for training data generation, removing the dependency on manual labels. One issue with distant supervision is that it leads to incomplete training annotation, because attribute values can be missing during matching. We propose a multi-task learning architecture to deal with missing labels in the training data, leading to an F1 improvement of 9.2% for numeric attributes over a state-of-the-art single-task architecture. While the multi-task architecture benefits both numeric and non-numeric attributes, we present automated techniques to further improve the numeric attribute extraction models. Numeric attributes require a list of units (or aliases) for better matching with distant supervision. We propose an automated algorithm for alias creation using unstructured text and attribute values, leading to a 20.2% F1 improvement. Extensive experiments on real-world datasets for 20 numeric attributes across 5 product categories and 3 English marketplaces show that LATEX-Numeric achieves a high F1-score without any manual intervention, making it suitable for practical applications. Finally, we show that the improvements are language-agnostic: LATEX-Numeric achieves a 13.9% F1 improvement for 3 non-English languages.
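To make the role of unit aliases in distant supervision concrete, here is a minimal sketch. It is not the authors' implementation: the function name `bio_tags`, the whitespace tokenization, the B/I/O tagging scheme, and the example product text are all assumptions made for illustration. It tags a number as an attribute value only when it equals the known catalog value and is followed by a known unit alias.

```python
def bio_tags(tokens, value, unit_aliases):
    """Hypothetical distant-supervision labeler (illustrative only).

    Marks a token pair as B/I when a number equal to the known
    attribute `value` is immediately followed by a known unit alias.
    All other tokens are tagged O.
    """
    aliases = {a.lower() for a in unit_aliases}
    target = float(value)
    tags = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        try:
            number = float(tokens[i])
        except ValueError:
            continue  # not a numeric token
        if number == target and tokens[i + 1].lower() in aliases:
            tags[i] = "B"
            tags[i + 1] = "I"
    return tags


# Assumed example: catalog says wattage = 1200 for this product text.
tokens = "Powerful 1200 watts motor with 2 speed settings".split()

# With a complete alias list, the mention is labeled.
print(bio_tags(tokens, "1200", ["watt", "watts", "w"]))
# ['O', 'B', 'I', 'O', 'O', 'O', 'O', 'O']

# With an incomplete alias list, the mention is silently missed,
# which yields an incomplete training annotation.
print(bio_tags(tokens, "1200", ["watt"]))
# ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```

The second call shows why alias coverage matters: a missing alias turns a true mention into an unlabeled one, which is the kind of incomplete annotation the abstract describes.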
Anthology ID:
2021.naacl-industry.34
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers
Month:
June
Year:
2021
Address:
Online
Editors:
Young-bum Kim, Yunyao Li, Owen Rambow
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
272–279
URL:
https://aclanthology.org/2021.naacl-industry.34
DOI:
10.18653/v1/2021.naacl-industry.34
Cite (ACL):
Kartik Mehta, Ioana Oprea, and Nikhil Rasiwasia. 2021. LATEX-Numeric: Language Agnostic Text Attribute Extraction for Numeric Attributes. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers, pages 272–279, Online. Association for Computational Linguistics.
Cite (Informal):
LATEX-Numeric: Language Agnostic Text Attribute Extraction for Numeric Attributes (Mehta et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-industry.34.pdf
Video:
https://aclanthology.org/2021.naacl-industry.34.mp4