A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites

Anna Hätty, Dragan Milchevski, Kersten Döring, Marko Putnikovic, Mohsen Mesgar, Filip Novović, Maximilian Braun, Karina Leoni Borimann, Igor Stranjanac


Abstract
Extracting product information is crucial for informed business decisions and strategic planning across multiple industries. However, recent methods relying only on large language models (LLMs) are resource-intensive and computationally prohibitive due to website structure differences and numerous non-product pages. To address these challenges, we propose a novel modular method that leverages low-cost classification models to filter out non-product pages from company websites, significantly reducing computational costs. Our approach consists of three modules: web page crawling, product page classification using efficient machine learning models, and product information extraction using LLMs on the pages classified as product pages. We evaluate our method on a new dataset of about 7,000 product and non-product web pages, achieving a 6-point improvement in F1-score, a 95% reduction in computational time, and an 87.5% reduction in cost compared to end-to-end LLMs. Our research demonstrates the effectiveness of the proposed low-cost classification module in identifying web pages that contain product information, making product information extraction more effective and cost-efficient.
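The three-module design described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Page` type, the keyword-based `keyword_product_classifier`, and the `sieve` function are hypothetical names introduced here, the keyword rule merely stands in for whatever low-cost classifier is actually trained, and the `extract` callable stands in for an LLM call. The only point it shows is that the expensive extractor runs solely on pages the cheap classifier accepts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Page:
    """A crawled web page (hypothetical container for this sketch)."""
    url: str
    text: str


def keyword_product_classifier(page: Page) -> bool:
    """Assumed stand-in for the low-cost product page classifier.

    A real system would use a trained model; a keyword rule keeps the
    sketch self-contained.
    """
    cues = ("price", "add to cart", "specifications", "datasheet")
    text = page.text.lower()
    return any(cue in text for cue in cues)


def sieve(
    pages: Iterable[Page],
    is_product_page: Callable[[Page], bool],
    extract: Callable[[Page], dict],
) -> Dict[str, dict]:
    """Run the costly extractor only on pages the classifier accepts."""
    results: Dict[str, dict] = {}
    for page in pages:
        if is_product_page(page):
            # Only classified product pages reach the expensive step.
            results[page.url] = extract(page)
    return results
```

In such a setup, the cost saving comes from the number of `extract` calls avoided, so the classifier's recall on product pages bounds how much product information the pipeline can recover.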
Anthology ID:
2024.emnlp-industry.106
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Month:
November
Year:
2024
Address:
Miami, Florida, US
Editors:
Franck Dernoncourt, Daniel Preoţiuc-Pietro, Anastasia Shimorina
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1444–1456
URL:
https://aclanthology.org/2024.emnlp-industry.106
Cite (ACL):
Anna Hätty, Dragan Milchevski, Kersten Döring, Marko Putnikovic, Mohsen Mesgar, Filip Novović, Maximilian Braun, Karina Leoni Borimann, and Igor Stranjanac. 2024. A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1444–1456, Miami, Florida, US. Association for Computational Linguistics.
Cite (Informal):
A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites (Hätty et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-industry.106.pdf