Karina Leoni Borimann
2024
A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites
Anna Hätty | Dragan Milchevski | Kersten Döring | Marko Putnikovic | Mohsen Mesgar | Filip Novović | Maximilian Braun | Karina Leoni Borimann | Igor Stranjanac
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Extracting product information is crucial for informed business decisions and strategic planning across multiple industries. However, recent methods that rely only on large language models (LLMs) are resource-intensive and computationally prohibitive, because website structures differ and company websites contain numerous non-product pages. To address these challenges, we propose a novel modular method that uses low-cost classification models to filter out non-product pages from company websites, significantly reducing computational costs. Our approach consists of three modules: web page crawling, product page classification using efficient machine learning models, and product information extraction using LLMs on the pages classified as product pages. We evaluate our method on a new dataset of about 7000 product and non-product web pages, achieving a 6-point improvement in F1-score, a 95% reduction in computational time, and an 87.5% reduction in cost compared to end-to-end LLMs. Our research demonstrates the effectiveness of the proposed low-cost classification module in identifying web pages that contain product information, making product information extraction more effective and cost-efficient.
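The three-module structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Page` type, the keyword list `PRODUCT_CUES`, the threshold, and the `fake_llm_extract` stub are all hypothetical stand-ins, and the keyword counter merely takes the place of the trained low-cost classifier the paper refers to. Crawling is assumed to have already produced the list of pages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Page:
    url: str
    text: str


# Hypothetical cue words; a stand-in for a trained low-cost classifier.
PRODUCT_CUES = ("product", "specification", "datasheet", "price", "order")


def is_product_page(page: Page, threshold: int = 2) -> bool:
    """Cheap filter: count how many distinct cue words occur in the page text."""
    text = page.text.lower()
    hits = sum(1 for cue in PRODUCT_CUES if cue in text)
    return hits >= threshold


def sieve(
    pages: Iterable[Page],
    classify: Callable[[Page], bool],
    extract: Callable[[Page], dict],
) -> Dict[str, dict]:
    """Run the expensive extractor only on pages the cheap classifier accepts."""
    results = {}
    for page in pages:
        if classify(page):
            results[page.url] = extract(page)
    return results


llm_calls = []  # records which pages reached the (simulated) expensive step


def fake_llm_extract(page: Page) -> dict:
    """Placeholder for an LLM call (assumption): returns the first line as the name."""
    llm_calls.append(page.url)
    return {"name": page.text.splitlines()[0].strip()}


# Pages as they might come out of the crawling module (invented examples).
pages = [
    Page("https://example.com/about", "About us\nWe are a family company."),
    Page("https://example.com/pump-x", "Pump X\nProduct specification and price list."),
]

extracted = sieve(pages, is_product_page, fake_llm_extract)
print(extracted)
print(f"LLM calls: {len(llm_calls)} of {len(pages)} pages")
```

In this sketch only the second page reaches the extraction step, which is the source of the savings the abstract reports: the cost of the expensive module scales with the number of pages that pass the filter rather than with the size of the crawl.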
Co-authors
- Maximilian Braun 1
- Kersten Döring 1
- Anna Hätty 1
- Mohsen Mesgar 1
- Dragan Milchevski 1