Filip Novović


2024

A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites
Anna Hätty | Dragan Milchevski | Kersten Döring | Marko Putnikovic | Mohsen Mesgar | Filip Novović | Maximilian Braun | Karina Leoni Borimann | Igor Stranjanac
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Extracting product information is crucial for informed business decisions and strategic planning across multiple industries. However, recent methods relying only on large language models (LLMs) are resource-intensive and computationally prohibitive due to website structure differences and numerous non-product pages. To address these challenges, we propose a novel modular method that leverages low-cost classification models to filter out non-product company web pages, significantly reducing computational costs. Our approach consists of three modules: web page crawling, product page classification using efficient machine learning models, and product information extraction using LLMs on classified product pages. We evaluate our method on a new dataset of about 7000 product and non-product web pages, achieving a 6-point improvement in F1-score, a 95% reduction in computational time, and an 87.5% reduction in cost compared to end-to-end LLMs. Our research demonstrates the effectiveness of our proposed low-cost classification module in identifying web pages that contain product information, making product information extraction more effective and cost-efficient.
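
The gating idea in the abstract can be illustrated with a minimal sketch: a cheap classifier decides which crawled pages reach the expensive extraction step. This is not the authors' implementation. The `Page` type, the keyword cues, the `min_hits` threshold, and the `CountingExtractor` stub (which stands in for an LLM call) are all hypothetical names chosen for illustration, and a trained low-cost model would replace the keyword check in practice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Page:
    url: str
    text: str


# Hypothetical cue words; a stand-in for a trained low-cost classifier.
PRODUCT_CUES = ("price", "add to cart", "specifications", "sku")


def cheap_is_product_page(page: Page, min_hits: int = 2) -> bool:
    """Flag a page as a product page if enough cue words appear in it."""
    text = page.text.lower()
    hits = sum(cue in text for cue in PRODUCT_CUES)
    return hits >= min_hits


def sieve(
    pages: Iterable[Page],
    classify: Callable[[Page], bool],
    extract: Callable[[Page], dict],
) -> Dict[str, dict]:
    """Run the expensive extractor only on pages the classifier accepts."""
    results = {}
    for page in pages:
        if classify(page):
            results[page.url] = extract(page)
    return results


class CountingExtractor:
    """Stub for the costly extraction step; counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, page: Page) -> dict:
        self.calls += 1
        return {"source": page.url, "chars": len(page.text)}


# Assumed input: pages already fetched by a crawler (module one).
pages = [
    Page("https://example.com/about", "Our company history and team."),
    Page("https://example.com/widget",
         "Widget X. Price: 10 EUR. Specifications: steel. SKU 123."),
    Page("https://example.com/contact", "Contact us by email."),
]

extractor = CountingExtractor()
products = sieve(pages, cheap_is_product_page, extractor)
print(list(products), extractor.calls)
```

In this toy run the extractor is invoked for one of three pages, which is where the savings in such a design would come from: the cost of the expensive step scales with the number of pages the classifier lets through, not with the size of the crawl.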