Dragan Milchevski


2024

A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites
Anna Hätty | Dragan Milchevski | Kersten Döring | Marko Putnikovic | Mohsen Mesgar | Filip Novović | Maximilian Braun | Karina Leoni Borimann | Igor Stranjanac
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

Extracting product information is crucial for informed business decisions and strategic planning across multiple industries. However, recent methods relying only on large language models (LLMs) are resource-intensive and computationally prohibitive due to website structure differences and numerous non-product pages. To address these challenges, we propose a novel modular method that leverages low-cost classification models to filter out company web pages, significantly reducing computational costs. Our approach consists of three modules: web page crawling, product page classification using efficient machine learning models, and product information extraction using LLMs on classified product pages. We evaluate our method on a new dataset of about 7000 product and non-product web pages, achieving a 6-point improvement in F1-score, 95% reduction in computational time, and 87.5% reduction in cost compared to end-to-end LLMs. Our research demonstrates the effectiveness of our proposed low-cost classification module to identify web pages containing product information, making product information extraction more effective and cost-efficient.

AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports
Lukas Lange | Marc Müller | Ghazaleh Haratinezhad Torbati | Dragan Milchevski | Patrick Grau | Subhash Chandra Pujari | Annemarie Friedrich
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely-used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out-of-context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.

2022

A Study on Entity Linking Across Domains: Which Data is Best for Fine-Tuning?
Hassan Soliman | Heike Adel | Mohamed H. Gad-Elrab | Dragan Milchevski | Jannik Strötgen
Proceedings of the 7th Workshop on Representation Learning for NLP

Entity linking disambiguates mentions by mapping them to entities in a knowledge graph (KG). One important question in today’s research is how to extend neural entity linking systems to new domains. In this paper, we aim at a system that enables linking mentions to entities from a general-domain KG and a domain-specific KG at the same time. In particular, we represent the entities of different KGs in a joint vector space and address the questions of which data is best suited for creating and fine-tuning that space, and whether fine-tuning harms performance on the general domain. We find that a combination of data from both the general and the special domain is most helpful. The first is especially necessary for avoiding performance loss on the general domain. While additional supervision on entities that appear in both KGs performs best in an intrinsic evaluation of the vector space, it has less impact on the downstream task of entity linking.

2016

DeepLife: An Entity-aware Search, Analytics and Exploration Platform for Health and Life Sciences
Patrick Ernst | Amy Siu | Dragan Milchevski | Johannes Hoffart | Gerhard Weikum
Proceedings of ACL-2016 System Demonstrations