Katsiaryna Mirylenka


2024

pdf bib
Adapting LLMs for Structured Natural Language API Integration
Robin Chan | Katsiaryna Mirylenka | Thomas Gschwind | Christoph Miksovic | Paolo Scotton | Enrico Toniato | Abdel Labbi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

API integration is crucial for enterprise systems, as it enables seamless interaction between applications within workflows. However, the diversity and complexity of the API landscape present significant challenges in combining API calls based on user intent.Existing methods rely on named entity recognition (NER) and knowledge graphs, but struggle to generate more complex control flow structures, such as conditionals and loops.We propose a novel framework that leverages the success of large language models (LLMs) in code generation to integrate APIs based on natural language input. Our approach involves fine-tuning an LLM using automatically generated API flows derived from OpenAPI specifications.We further evaluate the effectiveness of enforcing the syntax and schema adherence through constrained decoding.To enable systematic comparison, we introduce targeted test suites to assess the generalization capabilities of these approaches and their ability to retain structured knowledge.Our findings show that LLMs fine-tuned on OpenAPI specifications can (a) learn structural API constraints implicitly during training, and (b) achieve significant improvements in both in-distribution and out-of-distribution performance over NER and retrieval-augmented generation (RAG)-based approaches.

2023

pdf bib
Reinforced Active Learning for Low-Resource, Domain-Specific, Multi-Label Text Classification
Lukas Wertz | Jasmina Bogojeska | Katsiaryna Mirylenka | Jonas Kuhn
Findings of the Association for Computational Linguistics: ACL 2023

Text classification datasets from specialised or technical domains are in high demand, especially in industrial applications. However, due to the high cost of annotation such datasets are usually expensive to create. While Active Learning (AL) can reduce the labeling cost, required AL strategies are often only tested on general knowledge domains and tend to use information sources that are not consistent across tasks. We propose Reinforced Active Learning (RAL) to train a Reinforcement Learning policy that utilizes many different aspects of the data and the task in order to select the most informative unlabeled subset dynamically over the course of the AL procedure. We demonstrate the superior performance of the proposed RAL framework compared to strong AL baselines across four intricate multi-class, multi-label text classification datasets taken from specialised domains. In addition, we experiment with a unique data augmentation approach to further reduce the number of samples RAL needs to annotate.

2022

pdf bib
Investigating Active Learning Sampling Strategies for Extreme Multi Label Text Classification
Lukas Wertz | Katsiaryna Mirylenka | Jonas Kuhn | Jasmina Bogojeska
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Large scale, multi-label text datasets with high numbers of different classes are expensive to annotate, even more so if they deal with domain specific language. In this work, we aim to build classifiers on these datasets using Active Learning in order to reduce the labeling effort. We outline the challenges when dealing with extreme multi-label settings and show the limitations of existing Active Learning strategies by focusing on their effectiveness as well as efficiency in terms of computational cost. In addition, we present five multi-label datasets which were compiled from hierarchical classification tasks to serve as benchmarks in the context of extreme multi-label classification for future experiments. Finally, we provide insight into multi-class, multi-label evaluation and present an improved classifier architecture on top of pre-trained transformer language models.

pdf bib
Evaluating Pre-Trained Sentence-BERT with Class Embeddings in Active Learning for Multi-Label Text Classification
Lukas Wertz | Jasmina Bogojeska | Katsiaryna Mirylenka | Jonas Kuhn
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The Transformer Language Model is a powerful tool that has been shown to excel at various NLP tasks and has become the de-facto standard solution thanks to its versatility. In this study, we employ pre-trained document embeddings in an Active Learning task to group samples with the same labels in the embedding space on a legal document corpus. We find that the calculated class embeddings are not close to the respective samples and consequently do not partition the embedding space in a meaningful way. In addition, we explore using the class embeddings as an Active Learning strategy with dramatically reduced results compared to all baselines.