Tollef Emil Jørgensen
2025
Enhancing Criminal Investigation Analysis with Summarization and Memory-based Retrieval-Augmented Generation: A Comprehensive Evaluation of Real Case Data
Mads Skipanes | Tollef Emil Jørgensen | Kyle Porter | Gianluca Demartini | Sule Yildirim Yayilgan
Proceedings of the 31st International Conference on Computational Linguistics
This study introduces KriRAG, a novel Retrieval-Augmented Generation (RAG) architecture designed to assist criminal investigators in analyzing information and overcoming the challenge of information overload. KriRAG structures and summarizes extensive document collections based on existing investigative queries, providing relevant document references and detailed answers for each query. Using unstructured data from two homicide case files comprising approximately 3,700 documents and 13,000 pages, we establish a comprehensive evaluation methodology incorporating semantic retrieval, scoring, reasoning, and query-response accuracy. The system's outputs are evaluated against queries and answers provided by criminal investigators, demonstrating promising performance: 97.5% accuracy in relevance assessment and 77.5% accuracy for query responses. These findings provide a rigorous foundation for other query-oriented and open-ended retrieval applications. KriRAG is designed to run offline on limited hardware, ensuring sensitive data handling and on-device availability.
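The retrieval-and-scoring step the abstract describes can be sketched in outline. The actual KriRAG pipeline, its embedding models, and its thresholds are not specified here; this is a minimal, dependency-free TF-IDF relevance ranking over a document collection, standing in for the semantic retrieval stage:

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenization; punctuation-bearing tokens are dropped."""
    return [t for t in text.lower().split() if t.isalnum()]

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for each document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf})
    return vectors, df, n

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_documents(query, docs, top_k=3):
    """Return indices of the top_k documents most relevant to the query."""
    vectors, df, n = tfidf_vectors(docs)
    q_tf = Counter(tokenize(query))
    q_vec = {t: q_tf[t] * math.log((1 + n) / (1 + df.get(t, 0))) for t in q_tf}
    scored = sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)),
                    reverse=True)
    return [i for score, i in scored[:top_k] if score > 0]
```

In a RAG system the ranked references would then be passed, with the investigative query, to a generator for scoring and answer synthesis; that stage is omitted here.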
Cross-Lingual Sentence Compression for Length-Constrained Subtitles in Low-Resource Settings
Tollef Emil Jørgensen | Ole Jakob Mengshoel
Proceedings of the 31st International Conference on Computational Linguistics
This paper explores the joint task of machine translation and sentence compression, emphasizing its application in subtitle generation for broadcast and live media for low-resource languages and hardware. We develop CLSC (Cross-Lingual Sentence Compression), a system trained on openly available parallel corpora organized by compression ratios, where the target length is constrained to a fraction of the source sentence length. We present two training methods: 1) Multiple Models (MM), where individual models are trained separately for each compression ratio, and 2) a Controllable Model (CM), a single model per language using a compression token to encode length constraints. We evaluate on both subtitle data and transcriptions from the EuroParl corpus. To accommodate low-resource settings, we constrain data sampling for training and show results for transcriptions in French, Hungarian, Lithuanian, and Polish and subtitles in Albanian, Basque, Malay, and Norwegian. Our models preserve semantic meaning well under compression, as reflected in metric evaluations.
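The controllable-model (CM) setup can be illustrated with a small sketch. The token names (e.g. `<cr_50>`) and the bucket scheme below are hypothetical, not the paper's actual vocabulary; the idea shown is only that a discrete control token, derived from each training pair's observed compression ratio, is prefixed to the source so a single model learns length control:

```python
# Illustrative compression-ratio buckets (percent of source length).
BUCKETS = (25, 50, 75, 100)

def compression_token(src_len, tgt_len):
    """Map an observed target/source length ratio to the nearest bucket token."""
    ratio = 100 * tgt_len / src_len
    bucket = min(BUCKETS, key=lambda b: abs(b - ratio))
    return f"<cr_{bucket}>"

def prepare_example(src, tgt):
    """Prefix the source with the control token for this pair's compression ratio."""
    token = compression_token(len(src.split()), len(tgt.split()))
    return f"{token} {src}", tgt
```

At inference time, the desired subtitle length would be requested by choosing the token directly (e.g. prefixing `<cr_50>` to ask for roughly half-length output), rather than deriving it from a reference target.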