Frank Drewes


2024

CIPHE: A Framework for Document Cluster Interpretation and Precision from Human Exploration
Anton Eklund | Mona Forsman | Frank Drewes
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities

Document clustering models serve unique application purposes, which turns model quality into a property that depends on the needs of the individual investigator. We propose a framework, Cluster Interpretation and Precision from Human Exploration (CIPHE), for collecting and quantifying human interpretations of cluster samples. CIPHE tasks survey participants to explore actual document texts from cluster samples and records their perceptions. It also includes a novel inclusion task that is used to calculate the cluster precision in an indirect manner. A case study on news clusters shows that CIPHE reveals which clusters have multiple interpretation angles, aiding the investigator in their exploration.

2023

Generation and Polynomial Parsing of Graph Languages with Non-Structural Reentrancies
Johanna Björklund | Frank Drewes | Anna Jonsson
Computational Linguistics, Volume 49, Issue 4 - December 2023

Graph-based semantic representations are popular in natural language processing, where it is often convenient to model linguistic concepts as nodes and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to describe languages of semantic graphs, while at the same time allowing efficient parsing. We contribute to this line of work by introducing graph extension grammar, a variant of the contextual hyperedge replacement grammars proposed by Hoffmann et al. Contextual hyperedge replacement can generate graphs with non-structural reentrancies, a type of node-sharing that is very common in formalisms such as abstract meaning representation, but that context-free types of graph grammars cannot model. To provide our formalism with a way to place reentrancies in a linguistically meaningful way, we endow rules with logical formulas in counting monadic second-order logic. We then present a parsing algorithm and show as our main result that this algorithm runs in polynomial time on graph languages generated by a subclass of our grammars, the so-called local graph extension grammars.

An Empirical Configuration Study of a Common Document Clustering Pipeline
Anton Eklund | Mona Forsman | Frank Drewes
Northern European Journal of Language Technology, Volume 9

Document clustering is frequently used in applications of natural language processing, e.g., to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no less than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings showed little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.

ADCluster: Adaptive Deep Clustering for Unsupervised Learning from Unlabeled Documents
Arezoo Hatefi | Xuan-Son Vu | Monowar Bhuyan | Frank Drewes
Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023)

2022

Improved N-Best Extraction with an Evaluation on Language Data
Johanna Björklund | Frank Drewes | Anna Jonsson
Computational Linguistics, Volume 48, Issue 1 - March 2022

We show that a previously proposed algorithm for the N-best trees problem can be made more efficient by changing how it arranges and explores the search space. Given an integer N and a weighted tree automaton (wta) M over the tropical semiring, the algorithm computes N trees of minimal weight with respect to M. Compared with the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor. The algorithm is implemented in the software Betty, and compared to the state-of-the-art algorithm for extracting the N best runs, implemented in the software toolkit Tiburon. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that Betty outperforms Tiburon on all tested data sets with respect to running time, while Tiburon seems to be the more memory-efficient choice.
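
To illustrate the problem statement only, the following minimal sketch computes the weight a toy wta over the tropical semiring assigns to a tree (the minimum over all runs of the sum of the rule weights used) and then picks the N lightest trees from a given finite candidate list. The transition table, final weights, tree encoding, and function names are all hypothetical, and the brute-force selection is a naive stand-in, not the lazy algorithm implemented in Betty.

```python
import heapq
from itertools import product

INF = float("inf")

# Hypothetical toy wta: (symbol, child states) -> {target state: rule weight}.
TRANSITIONS = {
    ("a", ()): {"q": 1.0, "p": 3.0},
    ("b", ()): {"q": 2.0},
    ("f", ("q", "q")): {"q": 0.5},
    ("f", ("p", "q")): {"q": 0.0},
}
FINAL = {"q": 0.0}  # hypothetical final weights; states not listed are non-final


def state_weights(tree):
    """Map each state to the minimal weight of a run on `tree` ending in it.

    A tree is a pair (symbol, tuple of subtrees). In the tropical semiring,
    weights along a run are added and alternative runs are combined by min.
    """
    symbol, children = tree
    child_tables = [state_weights(child) for child in children]
    best = {}
    # Try every assignment of states to the direct subtrees.
    for combo in product(*(table.items() for table in child_tables)):
        states = tuple(state for state, _ in combo)
        below = sum(weight for _, weight in combo)
        for target, rule_weight in TRANSITIONS.get((symbol, states), {}).items():
            total = below + rule_weight
            if total < best.get(target, INF):
                best[target] = total
    return best


def tree_weight(tree):
    """Weight of `tree`: minimum over runs that end in a final state."""
    table = state_weights(tree)
    return min((w + FINAL[s] for s, w in table.items() if s in FINAL), default=INF)


def n_best(candidates, n):
    """Naive baseline: the n lightest trees among an explicit candidate list."""
    return heapq.nsmallest(n, candidates, key=tree_weight)


A = ("a", ())
B = ("b", ())
CANDIDATES = [("f", (A, B)), ("f", (A, A)), B, A]
BEST_TWO = n_best(CANDIDATES, 2)
```

With this toy automaton, `tree_weight(("f", (A, A)))` takes the cheaper of the two possible runs. Enumerating candidates explicitly only works for a finite list; the point of an N-best algorithm is to avoid exactly that enumeration.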

Dynamic Topic Modeling by Clustering Embeddings from Pretrained Language Models: A Research Proposal
Anton Eklund | Mona Forsman | Frank Drewes
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop

A new trend in topic modeling research is to do Neural Topic Modeling by Clustering document Embeddings (NTM-CE) created with a pretrained language model. Studies have evaluated static NTM-CE models and found that they perform comparably to, or even better than, other topic models. An important extension of static topic modeling is making the models dynamic, allowing the study of topic evolution over time, as well as detecting emerging and disappearing topics. In this research proposal, we present two research questions to understand dynamic topic modeling with NTM-CE theoretically and practically. To answer these, we propose four phases with the aim of establishing evaluation methods for dynamic topic modeling, finding NTM-CE-specific properties, and creating a framework for dynamic NTM-CE. For evaluation, we propose to use both quantitative measurements of coherence and human evaluation supported by our recently developed tool.

2021

Bridging Perception, Memory, and Inference through Semantic Relations
Johanna Björklund | Adam Dahlgren Lindström | Frank Drewes
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

There is a growing consensus that surface form alone does not enable models to learn meaning and gain language understanding. This warrants an interest in hybrid systems that combine the strengths of neural and symbolic methods. We favour triadic systems consisting of neural networks, knowledge bases, and inference engines. The network provides perception, that is, the interface between the system and its environment. The knowledge base provides explicit memory and thus immediate access to established facts. Finally, inference capabilities are provided by the inference engine which reflects on the perception, supported by memory, to reason and discover new facts. In this work, we probe six popular language models for semantic relations and outline a future line of research to study how the constituent subsystems can be jointly realised and integrated.

Proceedings of the 17th Meeting on the Mathematics of Language
Henrik Björklund | Frank Drewes
Proceedings of the 17th Meeting on the Mathematics of Language

2020

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case
Adam Dahlgren Lindström | Johanna Björklund | Suna Bensch | Frank Drewes
Proceedings of the 28th International Conference on Computational Linguistics

Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic embeddings comes from the distillation and enrichment of information through machine learning, their inner workings are poorly understood and there is a shortage of analysis tools. To address this problem, we generalize the notion of probing tasks to the visual-semantic case. To this end, we (i) discuss the formalization of probing tasks for embeddings of image-caption pairs, (ii) define three concrete probing tasks within our general framework, (iii) train classifiers to probe for those properties, and (iv) compare various state-of-the-art embeddings under the lens of the proposed probing tasks. Our experiments reveal an increase of up to 16% in accuracy on visual-semantic embeddings compared to the corresponding unimodal embeddings, which suggests that the text and image dimensions represented in the former do complement each other.

2019

A Survey of Recent Advances in Efficient Parsing for Graph Grammars
Frank Drewes
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

Bottom-Up Unranked Tree-to-Graph Transducers for Translation into Semantic Graphs
Johanna Björklund | Shay B. Cohen | Frank Drewes | Giorgio Satta
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

We propose a formal model for translating unranked syntactic trees, such as dependency trees, into semantic graphs. These tree-to-graph transducers can serve as a formal basis of transition systems for semantic parsing which recently have been shown to perform very well, yet hitherto lack formalization. Our model features “extended” rules and an arc-factored normal form, comes with an efficient translation algorithm, and can be equipped with weights in a straightforward manner.

Proceedings of the 16th Meeting on the Mathematics of Language
Philippe de Groote | Frank Drewes | Gerald Penn
Proceedings of the 16th Meeting on the Mathematics of Language

Parsing Weighted Order-Preserving Hyperedge Replacement Grammars
Henrik Björklund | Frank Drewes | Petter Ericson
Proceedings of the 16th Meeting on the Mathematics of Language

2018

Weighted DAG Automata for Semantic Graphs
David Chiang | Frank Drewes | Daniel Gildea | Adam Lopez | Giorgio Satta
Computational Linguistics, Volume 44, Issue 1 - April 2018

Graphs have a variety of uses in natural language processing, particularly as representations of linguistic meaning. A deficit in this area of research is a formal framework for creating, combining, and using models involving graphs that parallels the frameworks of finite automata for strings and finite tree automata for trees. A possible starting point for such a framework is the formalism of directed acyclic graph (DAG) automata, defined by Kamimura and Slutzki and extended by Quernheim and Knight. In this article, we study the latter in depth, demonstrating several new results, including a practical recognition algorithm that can be used for inference and learning with models defined on DAG automata. We also propose an extension to graphs with unbounded node degree and show that our results carry over to the extended formalism.

2017

DAG Automata for Meaning Representation
Frank Drewes
Proceedings of the 15th Meeting on the Mathematics of Language

Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)
Frank Drewes
Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing (FSMNLP 2017)

Single-Rooted DAGs in Regular DAG Languages: Parikh Image and Path Languages
Martin Berglund | Henrik Björklund | Frank Drewes
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms

Contextual Hyperedge Replacement Grammars for Abstract Meaning Representations
Frank Drewes | Anna Jonsson
Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms

2016

EM-Training for Weighted Aligned Hypergraph Bimorphisms
Frank Drewes | Kilian Gebhardt | Heiko Vogler
Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata

2013

On the Parameterized Complexity of Linear Context-Free Rewriting Systems
Martin Berglund | Henrik Björklund | Frank Drewes
Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13)

2012

Proceedings of the Workshop on Applications of Tree Automata Techniques in Natural Language Processing
Frank Drewes | Marco Kuhlmann
Proceedings of the Workshop on Applications of Tree Automata Techniques in Natural Language Processing

2011

Incremental Construction of Millstream Configurations Using Graph Transformation
Suna Bensch | Frank Drewes | Helmut Jürgensen | Brink van der Merwe
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing

2010

Proceedings of the 2010 Workshop on Applications of Tree Automata in Natural Language Processing
Frank Drewes | Marco Kuhlmann
Proceedings of the 2010 Workshop on Applications of Tree Automata in Natural Language Processing

Millstream Systems – a Formal Model for Linking Language Modules by Interfaces
Suna Bensch | Frank Drewes
Proceedings of the 2010 Workshop on Applications of Tree Automata in Natural Language Processing