2022
Optimizing singular value based similarity measures for document similarity comparisons
Jarkko Lagus | Arto Klami
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)
Second-order Document Similarity Metrics for Transformers
Jarkko Lagus | Niki Loppi | Arto Klami
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)
2021
Learning to Lemmatize in the Word Representation Space
Jarkko Lagus | Arto Klami
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
Lemmatization is often used with morphologically rich languages to address issues caused by morphological complexity, and is typically performed by grammar-based lemmatizers. We propose an alternative in the form of a tool that performs lemmatization in the space of word embeddings. Word embeddings as distributed representations natively encode some information about the relationship between base and inflected forms, and we show that it is possible to learn a transformation that approximately maps the embeddings of inflected forms to the embeddings of the corresponding lemmas. This facilitates an alternative processing pipeline that replaces traditional lemmatization with the lemmatizing transformation in downstream processing for any application. We demonstrate the method on the Finnish language, outperforming traditional lemmatizers in an example task of document similarity comparison, but the approach is language independent and can be trained for new languages with mild requirements.
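The idea of learning a map from inflected-form embeddings to lemma embeddings can be illustrated with a minimal sketch. The abstract does not state the form of the transformation; the linear least-squares map below, and the names `fit_linear_map` and `map_to_lemma_space`, are assumptions made for illustration and are not the authors' implementation.

```python
import numpy as np


def fit_linear_map(inflected, lemmas):
    """Fit W minimizing ||inflected @ W - lemmas||_F (assumed linear model).

    inflected, lemmas: arrays of shape (n_pairs, dim); row i of each is the
    embedding of an inflected form and of its lemma, respectively.
    """
    W, *_ = np.linalg.lstsq(inflected, lemmas, rcond=None)
    return W


def map_to_lemma_space(vectors, W):
    """Apply the learned map to one embedding (dim,) or a batch (n, dim)."""
    return vectors @ W
```

In a downstream pipeline, the mapped vectors would stand in for the embeddings of lemmatized tokens, so no explicit string-level lemmatizer is needed.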
2019
Low-Rank Approximations of Second-Order Document Representations
Jarkko Lagus | Janne Sinkkonen | Arto Klami
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Document embeddings, created with methods ranging from simple heuristics to statistical and deep models, are widely applicable. Bag-of-vectors models for documents include the mean and quadratic approaches (Torki, 2018). We present evidence that quadratic statistics alone, without the mean information, can offer superior accuracy, fast document comparison, and compact document representations. In matching news articles to their comment threads, low-rank representations of only 3-4 times the size of the mean vector give the most accurate matching, and in standard sentence comparison tasks, results are state of the art despite faster computation. Similarity measures are discussed, and the Frobenius product implicit in the proposed method is contrasted with the Wasserstein or Bures metric from transportation theory. We also briefly demonstrate matching of unordered word lists to documents, to measure topicality or sentiment of documents.
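A rough sketch of how a low-rank quadratic representation and a Frobenius-product similarity could be computed is given below. The abstract does not specify the exact construction; the uncentered second-moment matrix, the truncated SVD, the cosine-style normalization, and all function names here are assumptions chosen for illustration.

```python
import numpy as np


def second_moment(X):
    """Uncentered second-moment matrix (d x d) of word vectors X (n x d)."""
    return X.T @ X / X.shape[0]


def low_rank_factor(X, k):
    """Return A (d x k) such that A @ A.T is the best rank-k approximation
    of second_moment(X), obtained from the SVD of X / sqrt(n)."""
    _, s, vt = np.linalg.svd(X / np.sqrt(X.shape[0]), full_matrices=False)
    # Scale the top-k right singular vectors by their singular values.
    return vt[:k].T * s[:k]


def frobenius_similarity(A, B):
    """Normalized Frobenius inner product of A @ A.T and B @ B.T.

    Uses <A A^T, B B^T>_F = ||A^T B||_F^2, so the d x d matrices are
    never formed explicitly.
    """
    cross = np.linalg.norm(A.T @ B) ** 2
    return cross / (np.linalg.norm(A.T @ A) * np.linalg.norm(B.T @ B))
```

With rank k, each document is stored as k vectors of dimension d, which is consistent with the abstract's description of representations a few times the size of the mean vector.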
2018
Benchmarks and models for entity-oriented polarity detection
Lidia Pivovarova | Arto Klami | Roman Yangarber
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)
We address the problem of determining entity-oriented polarity in business news. This can be viewed as classifying the polarity of the sentiment expressed toward a given mention of a company in a news article. We present a complete, end-to-end approach to the problem. We introduce a new dataset of over 17,000 manually labeled documents, which is substantially larger than any currently available resources. We propose a benchmark solution based on convolutional neural networks for classifying entity-oriented polarity. Although our dataset is much larger than those currently available, it is small on the scale of datasets commonly used for training robust neural network models. To compensate for this, we use transfer learning: we pre-train the model on a much larger dataset, annotated for a related but different classification task, in order to learn a good representation for business text, and then fine-tune it on the smaller polarity dataset.
2017
HCS at SemEval-2017 Task 5: Polarity detection in business news using convolutional neural networks
Lidia Pivovarova | Llorenç Escoter | Arto Klami | Roman Yangarber
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
Task 5 of SemEval-2017 involves fine-grained sentiment analysis on financial microblogs and news. Our solution for determining the sentiment score extends an earlier convolutional neural network for sentiment analysis in several ways. We explicitly encode a focus on a particular company, we apply a data augmentation scheme, and use a larger data collection to complement the small training data provided by the task organizers. The best results were achieved by training a model on an external dataset and then tuning it using the provided training dataset.