2023
Idioms, Probing and Dangerous Things: Towards Structural Probing for Idiomaticity in Vector Space
Filip Klubička | Vasudevan Nedumpozhimana | John Kelleher
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
The goal of this paper is to learn more about how idiomatic information is structurally encoded in embeddings, using a structural probing method. We repurpose an existing English verbal multi-word expression (MWE) dataset to suit the probing framework and perform a comparative probing study of static (GloVe) and contextual (BERT) embeddings. Our experiments indicate that both encode some idiomatic information to varying degrees, but yield conflicting evidence as to whether idiomaticity is encoded in the vector norm, leaving this an open question. We also identify some limitations of the dataset used and highlight important directions for future work in improving its suitability for a probing analysis.
Using MT for multilingual covid-19 case load prediction from social media texts
Maja Popovic | Vasudevan Nedumpozhimana | Meegan Gower | Sneha Rautmare | Nishtha Jain | John Kelleher
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
In the context of an epidemiological study involving multilingual social media, this paper reports on the ability of machine translation systems to preserve content relevant for a document classification task designed to determine whether a social media text is related to Covid-19. The results indicate that machine translation does provide a feasible basis for scaling epidemiological social media surveillance to multiple languages. Moreover, a qualitative error analysis revealed that the majority of classification errors are not caused by MT errors.
Medical Concept Mention Identification in Social Media Posts Using a Small Number of Sample References
Vasudevan Nedumpozhimana | Sneha Rautmare | Meegan Gower | Nishtha Jain | Maja Popović | Patricia Buffini | John Kelleher
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing
Identification of mentions of medical concepts in social media text can provide useful information for caseload prediction of diseases like Covid-19 and measles. We propose a simple model for the automatic identification of medical concept mentions in social media text. We validate the effectiveness of the proposed model on Twitter, Reddit, and News/Media datasets.
2021
Finding BERT’s Idiomatic Key
Vasudevan Nedumpozhimana | John Kelleher
Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021)
Sentence embeddings encode information relating to the usage of idioms in a sentence. This paper reports a set of experiments that combine a probing methodology with input masking to analyse where in a sentence this idiomatic information is taken from, and what form it takes. Our results indicate that BERT’s idiomatic key is primarily found within an idiomatic expression, but also draws on information from the surrounding context. In addition, BERT can distinguish between the disruption in a sentence caused by missing words and the incongruity caused by idiomatic usage.