2024
Embible: Reconstruction of Ancient Hebrew and Aramaic Texts Using Transformers
Niv Fono, Harel Moshayof, Eldar Karol, Itai Assraf, Mark Last
Findings of the Association for Computational Linguistics: EACL 2024
Hebrew and Aramaic inscriptions serve as an essential source of information on the ancient history of the Near East. Unfortunately, some parts of the inscribed texts become illegible over time. Specialists called epigraphists use time-consuming manual procedures to estimate the missing content. This problem can be considered an extended masked language modeling task, where the damaged content can comprise single characters, character n-grams (partial words), single complete words, and multi-word n-grams. This study is the first attempt to apply the masked language modeling approach to corrupted inscriptions in the Hebrew and Aramaic languages, both of which use the Hebrew alphabet, consisting mostly of consonant symbols. In our experiments, we evaluate several transformer-based models, which are fine-tuned on the Biblical texts and tested on three different percentages of randomly masked parts in the testing corpus. For every masking percentage, the highest text completion accuracy is obtained with a novel ensemble of word and character prediction models.
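The evaluation setup described in this abstract (randomly mask a share of the text, then score how much of it a model restores) can be illustrated with a minimal sketch. It is not the authors' code: the `?` placeholder, the function names, and the restriction to single-character masking are all assumptions made for illustration, and the abstract's n-gram and whole-word masking are not covered.

```python
import random

MASK = "?"  # hypothetical placeholder standing in for an illegible character


def mask_characters(text, fraction, seed=0):
    """Replace roughly `fraction` of the non-space characters with MASK.

    Returns the masked text and the sorted list of masked positions.
    """
    rng = random.Random(seed)
    candidates = [i for i, ch in enumerate(text) if not ch.isspace()]
    k = round(len(candidates) * fraction)
    chosen = set(rng.sample(candidates, k))
    masked = "".join(MASK if i in chosen else ch for i, ch in enumerate(text))
    return masked, sorted(chosen)


def completion_accuracy(original, predicted, masked_positions):
    """Share of masked positions where `predicted` restores the original character."""
    if not masked_positions:
        return 1.0
    hits = sum(original[i] == predicted[i] for i in masked_positions)
    return hits / len(masked_positions)
```

A model's output would be passed as `predicted`; a perfect reconstruction scores 1.0, and returning the masked text unchanged scores 0.0.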
2022
An Interactive Analysis of User-reported Long COVID Symptoms using Twitter Data
Lin Miao, Mark Last, Marina Litvak
Proceedings of the 2nd Workshop on Deriving Insights from User-Generated Text
With millions of documented recoveries from COVID-19 worldwide, various long-term sequelae have been observed in a large group of survivors. This paper systematically analyzes user-generated conversations on Twitter that are related to long-term COVID symptoms for a better understanding of the health consequences of Long COVID. Using an interactive information extraction tool built especially for this purpose, we extracted key information from the relevant tweets and analyzed the user-reported Long COVID symptoms with respect to their demographic and geographical characteristics. The results of our analysis are expected to improve public awareness of long-term COVID-19 sequelae and provide important insights to public health authorities.
2020
Detecting Troll Tweets in a Bilingual Corpus
Lin Miao, Mark Last, Marina Litvak
Proceedings of the Twelfth Language Resources and Evaluation Conference
During the past several years, a large number of troll accounts have emerged with efforts to manipulate public opinion on social network sites. They are often involved in spreading misinformation, fake news, and propaganda with the intent of distracting and sowing discord. This paper aims to detect troll tweets in both English and Russian, assuming that the tweets are generated by some “troll farm.” We reduce this task to the authorship verification problem of determining whether a single tweet is authored by a “troll farm” account or not. We evaluate a supervised classification approach with monolingual, cross-lingual, and bilingual training scenarios, using several machine learning algorithms, including deep learning. The best results are attained by bilingual learning, with an area under the ROC curve (AUC) of 0.875 and 0.828 for tweet classification on the English and Russian test sets, respectively. It is noteworthy that these results are obtained using only raw text features, which do not require manual feature engineering efforts. In this paper, we introduce a resource of English and Russian troll tweets containing the original tweets and their translations from English to Russian and from Russian to English. It is available for academic purposes.
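For readers unfamiliar with the AUC metric reported in this abstract, the following is a minimal sketch of how it can be computed from binary labels and classifier scores. It uses the pairwise-ranking definition of AUC and is not taken from the paper; the function name and the toy data are assumptions for illustration only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    AUC equals the probability that a randomly chosen positive example
    (label 1, e.g. a troll tweet) scores higher than a randomly chosen
    negative example (label 0); ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example with made-up scores: 3 of the 4 positive/negative pairs
# are ranked correctly, so the AUC is 0.75.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The quadratic pairwise loop is fine for small illustrations; library implementations typically sort the scores instead.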
Twitter Data Augmentation for Monitoring Public Opinion on COVID-19 Intervention Measures
Lin Miao, Mark Last, Marina Litvak
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020
The COVID-19 outbreak is an ongoing worldwide pandemic that was declared a global health crisis in March 2020. Due to the enormous challenges and high stakes of this pandemic, governments have implemented a wide range of policies aimed at containing the spread of the virus and its negative effect on multiple aspects of our lives. Public responses to the various intervention measures imposed over time can be explored by analyzing social media. Due to the shortage of available labeled data for this new and evolving domain, we apply a data distillation methodology to labeled datasets from related tasks and a very small manually labeled dataset. Our experimental results show that data distillation outperforms other data augmentation methods on our task.
2019
Using Graphs for Word Embedding with Enhanced Semantic Relations
Matan Zuckerman, Mark Last
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)
Word embedding algorithms have become a common tool in the field of natural language processing. While some, like Word2Vec, are based on sequential text input, others use a graph representation of text. In this paper, we introduce a new algorithm, named WordGraph2Vec, or WG2V for short, which combines the two approaches to gain the benefits of both. The algorithm uses a directed word graph to provide additional information for sequential text input algorithms. Our experiments on benchmark datasets show that text classification algorithms are nearly as accurate with WG2V as with other word embedding models while preserving more stable accuracy rankings.
2016
Exploring Long-Term Temporal Trends in the Use of Multiword Expressions
Tal Daniel, Mark Last
Proceedings of the 12th Workshop on Multiword Expressions
MUSEEC: A Multilingual Text Summarization Tool
Marina Litvak, Natalia Vanetik, Mark Last, Elena Churkin
Proceedings of ACL-2016 System Demonstrations
2015
Krimping texts for better summarization
Marina Litvak, Mark Last, Natalia Vanetik
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
2013
Multilingual Single-Document Summarization with MUSE
Marina Litvak, Mark Last
Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization
2010
A New Approach to Improving Multilingual Summarization Using a Genetic Algorithm
Marina Litvak, Mark Last, Menahem Friedman
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics
Towards multi-lingual summarization: A comparative analysis of sentence extraction methods on English and Hebrew corpora
Marina Litvak, Mark Last, Slava Kisilevich, Daniel Keim, Hagay Lipman, Assaf Ben Gur
Proceedings of the 4th Workshop on Cross Lingual Information Access
2008
Graph-Based Keyword Extraction for Single-Document Summarization
Marina Litvak, Mark Last
Coling 2008: Proceedings of the workshop Multi-source Multilingual Information Extraction and Summarization