Anna Feldman


2021

Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
Anna Feldman | Giovanni Da San Martino | Chris Leberknight | Preslav Nakov
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

Findings of the NLP4IF-2021 Shared Tasks on Fighting the COVID-19 Infodemic and Censorship Detection
Shaden Shaar | Firoj Alam | Giovanni Da San Martino | Alex Nikolov | Wajdi Zaghouani | Preslav Nakov | Anna Feldman
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

We present the results and the main findings of the NLP4IF-2021 shared tasks. Task 1 focused on fighting the COVID-19 infodemic in social media, and it was offered in Arabic, Bulgarian, and English. Given a tweet, the task asked participants to predict whether the tweet contains a verifiable claim and, if so, whether it is likely to be false, is of general interest, is likely to be harmful, and is worthy of manual fact-checking; it also asked whether the tweet is harmful to society and whether it requires the attention of policy makers. Task 2 focused on censorship detection, and was offered in Chinese. A total of ten teams submitted systems for task 1, and one team participated in task 2; nine teams also submitted a system description paper. Here, we present the tasks, analyze the results, and discuss the system submissions and the methods they used. Most submissions achieved sizable improvements over several baselines, and the best systems used pre-trained Transformers and ensembles. The data, the scorers and the leaderboards for the tasks are available at http://gitlab.com/NLP4IF/nlp4if-2021.

2020

Proceedings of the Second Workshop on Figurative Language Processing
Beata Beigman Klebanov | Ekaterina Shutova | Patricia Lichtenstein | Smaranda Muresan | Chee Wee | Anna Feldman | Debanjan Ghosh
Proceedings of the Second Workshop on Figurative Language Processing

Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
Giovanni Da San Martino | Chris Brew | Giovanni Luca Ciampaglia | Anna Feldman | Chris Leberknight | Preslav Nakov
Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

2019

Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda
Anna Feldman | Giovanni Da San Martino | Alberto Barrón-Cedeño | Chris Brew | Chris Leberknight | Preslav Nakov
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

Neural Network Prediction of Censorable Language
Kei Yin Ng | Anna Feldman | Jing Peng | Chris Leberknight
Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science

Internet censorship imposes restrictions on what information can be publicized or viewed on the Internet. According to Freedom House’s annual Freedom on the Net report, more than half the world’s Internet users now live in a place where the Internet is censored or restricted. China has built the world’s most extensive and sophisticated online censorship system. In this paper, we describe a new corpus of censored and uncensored social media tweets from a Chinese microblogging website, Sina Weibo, collected by tracking posts that mention ‘sensitive’ topics or are authored by ‘sensitive’ users. We use this corpus to build a neural network classifier to predict censorship. Our model achieves 88.50% accuracy using only linguistic features. We discuss these features in detail and hypothesize that they could potentially be used for censorship circumvention.

2018

Designing a Russian Idiom-Annotated Corpus
Katsiaryna Aharodnik | Anna Feldman | Jing Peng
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Proceedings of the First Workshop on Natural Language Processing for Internet Freedom
Chris Brew | Anna Feldman | Chris Leberknight
Proceedings of the First Workshop on Natural Language Processing for Internet Freedom

Linguistic Characteristics of Censorable Language on SinaWeibo
Kei Yin Ng | Anna Feldman | Jing Peng | Chris Leberknight
Proceedings of the First Workshop on Natural Language Processing for Internet Freedom

This paper investigates censorship from a linguistic perspective. We collect a corpus of censored and uncensored posts on a number of topics and build a classifier that predicts censorship decisions independently of discussion topic. Our investigation reveals that the strongest linguistic indicator of censored content in our corpus is its readability.

Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Beatrice Alex | Stefania Degaetano-Ortlieb | Anna Feldman | Anna Kazantseva | Nils Reiter | Stan Szpakowicz
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

2017

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Beatrice Alex | Stefania Degaetano-Ortlieb | Anna Feldman | Anna Kazantseva | Nils Reiter | Stan Szpakowicz
Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

2016

Proceedings of the Fifth Workshop on Computational Linguistics for Literature
Anna Feldman | Anna Kazantseva | Stan Szpakowicz
Proceedings of the Fifth Workshop on Computational Linguistics for Literature

Experiments in Idiom Recognition
Jing Peng | Anna Feldman
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Some expressions can be ambiguous between idiomatic and literal interpretations depending on the context they occur in, e.g., ‘sales hit the roof’ vs. ‘hit the roof of the car’. We present a novel method of classifying whether a given instance is literal or idiomatic, focusing on verb-noun constructions. We report state-of-the-art results on this task using an approach based on the hypothesis that the distributions of the contexts of the idiomatic phrases will be different from the contexts of the literal usages. We measure contexts by using projections of the words into vector space. For comparison, we implement the methods of Fazly et al. (2009), Sporleder and Li (2009), and Li and Sporleder (2010b) and apply them to our data. We provide experimental results validating the proposed techniques.

2015

Literature Lifts Up Computational Linguistics
David K. Elson | Anna Feldman | Anna Kazantseva | Stan Szpakowicz
Linguistic Issues in Language Technology, Volume 12, 2015 - Literature Lifts up Computational Linguistics

Proceedings of the Fourth Workshop on Computational Linguistics for Literature
Anna Feldman | Anna Kazantseva | Stan Szpakowicz | Corina Koolen
Proceedings of the Fourth Workshop on Computational Linguistics for Literature

Classifying Idiomatic and Literal Expressions Using Vector Space Representations
Jing Peng | Anna Feldman | Hamza Jazmati
Proceedings of the International Conference Recent Advances in Natural Language Processing

2014

Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL)
Anna Feldman | Anna Kazantseva | Stan Szpakowicz
Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL)

Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions
Jing Peng | Anna Feldman | Ekaterina Vylomova
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

Automatic Identification of Learners’ Language Background Based on Their Writing in Czech
Katsiaryna Aharodnik | Marco Chang | Anna Feldman | Jirka Hana
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2011

A low-budget tagger for Old Czech
Jirka Hana | Anna Feldman | Katsiaryna Aharodnik
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2010

Proceedings of the NAACL HLT 2010 Second Workshop on Computational Approaches to Linguistic Creativity
Paul Cook | Anna Feldman
Proceedings of the NAACL HLT 2010 Second Workshop on Computational Approaches to Linguistic Creativity

Challenges of Cheap Resource Creation
Jirka Hana | Anna Feldman
Proceedings of the Fourth Linguistic Annotation Workshop

Like Finding a Needle in a Haystack: Annotating the American National Corpus for Idiomatic Expressions
Laura Street | Nathan Michalov | Rachel Silverstein | Michael Reynolds | Lurdes Ruela | Felicia Flowers | Angela Talucci | Priscilla Pereira | Gabriella Morgon | Samantha Siegel | Marci Barousse | Antequa Anderson | Tashom Carroll | Anna Feldman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Our paper presents the details of a pilot study in which we tagged portions of the American National Corpus (ANC) for idioms composed of verb-noun constructions, prepositional phrases, and subordinate clauses. The three data sets we analyzed included 1,500-sentence samples from the spoken, the nonfiction, and the fiction portions of the ANC. Our paper provides the details of the tagset we developed, the motivation behind our choices, and the inter-annotator agreement measures we deemed appropriate for this task. In tagging the ANC for idiomatic expressions, our annotators achieved a high level of agreement (> .80) on the tags but a low level of agreement (< .00) on what constituted an idiom. These findings support the claim that identifying idiomatic and metaphorical expressions is a highly difficult and subjective task. In total, 135 idiom types and 154 idiom tokens were identified. Based on the total tokens found for each idiom class, we suggest that future research on idiom detection and idiom annotation include prepositional phrases, as this class of idioms occurred frequently in the nonfiction and spoken samples of our corpus.

A Positional Tagset for Russian
Jirka Hana | Anna Feldman
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Fusional languages have rich inflection. As a consequence, tagsets capturing their morphological features are necessarily large. A natural way to make a tagset manageable is to use a structured system. In this paper, we present a positional tagset for describing morphological properties of Russian. The tagset was inspired by the Czech positional system (Hajic, 2004). We have used preliminary versions of this tagset in our previous work (e.g., Hana et al. (2004, 2006); Feldman (2006); Feldman and Hana (2010)). Here, we systematize and extend these preliminary versions (by adding information about animacy, aspect and reflexivity), give a more detailed description of the tagset, and provide a comparison with the Czech system. Each tag of the tagset consists of 16 positions, each encoding one morphological feature (part-of-speech, detailed part-of-speech, gender, animacy, number, case, possessor's gender and number, person, reflexivity, tense, aspect, degree of comparison, negation, voice, variant). The tagset contains approximately 2,000 tags.
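
As a rough illustration of what a positional tag looks like in practice, the sketch below decodes a 16-character tag into named fields. The position names follow the order given in the abstract. Everything else is an assumption made for illustration only: that each position is a single character, that "possessor's gender and number" occupies two positions, that "-" marks a feature that does not apply, and the example tag itself, whose value codes are hypothetical and not taken from the paper.

```python
from typing import Dict

# Position names in the order listed in the abstract. Splitting
# "possessor's gender and number" into two slots is an assumption
# made here so that the list reaches 16 positions.
POSITIONS = (
    "pos", "detailed_pos", "gender", "animacy", "number", "case",
    "possessor_gender", "possessor_number", "person", "reflexivity",
    "tense", "aspect", "degree", "negation", "voice", "variant",
)

NOT_APPLICABLE = "-"  # assumed placeholder for an unused position


def decode_tag(tag: str) -> Dict[str, str]:
    """Map each character of a positional tag to its feature name."""
    if len(tag) != len(POSITIONS):
        raise ValueError(
            f"expected {len(POSITIONS)} positions, got {len(tag)}"
        )
    return dict(zip(POSITIONS, tag))


def specified_features(tag: str) -> Dict[str, str]:
    """Keep only the positions that carry a value."""
    return {
        name: value
        for name, value in decode_tag(tag).items()
        if value != NOT_APPLICABLE
    }


if __name__ == "__main__":
    # Hypothetical tag; the value codes are illustrative only.
    example = "NNFIS1----------"
    for name, value in specified_features(example).items():
        print(f"{name}: {value}")
```

Because every feature has a fixed slot, two tags can be compared position by position, which is what makes a large tagset of this kind manageable.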

2009

Proceedings of the Workshop on Computational Approaches to Linguistic Creativity
Anna Feldman | Birte Loenneker-Rodman
Proceedings of the Workshop on Computational Approaches to Linguistic Creativity

2008

Annotating an Arabic Learner Corpus for Error
Ghazi Abuhakema | Reem Faraj | Anna Feldman | Eileen Fitzpatrick
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper describes an ongoing project in which we are collecting a learner corpus of Arabic, developing a tagset for error annotation and performing Computer-aided Error Analysis (CEA) on the data. We adapted the French Interlanguage Database FRIDA tagset (Granger, 2003a) to the data. We chose FRIDA in order to follow a known standard and to see whether the changes needed to move from a French to an Arabic tagset would give us a measure of the distance between the two languages with respect to learner difficulty. The current collection of texts, which is constantly growing, contains intermediate and advanced-level student writings. We describe the need for such corpora, the learner data we have collected and the tagset we have developed. We also describe the error frequency distribution of both proficiency levels and the ongoing work.

Designing and Evaluating a Russian Tagset
Serge Sharoff | Mikhail Kopotev | Tomaž Erjavec | Anna Feldman | Dagmar Divjak
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper reports the principles behind designing a tagset to cover Russian morphosyntactic phenomena, modifications of the core tagset, and its evaluation. The tagset is based on the MULTEXT-East framework, while the decisions in designing it were aimed at achieving a balance between parameters important for linguists and the possibility to detect and disambiguate them automatically. The final tagset contains about 500 tags and achieves about 95% accuracy on the disambiguated portion of the Russian National Corpus. We have also produced a test set that can be shared with other researchers.

2007

Proceedings of the Workshop on Computational Approaches to Figurative Language
Anna Feldman | Xiaofei Lu
Proceedings of the Workshop on Computational Approaches to Figurative Language

2006

Book Reviews: Computational Linguistics: Models, Resources, Applications, edited by Igor A. Bolshakov and Alexander Gelbukh
Anna Feldman
Computational Linguistics, Volume 32, Number 3, September 2006

Tagging Portuguese with a Spanish Tagger
Jirka Hana | Anna Feldman | Luiz Amaral | Chris Brew
Proceedings of the Cross-Language Knowledge Induction Workshop

A Cross-language Approach to Rapid Creation of New Morpho-syntactically Annotated Resources
Anna Feldman | Jirka Hana | Chris Brew
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We take a novel approach to rapid, low-cost development of morpho-syntactically annotated resources without using parallel corpora or bilingual lexicons. The overall research question is how to exploit language resources and properties to facilitate and automate the creation of morphologically annotated corpora for new languages. This portability issue is especially relevant to minority languages, for which such resources are likely to remain unavailable in the foreseeable future. We compare the performance of our system on languages that belong to different language families (Romance vs. Slavic), as well as different language pairs within the same language family (Portuguese via Spanish vs. Catalan via Spanish). We show that across language families, the most difficult category is the category of nominals (the noun homonymy is challenging for morphological analysis and the order variation of adjectives within a sentence makes it challenging to create a reliable model), whereas different language families present different challenges with respect to their morpho-syntactic descriptions: for the Slavic languages, case is the most challenging category; for the Romance languages, gender is more challenging than case. In addition, we present an alternative evaluation metric for our system, where we measure how much human labor will be needed to convert the result of our tagging to a high precision annotated resource.

2004

A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources
Jiri Hana | Anna Feldman | Chris Brew
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing