Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics

Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics Florian Kunneman Uxoa Iñurrieta John J. Camilleri Mariona Coll Ardanuy April 2017

Valencia, Spain

Association for Computational Linguistics http://www.aclweb.org/anthology/E17-4 book EACLSRW17:2017 Pragmatic descriptions of perceptual stimuli Emiel van Miltenburg Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 1–10 http://www.aclweb.org/anthology/E17-4001 This research proposal discusses pragmatic factors in image description, arguing that current automatic image description systems do not take these factors into account. I present a general model of the human image description process, and propose to study this process using corpus analysis, experiments, and computational modeling. This will lead to a better characterization of human image description behavior, providing a road map for future research in automatic image description, and the automatic description of perceptual stimuli in general. inproceedings vanmiltenburg:2017:EACLSRW17 Detecting spelling variants in non-standard texts Fabian Barteld Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 11–22 http://www.aclweb.org/anthology/E17-4002 Spelling variation in non-standard language, e.g. computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as a non-standard spelling for tomorrow. Consequently, in normalization – the standard approach to dealing with spelling variation – so-called non-standard words are mapped to their corresponding standard words. However, there is not always a corresponding standard word. This can be the case for single types (like emoticons in computer-mediated communication) or a complete language, e.g. texts from historical languages that did not develop a standard variety. The approach presented in this thesis proposal deals with spelling variation in the absence of reference to a standard. The task is to detect pairs of types that are variants of the same morphological word. An approach for spelling-variant detection is presented, where pairs of potential spelling variants are generated with Levenshtein distance and subsequently filtered by supervised machine learning. The approach is evaluated on historical Low German texts. Finally, further perspectives are discussed. inproceedings barteld:2017:EACLSRW17 Replication issues in syntax-based aspect extraction for opinion mining Edison Marrese-Taylor Yutaka Matsuo Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 23–32 http://www.aclweb.org/anthology/E17-4003 Reproducing experiments is an important instrument to validate previous work and build upon existing approaches. It has been tackled numerous times in different areas of science. In this paper, we introduce an empirical replicability study of three well-known algorithms for syntax-centric aspect-based opinion mining. We show that reproducing results continues to be a difficult endeavor, mainly due to the lack of details regarding preprocessing and parameter setting, as well as due to the absence of available implementations that clarify these details. We consider these to be important threats to the validity of research in the field, specifically when compared to other problems in NLP where public datasets and code availability are critical validity components. We conclude by encouraging code-based research, which we think has a key role in helping researchers to understand the meaning of the state-of-the-art better and to generate continuous advances. inproceedings marresetaylor-matsuo:2017:EACLSRW17 Discourse Relations and Conjoined VPs: Automated Sense Recognition Valentina Pyatkin Bonnie Webber Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 33–42 http://www.aclweb.org/anthology/E17-4004 Sense classification of discourse relations is a sub-task of shallow discourse parsing. Discourse relations can occur both across sentences (inter-sentential) and within sentences (intra-sentential), and more than one discourse relation can hold between the same units. Using a newly available corpus of discourse-annotated intra-sentential conjoined verb phrases, we demonstrate a sequential classification pipeline for their multi-label sense classification. We assess the importance of each feature used in the classification, the feature scope, and what is lost in moving from gold standard manual parses to the output of an off-the-shelf parser. inproceedings pyatkin-webber:2017:EACLSRW17 Deception detection in Russian texts Olga Litvinova Pavel Seredin Tatiana Litvinova John Lyell Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 43–52 http://www.aclweb.org/anthology/E17-4005 Humans are known to detect deception in speech no better than chance, and it is therefore important to develop tools to enable them to detect deception. The problem of deception detection has been studied for a significant amount of time; however, the last 10-15 years have seen methods of computational linguistics being employed. Texts are processed using different NLP tools and then classified as deceptive/truthful using machine learning methods. While most research has been performed for English, Slavic languages have never been a focus of deception detection studies. The paper deals with deception detection in Russian narratives. It employs a specially designed corpus of truthful and deceptive texts on the same topic from each respondent, N = 113. The texts were processed using Linguistic Inquiry and Word Count software, which is used in most studies of text-based deception detection. The list of parameters computed using the software was expanded by means of specially designed user dictionaries. A variety of text classification methods was employed. The accuracy of the model was found to depend on the author's gender and text type (deceptive/truthful). inproceedings litvinova-EtAl:2017:EACLSRW17 A Computational Model of Human Preferences for Pronoun Resolution Olga Seminck Pascal Amsili Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 53–63 http://www.aclweb.org/anthology/E17-4006 We present a cognitive computational model of pronoun resolution that reproduces the human interpretation preferences of the Subject Assignment Strategy and the Parallel Function Strategy. Our model relies on a probabilistic pronoun resolution system trained on corpus data. Factors influencing pronoun resolution are represented as features weighted by their relative importance. The importance the model gives to the preferences is in line with psycholinguistic studies. We demonstrate the cognitive plausibility of the model by running it on experimental items and simulating antecedent choice and reading times of human participants. Our model can be used as a new means to study pronoun resolution, because it captures the interaction of preferences. inproceedings seminck-amsili:2017:EACLSRW17 Automatic Extraction of News Values from Headline Text Alicja Piotrkowicz Vania Dimitrova Katja Markert Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 64–74 http://www.aclweb.org/anthology/E17-4007 Headlines play a crucial role in attracting audiences' attention to online artefacts (e.g. news articles, videos, blogs). The ability to carry out an automatic, large-scale analysis of headlines is critical to facilitate the selection and prioritisation of a large volume of digital content. In journalism studies, news content has been extensively studied using manually annotated news values: factors used implicitly and explicitly when making decisions on the selection and prioritisation of news items. This paper presents the first attempt at a fully automatic extraction of news values from headline text. The news values extraction methods are applied to a large headlines corpus collected from The Guardian, and evaluated against a manually annotated gold standard. A crowdsourcing survey indicates that news values affect people's decisions to click on a headline, supporting the need for automatic news values detection. inproceedings piotrkowicz-dimitrova-markert:2017:EACLSRW17 Assessing Convincingness of Arguments in Online Debates with Limited Number of Features Lisa Andreevna Chalaguine Claudia Schulz Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 75–83 http://www.aclweb.org/anthology/E17-4008 We propose a new method in the field of argument analysis in social media for determining the convincingness of arguments in online debates, following previous research by Habernal and Gurevych (2016). Rather than using argument-specific feature values, we measure feature values relative to the average value in the debate, allowing us to determine argument convincingness with fewer features (between 5 and 35) than normally used for natural language processing tasks. We use a simple feed-forward neural network for this task and achieve an accuracy of 0.77, which is comparable to the accuracy obtained using 64k features and a support vector machine by Habernal and Gurevych. inproceedings chalaguine-schulz:2017:EACLSRW17 Zipf's and Benford's laws in Twitter hashtags José Alberto Pérez-Melián J. Alberto Conejero Cesar Ferri Ramírez Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 84–93 http://www.aclweb.org/anthology/E17-4009 Social networks have transformed communication dramatically in recent years through the rise of new platforms and the development of a new language of communication. This landscape requires new ways to describe and predict the behaviour of users in networks. This paper presents an analysis of the frequency distribution of hashtag popularity in Twitter conversations. Our objective is to determine whether this frequency distribution follows one of the well-known frequency distributions that many real-life sets of numerical data satisfy. In particular, we study the similarity of the frequency distribution of hashtag popularity with respect to Zipf’s law, an empirical law referring to the phenomenon that many types of data in social sciences can be approximated with a Zipfian distribution. Additionally, we analyse Benford’s law, a special case of Zipf's law that describes a common pattern in the frequency distribution of leading digits. To compute the frequency distribution of hashtag popularity correctly, we need to correct many spelling errors that Twitter's users introduce. For this purpose we introduce a new filter to correct hashtag mistakes based on string distances. Experiments on datasets of Twitter streams generated under controlled conditions show that Benford’s law and Zipf's law can be used to model the hashtag frequency distribution. inproceedings perezmelian-conejero-ferriramirez:2017:EACLSRW17 A Multi-aspect Analysis of Automatic Essay Scoring for Brazilian Portuguese Evelin Amorim Adriano Veloso Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 94–102 http://www.aclweb.org/anthology/E17-4010 Several methods for automatic essay scoring (AES) for the English language have been proposed. However, multi-aspect AES systems for other languages are unusual. Therefore, we propose a multi-aspect AES system and apply it to a dataset of Brazilian Portuguese essays, which human experts evaluated according to five aspects defined by the Brazilian Government for the National Exam for High School Students (ENEM). These aspects are skills that students must master, and each skill is assessed separately from the others. Besides the prediction of each aspect, a feature analysis was also performed for each aspect. The proposed AES system employs several features already employed by AES systems for the English language. Our results show that predictions for some aspects performed well with the features we employed, while predictions for other aspects performed poorly. Also, the detailed feature analysis we performed reveals differences between the five aspects. Besides these contributions, the eight million enrollments for ENEM every year raise some challenging issues for future directions in our research. inproceedings amorim-veloso:2017:EACLSRW17 Literal or idiomatic? Identifying the reading of single occurrences of German multiword expressions using word embeddings Rafael Ehren Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 103–112 http://www.aclweb.org/anthology/E17-4011 Non-compositional multiword expressions (MWEs) still pose serious issues for a variety of natural language processing tasks and their ubiquity makes it impossible to get around methods which automatically identify this kind of MWE. The method presented in this paper was inspired by Sporleder and Li and is able to discriminate between the literal and non-literal use of an MWE in an unsupervised way. It is based on the assumption that words in a text form cohesive units. If the cohesion of these units is weakened by an expression, it is classified as literal, and otherwise as idiomatic. While Sporleder and Li used Normalized Google Distance to model semantic similarity, the present work examines the use of a variety of different word embeddings. inproceedings ehren:2017:EACLSRW17 Evaluating the Reliability and Interaction of Recursively Used Feature Classes for Terminology Extraction Anna Hätty Michael Dorna Sabine Schulte im Walde Proceedings of the Student Research Workshop at the 15th Conference of the European Chapter of the Association for Computational Linguistics April 2017

Valencia, Spain

Association for Computational Linguistics 113–121 http://www.aclweb.org/anthology/E17-4012 Feature design and selection is a crucial aspect when treating terminology extraction as a machine learning classification problem. We designed feature classes which characterize different properties of terms based on distributions, and propose a new feature class for components of term candidates. By using random forests, we infer optimal features which are later used to build decision tree classifiers. We evaluate our method using the ACL RD-TEC dataset. We demonstrate the importance of the novel feature class for downgrading termhood which exploits properties of term components. Furthermore, our classification suggests that the identification of reliable term candidates should be performed successively, rather than just once. inproceedings hatty-dorna-schulteimwalde:2017:EACLSRW17