Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Beatrice Alex, Stefania Degaetano-Ortlieb, Anna Feldman, Anna Kazantseva, Nils Reiter, Stan Szpakowicz

August 2017

Vancouver, Canada

Association for Computational Linguistics

http://www.aclweb.org/anthology/W17-22

book LaTeCH-CLfL:2017

Metaphor Detection in a Poetry Corpus

Vaibhav Kesarwani, Diana Inkpen, Stan Szpakowicz, Chris Tanasescu

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 1–9

http://www.aclweb.org/anthology/W17-2201

Metaphor is indispensable in poetry. It showcases the poet's creativity, and contributes to the overall emotional pertinence of the poem while honing its specific rhetorical impact. Previous work on metaphor detection relies on either rule-based or statistical models, none of them applied to poetry. Our method focuses on metaphor detection in a poetry corpus. It combines rule-based and statistical models (word embeddings) to develop a new classification system. Our system has achieved a precision of 0.759 and a recall of 0.804 in identifying one type of metaphor in poetry.

inproceedings kesarwani-EtAl:2017:LaTeCH-CLfL

Machine Translation and Automated Analysis of the Sumerian Language

Émilie Pagé-Perron, Maria Sukhareva, Ilya Khait, Christian Chiarcos

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 10–16

http://www.aclweb.org/anthology/W17-2202

This paper presents a newly funded international project for machine translation and automated analysis of ancient cuneiform languages where NLP specialists and Assyriologists collaborate to create an information retrieval system for Sumerian. This research is conceived in response to the need to translate large numbers of administrative texts that are only available in transcription, in order to make them accessible to a wider audience. The methodology includes creation of a specialized NLP pipeline and also the use of linguistic linked open data to increase access to the results.

inproceedings pageperron-EtAl:2017:LaTeCH-CLfL

Investigating the Relationship between Literary Genres and Emotional Plot Development

Evgeny Kim, Sebastian Padó, Roman Klinger

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 17–26

http://www.aclweb.org/anthology/W17-2203

Literary genres are commonly viewed as being defined in terms of content and stylistic features. In this paper, we focus on one particular class of lexical features, namely emotion information, and investigate the hypothesis that emotion-related information correlates with particular genres. Using genre classification as a testbed, we compare a model that computes lexicon-based emotion scores globally for complete stories with a model that tracks emotion arcs through stories on a subset of Project Gutenberg with five genres. Our main findings are: (a) the global emotion model is competitive with a large-vocabulary bag-of-words genre classifier (80% F1); (b) the emotion arc model shows a lower performance (59% F1) but shows complementary behavior to the global model, as indicated by a very good performance of an oracle model (94% F1) and an improved performance of an ensemble model (84% F1); (c) genres differ in the extent to which stories follow the same emotional arcs, with particularly uniform behavior for anger (mystery) and fear (adventures, romance, humor, science fiction).

inproceedings kim-pado-klinger:2017:LaTeCH-CLfL

Enjambment Detection in a Large Diachronic Corpus of Spanish Sonnets

Pablo Ruiz, Clara Martínez Cantón, Thierry Poibeau, Elena González-Blanco

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 27–32

http://www.aclweb.org/anthology/W17-2204

Enjambment takes place when a syntactic unit is broken up across two lines of poetry, giving rise to different stylistic effects. In Spanish literary studies, it remains unclear which types of stylistic effects can arise, and under which linguistic conditions. To systematically gather evidence about this, we developed a system to automatically identify enjambment (and its type) in Spanish. For evaluation, we manually annotated a reference corpus covering different periods. To apply the tool to a scholarly corpus, we built a diachronic corpus of sonnets from public HTML sources, covering four centuries (3750 poems), and analyzed the occurrence of enjambment across stanzaic boundaries in different periods. In addition, we found examples that highlight limitations in current definitions of enjambment.

inproceedings ruiz-EtAl:2017:LaTeCH-CLfL

Plotting Markson's "Mistress"

Conor Kelleher, Mark Keane

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 33–39

http://www.aclweb.org/anthology/W17-2205

The post-modern novel "Wittgenstein’s Mistress" by David Markson (1988) presents the reader with a very challenging non-linear narrative, which itself appears to be one of the novel’s themes. We present a distant reading of this work designed to complement a close reading of it by David Foster Wallace (1990). Using a combination of text analysis, entity recognition and networks, we plot repetitive structures in the novel’s narrative, relating them to its critical analysis.

inproceedings kelleher-keane:2017:LaTeCH-CLfL

Annotation Challenges for Reconstructing the Structural Elaboration of Middle Low German

Nina Seemann, Marie-Luis Merten, Michaela Geierhos, Doris Tophinke, Eyke Hüllermeier

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 40–45

http://www.aclweb.org/anthology/W17-2206

In this paper, we present the annotation challenges we have encountered when working on a historical language that was undergoing elaboration processes. We especially focus on syntactic ambiguity and gradience in Middle Low German, which cause some annotation uncertainty. Since current annotation tools consider construction contexts and the dynamics of grammaticalization only partially, we plan to extend CorA, a web-based annotation tool for historical and other non-standard language data, to capture elaboration phenomena and annotator uncertainty. Moreover, we seek to interactively learn morphological as well as syntactic annotations.

inproceedings seemann-EtAl:2017:LaTeCH-CLfL

Phonological Soundscapes in Medieval Poetry

Christopher Hench

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 46–56

http://www.aclweb.org/anthology/W17-2207

The oral component of medieval poetry was integral to its performance and reception. Yet many believe that the medieval voice has been forever lost, and any attempts at rediscovering it are doomed to failure due to scribal practices, manuscript mouvance, and linguistic normalization in editing practices. This paper offers a method to abstract from this noise and better understand relative differences in phonological soundscapes by considering syllable qualities. The presented syllabification method and soundscape analysis offer themselves as cross-disciplinary tools for low-resource languages. As a case study, we examine medieval German lyric and argue that the heavily debated lyrical ‘I’ follows a unique trajectory through soundscapes, shedding light on the performance and practice of these poets.

inproceedings hench:2017:LaTeCH-CLfL

An End-to-end Environment for Research Question-Driven Entity Extraction and Network Analysis

Andre Blessing, Nora Echelmeyer, Markus John, Nils Reiter

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 57–67

http://www.aclweb.org/anthology/W17-2208

This paper presents an approach to extract co-occurrence networks from literary texts. It is a deliberate decision not to aim for a fully automatic pipeline, as the literary research questions need to guide both the definition of the things that co-occur and the criteria for deciding co-occurrence. We showcase the approach on a Middle High German romance, Parzival. Manual inspection and discussion show the huge impact various choices have.

inproceedings blessing-EtAl:2017:LaTeCH-CLfL

Modeling intra-textual variation with entropy and surprisal: topical vs. stylistic patterns

Stefania Degaetano-Ortlieb, Elke Teich

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 68–77

http://www.aclweb.org/anthology/W17-2209

We present a data-driven approach to investigate intra-textual variation by combining entropy and surprisal. With this approach we detect linguistic variation based on phrasal lexico-grammatical patterns across sections of research articles. Entropy is used to detect patterns typical of specific sections. Surprisal is used to differentiate between more and less informationally-loaded patterns as well as type of information (topical vs. stylistic). While we here focus on research articles in biology/genetics, the methodology is especially interesting for digital humanities scholars, as it can be applied to any text type or domain and combined with additional variables (e.g. time, author or social group).

inproceedings degaetanoortlieb-teich:2017:LaTeCH-CLfL

Finding a Character’s Voice: Stylome Classification on Literary Characters

Liviu P. Dinu, Ana Sabina Uban

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 78–82

http://www.aclweb.org/anthology/W17-2210

In this paper, we investigate the problem of classifying the stylome of characters in a literary work. Previous research in the field of authorship attribution has shown that the writing style of an author can be characterized and distinguished from that of other authors automatically. Here we look at the less studied problem of how the styles of different characters can be distinguished, in order to verify whether an author managed to create believable characters with individual styles. We present the results of some initial experiments on the novel "Liaisons Dangereuses", showing that a simple bag-of-words model can be used to classify the characters.

inproceedings dinu-uban:2017:LaTeCH-CLfL

An Ontology-Based Method for Extracting and Classifying Domain-Specific Compositional Nominal Compounds

Maria Pia di Buono

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 83–88

http://www.aclweb.org/anthology/W17-2211

In this paper, we present our preliminary study on an ontology-based method to extract and classify compositional nominal compounds in specific domains of knowledge. This method is based on the assumption that, by applying a conceptual model to represent a knowledge domain, it is possible to improve the extraction and classification of lexicon occurrences for that domain in a semi-automatic way. We explore the possibility of extracting and classifying a specific construction type (nominal compounds) spanning a specific domain (Cultural Heritage) and a specific language (Italian).

inproceedings dibuono:2017:LaTeCH-CLfL

Speeding up corpus development for linguistic research: language documentation and acquisition in Romansh Tuatschin

Géraldine Walther, Benoît Sagot

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 89–94

http://www.aclweb.org/anthology/W17-2212

In this paper, we present ongoing work on developing language resources and basic NLP tools for an undocumented variety of Romansh, in the context of a language documentation and language acquisition project. Our tools are meant to improve the speed and reliability of corpus annotations for noisy data involving large amounts of code-switching, occurrences of child speech and orthographic noise. Being able to increase the efficiency of language resource development for language documentation and acquisition research also constitutes a step towards solving the data sparsity issues with which researchers have been struggling.

inproceedings walther-sagot:2017:LaTeCH-CLfL

Distantly Supervised POS Tagging of Low-Resource Languages under Extreme Data Sparsity: The Case of Hittite

Maria Sukhareva, Francesco Fuscagni, Johannes Daxenberger, Susanne Görke, Doris Prechel, Iryna Gurevych

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 95–104

http://www.aclweb.org/anthology/W17-2213

This paper presents a statistical approach to automatic morphosyntactic annotation of Hittite transcripts. Hittite is an extinct Indo-European language using the cuneiform script. There are currently no morphosyntactic annotations available for Hittite, so we explored methods of distant supervision. The annotations were projected from parallel German translations of the Hittite texts. In order to reduce data sparsity, we applied stemming to the German and Hittite texts. As there is no off-the-shelf Hittite stemmer, a stemmer for Hittite was developed for this purpose. The resulting annotation projections were used to train a POS tagger, achieving an accuracy of 69% on a test sample. To our knowledge, this is the first attempt at statistical POS tagging of a cuneiform language.

inproceedings sukhareva-EtAl:2017:LaTeCH-CLfL

A Dataset for Sanskrit Word Segmentation

Amrith Krishna, Pavan Kumar Satuluri, Pawan Goyal

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 105–114

http://www.aclweb.org/anthology/W17-2214

The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for word segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current approaches to word segmentation for Sanskrit, though commendable in their novelty, often vary in their objectives and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, the ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. We also discuss the linguistic considerations made while generating the candidate space of the possible segments.

inproceedings krishna-satuluri-goyal:2017:LaTeCH-CLfL

Lexical Correction of Polish Twitter Political Data

Maciej Ogrodniczuk, Mateusz Kopeć

Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, August 2017

Vancouver, Canada

Association for Computational Linguistics, pp. 115–125

http://www.aclweb.org/anthology/W17-2215

Language processing architectures are often evaluated in near-perfect conditions with respect to processed content. Tools which perform sufficiently well on electronic press, books and other types of non-interactive content may handle poorly the littered, colloquial and multilingual textual data which make up the majority of communication today. This paper investigates how Polish Twitter data (in a slightly controlled 'political' flavour) differ from the expectations of linguistic tools and how the data could be corrected to be ready for processing by the standard language processing chains available for Polish. The setting includes specialised components for spelling correction of tweets as well as hashtag and username decoding.

inproceedings ogrodniczuk-kopec:2017:LaTeCH-CLfL