<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="1200">
    <title>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</title>
    <editor>Preslav Nakov</editor>
    <editor>Marcos Zampieri</editor>
    <editor>Nikola Ljube&#x161;i&#x107;</editor>
    <editor>J&#246;rg Tiedemann</editor>
    <editor>Shervin Malmasi</editor>
    <editor>Ahmed Ali</editor>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-12</url>
    <bibtype>book</bibtype>
    <bibkey>VarDial:2017</bibkey>
  </paper>

  <paper id="1201">
    <title>Findings of the VarDial Evaluation Campaign 2017</title>
    <author><first>Marcos</first><last>Zampieri</last></author>
    <author><first>Shervin</first><last>Malmasi</last></author>
    <author><first>Nikola</first><last>Ljube&#x161;i&#x107;</last></author>
    <author><first>Preslav</first><last>Nakov</last></author>
    <author><first>Ahmed</first><last>Ali</last></author>
    <author><first>J&#246;rg</first><last>Tiedemann</last></author>
    <author><first>Yves</first><last>Scherrer</last></author>
    <author><first>No&#235;mi</first><last>Aepli</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;15</pages>
    <url>http://www.aclweb.org/anthology/W17-1201</url>
    <abstract>We present the results of the VarDial Evaluation Campaign on Natural Language
	Processing (NLP) for Similar Languages, Varieties and Dialects, which we
	organized as part of the fourth edition of the VarDial workshop at EACL'2017.
	This year, we included four shared tasks: Discriminating between Similar
	Languages (DSL), Arabic Dialect Identification (ADI), German Dialect
	Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19
	teams submitted runs across the four tasks, and 15 of them wrote system
	description papers.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zampieri-EtAl:2017:VarDial</bibkey>
  </paper>

  <paper id="1202">
    <title>Dialectometric analysis of language variation in Twitter</title>
    <author><first>Gonzalo</first><last>Donoso</last></author>
    <author><first>David</first><last>Sanchez</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>16&#8211;25</pages>
    <url>http://www.aclweb.org/anthology/W17-1202</url>
    <abstract>In the last few years, microblogging platforms such as Twitter have given rise
	to a deluge of textual data that can be used for the analysis of informal
	communication between millions of individuals. In this work, we propose an
	information-theoretic approach to geographic language variation using a corpus
	based on Twitter. We test our models with tens of concepts and their associated
	keywords detected in Spanish tweets geolocated in Spain. We employ
	dialectometric measures (cosine similarity and Jensen-Shannon divergence) to
	quantify the linguistic distance on the lexical level between cells created in
	a uniform grid over the map. This can be done for a single concept or in the
	general case taking into account an average of the considered variants. The
	latter permits an analysis of the dialects that naturally emerge from the data.
	Interestingly, our results reveal the existence of two dialect macrovarieties.
	The first group includes a region-specific speech spoken in small towns and
	rural areas whereas the second cluster encompasses cities that tend to use a
	more uniform variety. Since the results obtained with the two different metrics
	qualitatively agree, our work suggests that social media corpora can be
	efficiently used for dialectometric analyses.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>donoso-sanchez:2017:VarDial</bibkey>
  </paper>

  <paper id="1203">
    <title>Computational analysis of Gondi dialects</title>
    <author><first>Taraka</first><last>Rama</last></author>
    <author><first>&#199;a&#287;rı</first><last>&#199;&#246;ltekin</last></author>
    <author><first>Pavel</first><last>Sofroniev</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>26&#8211;35</pages>
    <url>http://www.aclweb.org/anthology/W17-1203</url>
    <abstract>This paper presents a computational analysis of Gondi dialects spoken in
	central India. We present a digitized data set of the dialect area, and analyze
	the data using different techniques from dialectometry, deep learning, and
	computational biology. We show that the methods largely agree with each other
	and with the earlier non-computational analyses of the language group.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rama-ccoltekin-sofroniev:2017:VarDial</bibkey>
  </paper>

  <paper id="1204">
    <title>Investigating Diatopic Variation in a Historical Corpus</title>
    <author><first>Stefanie</first><last>Dipper</last></author>
    <author><first>Sandra</first><last>Waldenberger</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>36&#8211;45</pages>
    <url>http://www.aclweb.org/anthology/W17-1204</url>
    <abstract>This paper investigates diatopic variation in a historical corpus of German.
	Based on equivalent word forms from different language areas, replacement rules
	and mappings are derived which describe the relations between these word forms.
	These rules and mappings are then interpreted as reflections of morphological,
	phonological or graphemic variation. Based on sample rules and mappings, we
	show that our approach can replicate results from historical linguistics. While
	previous studies were restricted to predefined word lists, or confined to
	single authors or texts, our approach uses a much wider range of data available
	in historical corpora.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>dipper-waldenberger:2017:VarDial</bibkey>
  </paper>

  <paper id="1205">
    <title>Author Profiling at PAN: from Age and Gender Identification to Language Variety Identification (invited talk)</title>
    <author><first>Paolo</first><last>Rosso</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>46</pages>
    <url>http://www.aclweb.org/anthology/W17-1205</url>
    <abstract>Author profiling is the study of how language is shared by people, a problem of
	growing importance in applications dealing with security, in order to
	understand who could be behind an anonymous threat message, and marketing,
	where companies may be interested in knowing the demographics of people who
	liked or disliked their products in online reviews. In this talk we will give an
	overview of the PAN shared tasks that
	since 2013 have been organised at CLEF and FIRE evaluation forums, mainly on
	age and gender identification in social media, although personality
	recognition in Twitter as well as in source code was also addressed.
	In 2017 the PAN author profiling shared task jointly addresses gender and
	language variety identification in Twitter, where tweets have been annotated
	with authors' gender and the specific variety of their native language:
	English (Australia, Canada, Great Britain, Ireland, New Zealand, United
	States), Spanish (Argentina, Chile, Colombia, Mexico, Peru, Spain, Venezuela),
	Portuguese (Brazil, Portugal), and Arabic (Egypt, Gulf, Levantine, Maghrebi).</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rosso:2017:VarDial</bibkey>
  </paper>

  <paper id="1206">
    <title>The similarity and Mutual Intelligibility between Amharic and Tigrigna Varieties</title>
    <author><first>Tekabe Legesse</first><last>Feleke</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>47&#8211;54</pages>
    <url>http://www.aclweb.org/anthology/W17-1206</url>
    <abstract>The present study has examined the similarity and the mutual intelligibility
	between Amharic and Tigrigna using three tools, namely Levenshtein distance,
	an intelligibility test, and questionnaires. The study has shown that both
	Tigrigna varieties have almost equal phonetic and lexical distances from
	Amharic. The study also indicated that Amharic speakers understand less than
	50% of the two varieties. Furthermore, the study showed that Amharic speakers
	are more positive about the Ethiopian Tigrigna variety than the Eritrean
	variety. However, their attitude towards the two varieties does not have an
	impact on their intelligibility. The Amharic speakers’ familiarity with the
	Tigrigna varieties is largely dependent on the genealogical relation between
	Amharic and the two Tigrigna varieties.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>feleke:2017:VarDial</bibkey>
  </paper>

  <paper id="1207">
    <title>Why Catalan-Spanish Neural Machine Translation? Analysis, comparison and combination with standard Rule and Phrase-based technologies</title>
    <author><first>Marta R.</first><last>Costa-juss&#224;</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>55&#8211;62</pages>
    <url>http://www.aclweb.org/anthology/W17-1207</url>
    <abstract>Catalan and Spanish are two related languages given that both derive from
	Latin.
	They share similarities in several linguistic levels including morphology,
	syntax and semantics. This makes them particularly interesting for the MT task.
	Given the recent appearance and popularity of neural MT, this paper analyzes
	the
	performance of this new approach compared to the well-established rule-based
	and phrase-based MT systems.
	Experiments are reported on a large database of 180 million words. Results,
	in terms of standard automatic measures, show that neural MT clearly
	outperforms the rule-based and phrase-based MT systems on the in-domain test
	set, but it is worse on the out-of-domain test set. A naive system combination
	works especially well for the latter.
	In-domain manual analysis shows that neural MT tends to improve both adequacy
	and fluency, for example, by being able to generate more natural translations
	instead of literal ones, choosing the adequate target word when the source
	word has several translations, and improving gender agreement. However,
	out-of-domain manual analysis shows how neural MT is more affected by unknown
	words or contexts.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>costajussa:2017:VarDial</bibkey>
  </paper>

  <paper id="1208">
    <title>Kurdish Interdialect Machine Translation</title>
    <author><first>Hossein</first><last>Hassani</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>63&#8211;72</pages>
    <url>http://www.aclweb.org/anthology/W17-1208</url>
    <abstract>This research suggests a method for machine translation between two Kurdish
	dialects. We chose the two widely spoken dialects, Kurmanji and Sorani, which
	are considered to be mutually unintelligible. Also, despite being spoken by
	about 30 million people in different countries, Kurdish is among less-resourced
	languages. The research used bi-dialectal dictionaries and showed that the lack
	of parallel corpora is not a major obstacle in machine translation between the
	two dialects. The experiments showed that the machine translated texts are
	comprehensible to those who do not speak the dialect. The research is the first
	attempt for inter-dialect machine translation in Kurdish and particularly could
	help in making online texts in one dialect comprehensible to those who only
	speak the target dialect. The results showed that the translated texts were
	rated as understandable in 71% of cases for Kurmanji and 79% for Sorani,
	and as slightly understandable in the remaining 29% and 21% of cases,
	respectively.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hassani:2017:VarDial</bibkey>
  </paper>

  <paper id="1209">
    <title>Twitter Language Identification Of Similar Languages And Dialects Without Ground Truth</title>
    <author><first>Jennifer</first><last>Williams</last></author>
    <author><first>Charlie</first><last>Dagli</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>73&#8211;83</pages>
    <url>http://www.aclweb.org/anthology/W17-1209</url>
    <abstract>We present a new method to bootstrap filter Twitter language ID labels in our
	dataset for automatic language identification (LID). Our method combines
	geo-location, original Twitter LID labels, and Amazon Mechanical Turk to
	resolve missing and unreliable labels. We are the first to compare LID
	classification performance using the MIRA algorithm and langid.py. We show
	classifier performance on different versions of our dataset with high accuracy
	using only Twitter data, without ground truth, and very few training examples.
	We also show how Platt Scaling can be used to calibrate MIRA classifier output
	values into a probability distribution over candidate classes, making the
	output more intuitive. Our method allows for fine-grained distinctions between
	similar languages and dialects and allows us to rediscover the language
	composition of our Twitter dataset.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>williams-dagli:2017:VarDial</bibkey>
  </paper>

  <paper id="1210">
    <title>Multi-source morphosyntactic tagging for spoken Rusyn</title>
    <author><first>Yves</first><last>Scherrer</last></author>
    <author><first>Achim</first><last>Rabus</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>84&#8211;92</pages>
    <url>http://www.aclweb.org/anthology/W17-1210</url>
    <abstract>This paper deals with the development of morphosyntactic taggers for spoken
	varieties of the Slavic minority language Rusyn. As neither annotated corpora
	nor parallel corpora are electronically available for Rusyn, we propose to
	combine existing resources from the etymologically close Slavic languages
	Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as
	tagging toolkit, we show that a tagger trained on a balanced set of the four
	source languages outperforms single language taggers by about 9%, and that
	additional automatically induced morphosyntactic lexicons lead to further
	improvements. The best observed accuracies for Rusyn are 82.4% for
	part-of-speech tagging and 75.5% for full morphological tagging.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>scherrer-rabus:2017:VarDial</bibkey>
  </paper>

  <paper id="1211">
    <title>Identifying dialects with textual and acoustic cues</title>
    <author><first>Abualsoud</first><last>Hanani</last></author>
    <author><first>Aziz</first><last>Qaroush</last></author>
    <author><first>Stephen</first><last>Taylor</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>93&#8211;101</pages>
    <url>http://www.aclweb.org/anthology/W17-1211</url>
    <abstract>We describe several systems for identifying short samples of Arabic or
	Swiss-German dialects, which were prepared for the shared task of the 2017 DSL
	Workshop. The Arabic data comprises both text and acoustic files, and our best
	run combined both. The Swiss-German data is text-only. Coincidentally, our
	best runs achieved an accuracy of nearly 63% on both the Swiss-German and
	Arabic dialect tasks.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>hanani-qaroush-taylor:2017:VarDial</bibkey>
  </paper>

  <paper id="1212">
    <title>Evaluating HeLI with Non-Linear Mappings</title>
    <author><first>Tommi</first><last>Jauhiainen</last></author>
    <author><first>Krister</first><last>Lind&#233;n</last></author>
    <author><first>Heidi</first><last>Jauhiainen</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>102&#8211;108</pages>
    <url>http://www.aclweb.org/anthology/W17-1212</url>
    <abstract>In this paper we describe the non-linear mappings we used with the Helsinki
	language identification method, HeLI, in the 4th edition of the Discriminating
	between Similar Languages (DSL) shared task, which was organized as part of the
	VarDial 2017 workshop. Our SUKI team participated in the closed track together
	with 10 other teams. Our system reached the 7th position in the track. We
	describe the HeLI method and the non-linear mappings in mathematical notation.
	The HeLI method uses a probabilistic model with character n-grams and
	word-based backoff. We also describe our trials using the non-linear mappings
	instead of relative frequencies and we present statistics about the back-off
	function of the HeLI method.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>jauhiainen-linden-jauhiainen:2017:VarDial</bibkey>
  </paper>

  <paper id="1213">
    <title>A Perplexity-Based Method for Similar Languages Discrimination</title>
    <author><first>Pablo</first><last>Gamallo</last></author>
    <author><first>Jose Ramom</first><last>Pichel</last></author>
    <author><first>I&#241;aki</first><last>Alegria</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>109&#8211;114</pages>
    <url>http://www.aclweb.org/anthology/W17-1213</url>
    <abstract>This article describes the system submitted by the Citius_Ixa_Imaxin team to
	the VarDial 2017 shared tasks (DSL and GDI). The strategy underlying our system
	is based on a language distance computed by means of model perplexity. The best
	model configuration we have tested is a voting system making use of several
	$n$-gram models of both words and characters, even if word unigrams turned out
	to be a very competitive model with reasonable results in the tasks in which we
	participated. An error analysis has been performed in which we identified many
	test examples with no linguistic evidence to distinguish among the variants.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gamallo-pichel-alegria:2017:VarDial</bibkey>
  </paper>

  <paper id="1214">
    <title>Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets</title>
    <author><first>Yves</first><last>Bestgen</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>115&#8211;123</pages>
    <url>http://www.aclweb.org/anthology/W17-1214</url>
    <abstract>This paper describes the system developed by the Centre for English Corpus
	Linguistics (CECL) for discriminating between similar languages, language
	varieties and dialects. Based on an SVM with character and POS-tag n-grams as
	features and the
	BM25 weighting scheme, it achieved 92.7% accuracy in the Discriminating
	between Similar Languages (DSL) task, ranking first among eleven systems but
	with a lead over the next three teams of only 0.2%. A simpler version of the
	system ranked second in the German Dialect Identification (GDI) task thanks to
	several ad hoc postprocessing steps. Complementary analyses carried out by a
	cross-validation procedure suggest that the BM25 weighting scheme could be
	competitive in this type of task, at least in comparison with sublinear
	TF-IDF. POS-tag n-grams also improved the system performance.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bestgen:2017:VarDial</bibkey>
  </paper>

  <paper id="1215">
    <title>Discriminating between Similar Languages with Word-level Convolutional Neural Networks</title>
    <author><first>Marcelo</first><last>Criscuolo</last></author>
    <author><first>Sandra Maria</first><last>Aluisio</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>124&#8211;130</pages>
    <url>http://www.aclweb.org/anthology/W17-1215</url>
    <abstract>Discriminating between Similar Languages (DSL) is a challenging task addressed
	at the VarDial Workshop series. We report on our participation in the DSL
	shared task with a two-stage system. In the first stage, character n-grams are
	used to separate language groups, then specialized classifiers distinguish
	similar language varieties. We have conducted experiments with three system
	configurations and submitted one run for each. Our main approach is a
	word-level convolutional neural network (CNN) that learns task-specific vectors
	with minimal text preprocessing. We also experiment with multi-layer perceptron
	(MLP) networks and another hybrid configuration. Our best run achieved an
	accuracy of 90.76%, ranking 8th among 11 participants and getting very close to
	the system that ranked first (less than 2 points). Even though the CNN model
	could not achieve the best results, it is still a viable approach to
	discriminating between similar languages.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>criscuolo-aluisio:2017:VarDial</bibkey>
  </paper>

  <paper id="1216">
    <title>Cross-lingual dependency parsing for closely related languages - Helsinki's submission to VarDial 2017</title>
    <author><first>J&#246;rg</first><last>Tiedemann</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>131&#8211;136</pages>
    <url>http://www.aclweb.org/anthology/W17-1216</url>
    <abstract>This paper describes the submission from the University of Helsinki to the
	shared task on cross-lingual dependency parsing at VarDial 2017. We present
	work on annotation projection and treebank translation that gave good results
	for all three target languages in the test set. In particular, Slovak seems to
	work well with information coming from the Czech treebank, which is in line
	with related work. The attachment scores for cross-lingual models even surpass
	the fully supervised models trained on the target language treebank. Croatian
	is the most difficult language in the test set and the improvements over the
	baseline are rather modest. Norwegian works best with information coming from
	Swedish whereas Danish contributes surprisingly little.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>tiedemann:2017:VarDial</bibkey>
  </paper>

  <paper id="1217">
    <title>Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words</title>
    <author><first>Helena</first><last>Gomez</last></author>
    <author><first>Ilia</first><last>Markov</last></author>
    <author><first>Jorge</first><last>Baptista</last></author>
    <author><first>Grigori</first><last>Sidorov</last></author>
    <author><first>David</first><last>Pinto</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>137&#8211;145</pages>
    <url>http://www.aclweb.org/anthology/W17-1217</url>
    <abstract>This paper presents the cic_ualg's system that took part in the Discriminating
	between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop.
	This year's task aims at identifying 14 languages across 6 language groups
	using a corpus of excerpts of journalistic texts. Two classification approaches
	were compared: a single-step (all languages) approach and a two-step (language
	group and then languages within the group) approach. Features exploited include
	lexical features (unigrams of words) and character n-grams. Besides traditional
	(untyped) character n-grams, we introduce typed character n-grams in the DSL
	task. Experiments were carried out with different feature representation
	methods (binary and raw term frequency), frequency threshold values, and
	machine-learning algorithms &#8211; Support Vector Machines (SVM) and Multinomial
	Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>gomez-EtAl:2017:VarDial</bibkey>
  </paper>

  <paper id="1218">
    <title>T&#252;bingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing</title>
    <author><first>&#199;a&#287;rı</first><last>&#199;&#246;ltekin</last></author>
    <author><first>Taraka</first><last>Rama</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>146&#8211;155</pages>
    <url>http://www.aclweb.org/anthology/W17-1218</url>
    <abstract>This paper describes our systems and results on the VarDial 2017 shared
	tasks. Besides three language/dialect discrimination tasks, we also
	participated in the cross-lingual dependency parsing (CLP) task using
	a simple methodology which we also briefly describe in this paper.
	For all the discrimination tasks, we used linear SVMs with character
	and word features. The system achieves competitive results among
	other systems in the shared task. We also report additional
	experiments with neural network models. The performance of neural
	network models was close but always below the corresponding SVM
	classifiers in the discrimination tasks.
	For the cross-lingual parsing task, we experimented with an approach
	based on automatically translating the source treebank to the target
	language, and training a parser on the translated treebank. We used
	off-the-shelf tools for both translation and parsing. Despite
	achieving better-than-baseline results, our scores in the CLP task were
	substantially lower than the scores of the other participants.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ccoltekin-rama:2017:VarDial</bibkey>
  </paper>

  <paper id="1219">
    <title>When Sparse Traditional Models Outperform Dense Neural Networks: the Curious Case of Discriminating between Similar Languages</title>
    <author><first>Maria</first><last>Medvedeva</last></author>
    <author><first>Martin</first><last>Kroon</last></author>
    <author><first>Barbara</first><last>Plank</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>156&#8211;163</pages>
    <url>http://www.aclweb.org/anthology/W17-1219</url>
    <abstract>We present the results of our participation in the VarDial 4 shared task on
	discriminating closely related languages. Our submission includes simple
	traditional models using linear support vector machines (SVMs) and a neural
	network (NN). The main idea was to leverage language group information. We did
	so with a two-layer approach in the traditional model and a multi-task
	objective in the neural network case. Our results confirm earlier findings:
	simple traditional models outperform neural networks consistently for this
	task, at least given the number of systems we could examine in the available
	time. Our two-layer linear SVM ranked 2nd in the shared task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>medvedeva-kroon-plank:2017:VarDial</bibkey>
  </paper>

  <paper id="1220">
    <title>German Dialect Identification in Interview Transcriptions</title>
    <author><first>Shervin</first><last>Malmasi</last></author>
    <author><first>Marcos</first><last>Zampieri</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>164&#8211;169</pages>
    <url>http://www.aclweb.org/anthology/W17-1220</url>
    <abstract>This paper presents three systems submitted to the German Dialect
	Identification (GDI) task at the VarDial Evaluation Campaign 2017. The task
	consists of training models to identify the dialect of Swiss German speech
	transcripts. The dialects included in the GDI dataset are Basel, Bern, Lucerne,
	and Zurich. The three systems we submitted are based on: a plurality ensemble,
	a mean probability ensemble, and a meta-classifier trained on character and
	word n-grams. The best results were obtained by the meta-classifier achieving
	68.1% accuracy and 66.2% F1-score, ranking first among the 10 teams which
	participated in the GDI shared task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>malmasi-zampieri:2017:VarDial1</bibkey>
  </paper>

  <paper id="1221">
    <title>CLUZH at VarDial GDI 2017: Testing a Variety of Machine Learning Tools for the Classification of Swiss German Dialects</title>
    <author><first>Simon</first><last>Clematide</last></author>
    <author><first>Peter</first><last>Makarov</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>170&#8211;177</pages>
    <url>http://www.aclweb.org/anthology/W17-1221</url>
    <abstract>Our submissions for the GDI 2017 Shared Task are the results from three
	different types of classifiers: Na&#239;ve Bayes, Conditional Random Fields
	(CRF), and Support Vector Machine (SVM). Our CRF-based run achieves a weighted
	F1 score of 65% (third rank), beaten by the best system by 0.9%. Measured
	by classification accuracy, our ensemble run (Na&#239;ve Bayes, CRF, SVM) reaches
	67% (second rank), 1% lower than the best system. We also describe our
	experiments with Recurrent Neural Network (RNN) architectures. Since they
	performed worse than our non-neural approaches we did not include them in the
	submission.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>clematide-makarov:2017:VarDial</bibkey>
  </paper>

  <paper id="1222">
    <title>Arabic Dialect Identification Using iVectors and ASR Transcripts</title>
    <author><first>Shervin</first><last>Malmasi</last></author>
    <author><first>Marcos</first><last>Zampieri</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>178&#8211;183</pages>
    <url>http://www.aclweb.org/anthology/W17-1222</url>
    <abstract>This paper presents the systems submitted by the MAZA team to the Arabic
	Dialect Identification (ADI) shared task at the VarDial Evaluation Campaign
	2017. The goal of the task is to evaluate computational models to identify the
	dialect of Arabic utterances using both audio and text transcriptions. The ADI
	shared task dataset included Modern Standard Arabic (MSA) and four Arabic
	dialects: Egyptian, Gulf, Levantine, and North-African. The three systems
	submitted by MAZA are based on combinations of multiple machine learning
	classifiers arranged as (1) voting ensemble; (2) mean probability ensemble; (3)
	meta-classifier. The best results were obtained by the meta-classifier
	achieving 71.7% accuracy, ranking second among the six teams which participated
	in the ADI shared task.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>malmasi-zampieri:2017:VarDial2</bibkey>
  </paper>

  <paper id="1223">
    <title>Discriminating between Similar Languages using Weighted Subword Features</title>
    <author><first>Adrien</first><last>Barbaresi</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>184&#8211;189</pages>
    <url>http://www.aclweb.org/anthology/W17-1223</url>
    <abstract>The present contribution revolves around a contrastive subword n-gram model
	which has been tested in the Discriminating between Similar Languages shared
	task. I present and discuss the method used in this 14-way language
	identification task comprising varieties of 6 main language groups. It features
	the following characteristics: (1) the preprocessing and conversion of a
	collection of documents to sparse features; (2) weighted character n-gram
	profiles; (3) a multinomial Bayesian classifier. Meaningful bag-of-n-grams
	features can be used as a system in a straightforward way; my approach
	outperforms most of the systems used in the DSL shared task (3rd rank).</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>barbaresi:2017:VarDial</bibkey>
  </paper>

  <paper id="1224">
    <title>Exploring Lexical and Syntactic Features for Language Variety Identification</title>
    <author><first>Chris</first><last>van der Lee</last></author>
    <author><first>Antal</first><last>van den Bosch</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>190&#8211;199</pages>
    <url>http://www.aclweb.org/anthology/W17-1224</url>
    <abstract>We present a method to discriminate between texts written in either the
	Netherlandic or the Flemish variant of the Dutch language. The method draws on
	a feature bundle representing text statistics, syntactic features, and word
	$n$-grams. Text statistics include average word length and sentence length,
	while syntactic features include ratios of function words and part-of-speech
	$n$-grams.
	        The effectiveness of the classifier was measured by classifying Dutch
	subtitles developed for either Dutch or Flemish television. Several machine
	learning algorithms were compared as well as feature combination methods in
	order to find the optimal generalization performance. A machine-learning meta
	classifier based on AdaBoost attained the best F-score of 0.92.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>vanderlee-vandenbosch:2017:VarDial</bibkey>
  </paper>

  <paper id="1225">
    <title>Learning to Identify Arabic and German Dialects using Multiple Kernels</title>
    <author><first>Radu Tudor</first><last>Ionescu</last></author>
    <author><first>Andrei</first><last>Butnaru</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>200&#8211;209</pages>
    <url>http://www.aclweb.org/anthology/W17-1225</url>
    <abstract>We present a machine learning approach for the Arabic Dialect Identification
	(ADI) and the German Dialect Identification (GDI) Closed Shared Tasks of the
	DSL 2017 Challenge. The proposed approach combines several kernels using
	multiple kernel learning. While most of our kernels are based on character
	p-grams (also known as n-grams) extracted from speech transcripts, we also use
	a kernel based on i-vectors, a low-dimensional representation of audio
	recordings, provided only for the Arabic data. In the learning stage, we
	independently employ Kernel Discriminant Analysis (KDA) and Kernel Ridge
	Regression (KRR). Our approach is shallow and simple, but the empirical results
	obtained in the shared tasks show that it achieves very good results. Indeed,
	we ranked first in the ADI Shared Task with a weighted F1 score of
	76.32% (4.62% above the second place) and fifth in the GDI Shared
	Task with a weighted F1 score of 63.67% (2.57% below the first place).</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ionescu-butnaru:2017:VarDial</bibkey>
  </paper>

  <paper id="1226">
    <title>Slavic Forest, Norwegian Wood</title>
    <author><first>Rudolf</first><last>Rosa</last></author>
    <author><first>Daniel</first><last>Zeman</last></author>
    <author><first>David</first><last>Mare&#x10D;ek</last></author>
    <author><first>Zdeněk</first><last>&#x17D;abokrtsk&#253;</last></author>
    <booktitle>Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>210&#8211;219</pages>
    <url>http://www.aclweb.org/anthology/W17-1226</url>
    <abstract>We once had a corp,
	or should we say, it once had us
	They showed us its tags,
	isn't it great, unified tags
	They asked us to parse
	and they told us to use everything
	So we looked around
	and we noticed there was near nothing
	We took other langs,
	bitext aligned: words one-to-one
	We played for two weeks,
	and then they said, here is the test
	The parser kept training till morning,
	just until deadline
	So we had to wait and hope what we get
	would be just fine
	And, when we awoke,
	the results were done, we saw we'd won
	So, we wrote this paper,
	isn't it good, Norwegian wood.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>rosa-EtAl:2017:VarDial</bibkey>
  </paper>

</volume>