Janne Bondi Johannessen

2018

The Norwegian Dependency Treebank is a new syntactic treebank for Norwegian Bokmål and Nynorsk with manual syntactic and morphological annotation, developed at the National Library of Norway in collaboration with the University of Oslo. It is the first publicly available treebank for Norwegian. This paper presents the core principles behind the syntactic annotation and how these principles were applied in certain specific cases. We then present the selection of texts and their distribution between genres, as well as the annotation process and an evaluation of the inter-annotator agreement. Finally, we present the first results of data-driven dependency parsing of Norwegian, contrasting four state-of-the-art dependency parsers trained on the treebank. The consistency and the parsability of this treebank are shown to be comparable to those of other large treebank initiatives.

2013

Exploring Features for Named Entity Recognition in Lithuanian Text Corpus
Jurgita Kapočiūtė-Dzikienė | Anders Nøklestad | Janne Bondi Johannessen | Algis Krupavičius
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)
Stephan Oepen | Kristin Hagen | Janne Bondi Johannessen
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

The Nordic Dialect Corpus
Janne Bondi Johannessen | Joel Priestley | Kristin Hagen | Anders Nøklestad | André Lynum
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we describe the Nordic Dialect Corpus, which has recently been completed. The corpus has a variety of features that combined make it an advanced tool for language researchers. These features include: linguistic contents (dialects from five closely related languages), annotation (tagging and two types of transcription), search interface (advanced possibilities for combining a large array of search criteria and results presentation in an intuitive and simple interface), many search variables (linguistics-based, informant-based, time-based), multimedia display (linking of sound and video to transcriptions), display of results in maps, display of informant details (number of words and other information on informants), and advanced results handling (concordances, collocations, counts and statistics shown in a variety of graphical modes, plus further processing). Finally, and importantly, the corpus is freely available for research on the web. We give examples of various kinds of searches, of displays of results, and of results handling.

2011

What kind of corpus is a web corpus?
Janne Bondi Johannessen | Emiliano Raul Guevara
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

2010

A Multilingual Speech Resource: The Nordic Dialect Corpus
Janne Bondi Johannessen | Joel Priestley | Anders Nøklestad
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

Workshop on Advanced Corpus Solutions
Janne Bondi Johannessen
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation

Enhancing Language Resources with Maps
Janne Bondi Johannessen | Kristin Hagen | Anders Nøklestad | Joel Priestley
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We look at how maps can be integrated into research resources such as language databases and language corpora. By using maps, search results can be illustrated in a way that immediately gives the user information that words or numbers on their own would not give. We illustrate with two different resources into which we have integrated a Google Maps application: The Nordic Dialect Corpus (Johannessen et al. 2009) and The Nordic Syntactic Judgments Database (Lindstad et al. 2009). The database contains some hundred syntactic test sentences that have been evaluated by four speakers in more than a hundred locations in Norway and Sweden. Searching for the evaluations of a particular sentence gives a list of several hundred judgments, which are difficult for a human researcher to assess. With the map option, isoglosses are immediately visible. We show in the paper that, both for the maps depicting corpus hits and for the maps depicting database results, the map visualizations show clear geographical differences that would be very difficult to spot just by reading concordance lines or database tables.

2009

The Nordic Dialect Database: Mapping Microsyntactic Variation in the Scandinavian Languages
Arne Martinus Lindstad | Anders Nøklestad | Janne Bondi Johannessen | Øystein Alexander Vangsnes
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

The Nordic Dialect Corpus–an advanced research tool
Janne Bondi Johannessen | Joel James Priestley | Kristin Hagen | Tor Anders Åfarli | Øystein Alexander Vangsnes
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

Evaluation of Linguistics-Based Translation
Janne Bondi Johannessen | Torbjørn Nordgård | Lars Nygaard
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We report on the evaluation of the Norwegian-English MT prototype system LOGON. The system is rule-based and makes use of well-established frameworks for analysis and generation (LFG and HPSG). Minimal Recursion Semantics is the glue which performs transfer from source to target language and serves as the information vehicle between LFG and HPSG. The project-internal testing uses material from the training data sources in the domain of guidebooks for mountain hiking in the summer season in Southern Norway. This testing, involving eight external assessors, yielded 57% translated sentences, with acceptable fidelity measures but with less than acceptable fluency measures. Additional test 1: The LOGON system is sensitive to vocabulary, so we were interested to see to what extent the system would be able to carry over to new texts from the same narrow domain. With only 22% acceptable translations, this test had disappointing results. Additional test 2: Given the grammatical backbone of the system, we found it important to test it on a syntactic test suite with only known vocabulary. Here, 55% of the sentences had good translations. The tests show that even within a very narrow semantic domain, vocabulary sensitivity is the most crucial obstacle for this approach.

Glossa: a Multilingual, Multimodal, Configurable User Interface
Lars Nygaard | Joel Priestley | Anders Nøklestad | Janne Bondi Johannessen
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We describe a web-based corpus query system, Glossa, which combines the expressiveness of regular query languages with the user-friendliness of a graphical interface. Since corpus users are usually linguists with little interest in technical matters, we have developed a system where the user need not have any prior knowledge of the search system. Furthermore, no previous knowledge of abbreviations for metavariables such as part of speech and source text is needed. All searches are done using checkboxes, pull-down menus, or by typing simple letters to make words or other strings. Querying for more than one word is done simply by adding an additional query box, and for parts of words by choosing a feature such as start of word. The Glossa system also allows a wide range of viewing and post-processing options. Collocations can be viewed and counted in a number of ways, and displayed as different kinds of graphical charts. Further annotation and deletion of single results for further processing are also easy. The Glossa system is already in use for a number of corpora. Corpus administrators can easily adapt the system to a wide range of corpora, including multilingual corpora and corpora with audio and video content.