Building Web Corpora for Minority Languages

Heidi Jauhiainen; Tommi Jauhiainen; Krister Lindén

Abstract

Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the “Finno-Ugric Languages and the Internet” (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project.
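The idea of identifying the language while crawling can be illustrated with a minimal sketch. It is not the Suki project's implementation: the character-trigram identifier is a toy stand-in for a real language identification system, the policy of following links only from pages identified as being in a target language is an assumption made for illustration, and all names (`train`, `identify`, `crawl`, `WEB`) as well as the in-memory "web" are hypothetical.

```python
from collections import Counter, deque


def trigrams(text):
    """Return the character trigrams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Build one relative-frequency trigram model per language."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(trigrams(text))
        total = sum(counts.values())
        models[lang] = {gram: n / total for gram, n in counts.items()}
    return models


def identify(text, models):
    """Pick the language whose model gives the text's trigrams the highest summed frequency."""
    grams = trigrams(text)
    scores = {
        lang: sum(model.get(gram, 0.0) for gram in grams)
        for lang, model in models.items()
    }
    return max(scores, key=scores.get)


def crawl(seeds, fetch, models, target_langs, max_pages=100):
    """Breadth-first crawl that identifies the language of each page as it is fetched."""
    queue = deque(seeds)
    seen = set(seeds)
    kept = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        page = fetch(url)
        fetched += 1
        if page is None:
            continue
        text, links = page
        lang = identify(text, models)
        if lang not in target_langs:
            continue  # assumed policy: drop the page and do not follow its links
        kept[url] = (lang, text)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept


# Toy training data; Finnish and English stand in for the target and dominant languages.
models = train({
    "fi": "tämä on suomenkielinen sivu ja siinä on paljon tekstiä suomeksi",
    "en": "this is an english page and it has a lot of text in english",
})

# Hypothetical in-memory "web": url -> (text, outgoing links).
WEB = {
    "a": ("tämä sivu on suomeksi", ["b", "c"]),
    "b": ("this page is in english", ["d"]),
    "c": ("siinä on paljon tekstiä", []),
    "d": ("tämä on tekstiä", []),
}

kept = crawl(["a"], WEB.get, models, target_langs={"fi"})
```

With this toy data the crawl keeps pages `a` and `c`. Page `d` is never fetched because it is linked only from the English page `b`, which shows the trade-off in the assumed link policy: filtering early saves bandwidth but can miss target-language pages reachable only through dominant-language pages.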

Anthology ID: 2020.wac-1.4
Volume: Proceedings of the 12th Web as Corpus Workshop
Month: May
Year: 2020
Address: Marseille, France
Editors: Adrien Barbaresi, Felix Bildhauer, Roland Schäfer, Egon Stemle
Venue: WAC
Publisher: European Language Resources Association
Pages: 23–32
Language: English
URL: https://aclanthology.org/2020.wac-1.4/
Cite (ACL): Heidi Jauhiainen, Tommi Jauhiainen, and Krister Lindén. 2020. Building Web Corpora for Minority Languages. In Proceedings of the 12th Web as Corpus Workshop, pages 23–32, Marseille, France. European Language Resources Association.
Cite (Informal): Building Web Corpora for Minority Languages (Jauhiainen et al., WAC 2020)
PDF: https://aclanthology.org/2020.wac-1.4.pdf

Export citation

BibTeX

@inproceedings{jauhiainen-etal-2020-building,
    title = "Building Web Corpora for Minority Languages",
    author = "Jauhiainen, Heidi  and
      Jauhiainen, Tommi  and
      Lind{\'e}n, Krister",
    editor = {Barbaresi, Adrien  and
      Bildhauer, Felix  and
      Sch{\"a}fer, Roland  and
      Stemle, Egon},
    booktitle = "Proceedings of the 12th Web as Corpus Workshop",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.wac-1.4/",
    pages = "23--32",
    language = "eng",
    ISBN = "979-10-95546-68-9",
    abstract = "Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the ``Finno-Ugric Languages and the Internet'' (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="jauhiainen-etal-2020-building">
    <titleInfo>
        <title>Building Web Corpora for Minority Languages</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Heidi</namePart>
        <namePart type="family">Jauhiainen</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Tommi</namePart>
        <namePart type="family">Jauhiainen</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Krister</namePart>
        <namePart type="family">Lindén</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2020-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <language>
        <languageTerm type="text">eng</languageTerm>
    </language>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 12th Web as Corpus Workshop</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Adrien</namePart>
            <namePart type="family">Barbaresi</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Felix</namePart>
            <namePart type="family">Bildhauer</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Roland</namePart>
            <namePart type="family">Schäfer</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Egon</namePart>
            <namePart type="family">Stemle</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>European Language Resources Association</publisher>
            <place>
                <placeTerm type="text">Marseille, France</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-10-95546-68-9</identifier>
    </relatedItem>
    <abstract>Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the “Finno-Ugric Languages and the Internet” (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project.</abstract>
    <identifier type="citekey">jauhiainen-etal-2020-building</identifier>
    <location>
        <url>https://aclanthology.org/2020.wac-1.4/</url>
    </location>
    <part>
        <date>2020-05</date>
        <extent unit="page">
            <start>23</start>
            <end>32</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Building Web Corpora for Minority Languages
%A Jauhiainen, Heidi
%A Jauhiainen, Tommi
%A Lindén, Krister
%Y Barbaresi, Adrien
%Y Bildhauer, Felix
%Y Schäfer, Roland
%Y Stemle, Egon
%S Proceedings of the 12th Web as Corpus Workshop
%D 2020
%8 May
%I European Language Resources Association
%C Marseille, France
%@ 979-10-95546-68-9
%G eng
%F jauhiainen-etal-2020-building
%X Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the “Finno-Ugric Languages and the Internet” (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project.
%U https://aclanthology.org/2020.wac-1.4/
%P 23-32

Markdown (Informal)

[Building Web Corpora for Minority Languages](https://aclanthology.org/2020.wac-1.4/) (Jauhiainen et al., WAC 2020)