Datasets of Slovene and Croatian Moderated News Comments

Nikola Ljubešić; Tomaž Erjavec; Darja Fišer

doi:10.18653/v1/W18-5116

Abstract

This paper presents two large newly constructed datasets of moderated news comments from two highly popular online news portals in the respective countries: the Slovene RTV MCC and the Croatian 24sata. The datasets are analyzed by performing manual annotation of the types of the content which have been deleted by moderators and by investigating deletion trends among users and threads. Next, initial experiments on automatically detecting the deleted content in the datasets are presented. Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content. Finally, the baseline classification models trained on the non-encrypted datasets are disseminated as well to enable real-world use.

Anthology ID: W18-5116
Volume: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
Month: October
Year: 2018
Address: Brussels, Belgium
Editors: Darja Fišer, Ruihong Huang, Vinodkumar Prabhakaran, Rob Voigt, Zeerak Waseem, Jacqueline Wernimont
Venue: ALW
Publisher: Association for Computational Linguistics
Pages: 124–131
URL: https://aclanthology.org/W18-5116/
DOI: 10.18653/v1/W18-5116
Cite (ACL): Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2018. Datasets of Slovene and Croatian Moderated News Comments. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pages 124–131, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): Datasets of Slovene and Croatian Moderated News Comments (Ljubešić et al., ALW 2018)
PDF: https://aclanthology.org/W18-5116.pdf

Export citation

BibTeX

@inproceedings{ljubesic-etal-2018-datasets,
    title = "Datasets of {S}lovene and {C}roatian Moderated News Comments",
    author = "Ljube{\v{s}}i{\'c}, Nikola  and
      Erjavec, Toma{\v{z}}  and
      Fi{\v{s}}er, Darja",
    editor = "Fi{\v{s}}er, Darja  and
      Huang, Ruihong  and
      Prabhakaran, Vinodkumar  and
      Voigt, Rob  and
      Waseem, Zeerak  and
      Wernimont, Jacqueline",
    booktitle = "Proceedings of the 2nd Workshop on Abusive Language Online ({ALW}2)",
    month = oct,
    year = "2018",
    address = "Brussels, Belgium",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W18-5116/",
    doi = "10.18653/v1/W18-5116",
    pages = "124--131",
    abstract = "This paper presents two large newly constructed datasets of moderated news comments from two highly popular online news portals in the respective countries: the Slovene RTV MCC and the Croatian 24sata. The datasets are analyzed by performing manual annotation of the types of the content which have been deleted by moderators and by investigating deletion trends among users and threads. Next, initial experiments on automatically detecting the deleted content in the datasets are presented. Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content. Finally, the baseline classification models trained on the non-encrypted datasets are disseminated as well to enable real-world use."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="ljubesic-etal-2018-datasets">
    <titleInfo>
        <title>Datasets of Slovene and Croatian Moderated News Comments</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Nikola</namePart>
        <namePart type="family">Ljubešić</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Tomaž</namePart>
        <namePart type="family">Erjavec</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Darja</namePart>
        <namePart type="family">Fišer</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2018-10</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Darja</namePart>
            <namePart type="family">Fišer</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ruihong</namePart>
            <namePart type="family">Huang</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Vinodkumar</namePart>
            <namePart type="family">Prabhakaran</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Rob</namePart>
            <namePart type="family">Voigt</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Zeerak</namePart>
            <namePart type="family">Waseem</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Jacqueline</namePart>
            <namePart type="family">Wernimont</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Brussels, Belgium</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>This paper presents two large newly constructed datasets of moderated news comments from two highly popular online news portals in the respective countries: the Slovene RTV MCC and the Croatian 24sata. The datasets are analyzed by performing manual annotation of the types of the content which have been deleted by moderators and by investigating deletion trends among users and threads. Next, initial experiments on automatically detecting the deleted content in the datasets are presented. Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content. Finally, the baseline classification models trained on the non-encrypted datasets are disseminated as well to enable real-world use.</abstract>
    <identifier type="citekey">ljubesic-etal-2018-datasets</identifier>
    <identifier type="doi">10.18653/v1/W18-5116</identifier>
    <location>
        <url>https://aclanthology.org/W18-5116/</url>
    </location>
    <part>
        <date>2018-10</date>
        <extent unit="page">
            <start>124</start>
            <end>131</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Datasets of Slovene and Croatian Moderated News Comments
%A Ljubešić, Nikola
%A Erjavec, Tomaž
%A Fišer, Darja
%Y Fišer, Darja
%Y Huang, Ruihong
%Y Prabhakaran, Vinodkumar
%Y Voigt, Rob
%Y Waseem, Zeerak
%Y Wernimont, Jacqueline
%S Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
%D 2018
%8 October
%I Association for Computational Linguistics
%C Brussels, Belgium
%F ljubesic-etal-2018-datasets
%X This paper presents two large newly constructed datasets of moderated news comments from two highly popular online news portals in the respective countries: the Slovene RTV MCC and the Croatian 24sata. The datasets are analyzed by performing manual annotation of the types of the content which have been deleted by moderators and by investigating deletion trends among users and threads. Next, initial experiments on automatically detecting the deleted content in the datasets are presented. Both datasets are published in encrypted form, to enable others to perform experiments on detecting content to be deleted without revealing potentially inappropriate content. Finally, the baseline classification models trained on the non-encrypted datasets are disseminated as well to enable real-world use.
%R 10.18653/v1/W18-5116
%U https://aclanthology.org/W18-5116/
%U https://doi.org/10.18653/v1/W18-5116
%P 124-131

Markdown (Informal)

[Datasets of Slovene and Croatian Moderated News Comments](https://aclanthology.org/W18-5116/) (Ljubešić et al., ALW 2018)