AStarTwice at SemEval-2021 Task 5: Toxic Span Detection Using RoBERTa-CRF, Domain Specific Pre-Training and Self-Training

Thakur Ashutosh Suman; Abhinav Jain

doi:10.18653/v1/2021.semeval-1.118

Abstract

This paper describes our contribution to SemEval-2021 Task 5: Toxic Spans Detection. Our solution is built upon the RoBERTa language model and Conditional Random Fields (CRF). We pre-trained RoBERTa on the Civil Comments dataset, enabling it to create better contextual representations for this task. We also employed the semi-supervised learning technique of self-training, which allowed us to extend our training dataset. In addition, we identified some pre-processing steps that significantly improved our F1 score. Our proposed system ranked 41st with an F1 score of 66.16%.
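The abstract names self-training but does not spell out the procedure. The sketch below shows one common form of it, assuming confidence-thresholded pseudo-labelling: train on the labeled data, label the unlabeled pool, keep only confident predictions, and retrain. It is not the authors' implementation. The names `self_train`, `toy_train`, `toy_predict`, the `threshold` and `rounds` parameters, and the toy word-list "model" are all hypothetical stand-ins for a real RoBERTa-CRF tagger.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[str, int]  # (text, label); 1 = toxic, 0 = not toxic


def self_train(
    labeled: List[Example],
    unlabeled: List[str],
    train_fn: Callable[[List[Example]], Any],
    predict_fn: Callable[[Any, str], Tuple[int, float]],
    threshold: float = 0.9,
    rounds: int = 3,
) -> Tuple[Any, List[Example]]:
    """Generic self-training loop (assumed form, not taken from the paper)."""
    train_set = list(labeled)
    pool = list(unlabeled)
    model = train_fn(train_set)
    for _ in range(rounds):
        confident, remaining = [], []
        for text in pool:
            label, confidence = predict_fn(model, text)
            if confidence >= threshold:
                confident.append((text, label))  # accept as pseudo-label
            else:
                remaining.append(text)           # keep for a later round
        if not confident:
            break  # nothing new to learn from
        train_set.extend(confident)
        pool = remaining
        model = train_fn(train_set)  # retrain on the extended dataset
    return model, train_set


# Toy stand-ins so the sketch runs without any ML library.
def toy_train(examples: List[Example]) -> set:
    # The "model" is just the set of words seen in toxic examples.
    return {w for text, label in examples if label == 1 for w in text.split()}


def toy_predict(model: set, text: str) -> Tuple[int, float]:
    # Confident "toxic" if any known toxic word appears, else unsure.
    hit = any(w in model for w in text.split())
    return (1, 1.0) if hit else (0, 0.5)


model, data = self_train(
    [("bad idiot", 1)],
    ["idiot fool", "fool again", "nice day"],
    toy_train,
    toy_predict,
)
```

In the toy run, each round lets the word list grow, so an example that was unsure in the first round can be accepted in the second. A real system would replace the two toy functions with model training and inference, and would typically pseudo-label token or character spans rather than whole comments.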

Anthology ID: 2021.semeval-1.118
Volume: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
Month: August
Year: 2021
Address: Online
Editors: Alexis Palmer, Nathan Schneider, Natalie Schluter, Guy Emerson, Aurelie Herbelot, Xiaodan Zhu
Venue: SemEval
SIG: SIGLEX
Publisher: Association for Computational Linguistics
Pages: 875–880
URL: https://aclanthology.org/2021.semeval-1.118/
DOI: 10.18653/v1/2021.semeval-1.118
Bibkey: suman-jain-2021-astartwice
Cite (ACL): Thakur Ashutosh Suman and Abhinav Jain. 2021. AStarTwice at SemEval-2021 Task 5: Toxic Span Detection Using RoBERTa-CRF, Domain Specific Pre-Training and Self-Training. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 875–880, Online. Association for Computational Linguistics.
Cite (Informal): AStarTwice at SemEval-2021 Task 5: Toxic Span Detection Using RoBERTa-CRF, Domain Specific Pre-Training and Self-Training (Suman & Jain, SemEval 2021)
PDF: https://aclanthology.org/2021.semeval-1.118.pdf

Export citation

BibTeX

@inproceedings{suman-jain-2021-astartwice,
    title = "{AS}tar{T}wice at {S}em{E}val-2021 Task 5: Toxic Span Detection Using {R}o{BERT}a-{CRF}, Domain Specific Pre-Training and Self-Training",
    author = "Suman, Thakur Ashutosh  and
      Jain, Abhinav",
    editor = "Palmer, Alexis  and
      Schneider, Nathan  and
      Schluter, Natalie  and
      Emerson, Guy  and
      Herbelot, Aurelie  and
      Zhu, Xiaodan",
    booktitle = "Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.semeval-1.118/",
    doi = "10.18653/v1/2021.semeval-1.118",
    pages = "875--880",
    abstract = "This paper describes our contribution to SemEval-2021 Task 5: Toxic Spans Detection. Our solution is built upon RoBERTa language model and Conditional Random Fields (CRF). We pre-trained RoBERTa on Civil Comments dataset, enabling it to create better contextual representation for this task. We also employed the semi-supervised learning technique of self-training, which allowed us to extend our training dataset. In addition to these, we also identified some pre-processing steps that significantly improved our F1 score. Our proposed system achieved a rank of 41 with an F1 score of 66.16{\%}."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="suman-jain-2021-astartwice">
    <titleInfo>
        <title>AStarTwice at SemEval-2021 Task 5: Toxic Span Detection Using RoBERTa-CRF, Domain Specific Pre-Training and Self-Training</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Thakur</namePart>
        <namePart type="given">Ashutosh</namePart>
        <namePart type="family">Suman</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Abhinav</namePart>
        <namePart type="family">Jain</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2021-08</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Alexis</namePart>
            <namePart type="family">Palmer</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Nathan</namePart>
            <namePart type="family">Schneider</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Natalie</namePart>
            <namePart type="family">Schluter</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Guy</namePart>
            <namePart type="family">Emerson</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Aurelie</namePart>
            <namePart type="family">Herbelot</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Xiaodan</namePart>
            <namePart type="family">Zhu</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Online</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>This paper describes our contribution to SemEval-2021 Task 5: Toxic Spans Detection. Our solution is built upon RoBERTa language model and Conditional Random Fields (CRF). We pre-trained RoBERTa on Civil Comments dataset, enabling it to create better contextual representation for this task. We also employed the semi-supervised learning technique of self-training, which allowed us to extend our training dataset. In addition to these, we also identified some pre-processing steps that significantly improved our F1 score. Our proposed system achieved a rank of 41 with an F1 score of 66.16%.</abstract>
    <identifier type="citekey">suman-jain-2021-astartwice</identifier>
    <identifier type="doi">10.18653/v1/2021.semeval-1.118</identifier>
    <location>
        <url>https://aclanthology.org/2021.semeval-1.118/</url>
    </location>
    <part>
        <date>2021-08</date>
        <extent unit="page">
            <start>875</start>
            <end>880</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T AStarTwice at SemEval-2021 Task 5: Toxic Span Detection Using RoBERTa-CRF, Domain Specific Pre-Training and Self-Training
%A Suman, Thakur Ashutosh
%A Jain, Abhinav
%Y Palmer, Alexis
%Y Schneider, Nathan
%Y Schluter, Natalie
%Y Emerson, Guy
%Y Herbelot, Aurelie
%Y Zhu, Xiaodan
%S Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
%D 2021
%8 August
%I Association for Computational Linguistics
%C Online
%F suman-jain-2021-astartwice
%X This paper describes our contribution to SemEval-2021 Task 5: Toxic Spans Detection. Our solution is built upon RoBERTa language model and Conditional Random Fields (CRF). We pre-trained RoBERTa on Civil Comments dataset, enabling it to create better contextual representation for this task. We also employed the semi-supervised learning technique of self-training, which allowed us to extend our training dataset. In addition to these, we also identified some pre-processing steps that significantly improved our F1 score. Our proposed system achieved a rank of 41 with an F1 score of 66.16%.
%R 10.18653/v1/2021.semeval-1.118
%U https://aclanthology.org/2021.semeval-1.118/
%U https://doi.org/10.18653/v1/2021.semeval-1.118
%P 875-880

Markdown (Informal)

[AStarTwice at SemEval-2021 Task 5: Toxic Span Detection Using RoBERTa-CRF, Domain Specific Pre-Training and Self-Training](https://aclanthology.org/2021.semeval-1.118/) (Suman & Jain, SemEval 2021)
