An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data

Lili Wang; Chongyang Gao; Jason Wei; Weicheng Ma; Ruibo Liu; Soroush Vosoughi

doi:10.18653/v1/2020.wnut-1.27

Abstract

The field of NLP has seen unprecedented achievements in recent years. Most notably, with the advent of large-scale pre-trained Transformer-based language models, such as BERT, there has been a noticeable improvement in text representation. It is, however, unclear whether these improvements translate to noisy user-generated text, such as tweets. In this paper, we present an experimental survey of a wide range of well-known text representation techniques for the task of text clustering on noisy Twitter data. Our results indicate that the more advanced models do not necessarily work best on tweets and that more exploration in this area is needed.

Anthology ID: 2020.wnut-1.27
Volume: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)
Month: November
Year: 2020
Address: Online
Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue: WNUT
Publisher: Association for Computational Linguistics
Pages: 209–214
URL: https://aclanthology.org/2020.wnut-1.27/
DOI: 10.18653/v1/2020.wnut-1.27
Bibkey: wang-etal-2020-empirical
Cite (ACL): Lili Wang, Chongyang Gao, Jason Wei, Weicheng Ma, Ruibo Liu, and Soroush Vosoughi. 2020. An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 209–214, Online. Association for Computational Linguistics.
Cite (Informal): An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data (Wang et al., WNUT 2020)
PDF: https://aclanthology.org/2020.wnut-1.27.pdf

BibTeX

@inproceedings{wang-etal-2020-empirical,
    title = "An Empirical Survey of Unsupervised Text Representation Methods on {T}witter Data",
    author = "Wang, Lili  and
      Gao, Chongyang  and
      Wei, Jason  and
      Ma, Weicheng  and
      Liu, Ruibo  and
      Vosoughi, Soroush",
    editor = "Xu, Wei  and
      Ritter, Alan  and
      Baldwin, Tim  and
      Rahimi, Afshin",
    booktitle = "Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wnut-1.27/",
    doi = "10.18653/v1/2020.wnut-1.27",
    pages = "209--214",
    abstract = "The field of NLP has seen unprecedented achievements in recent years. Most notably, with the advent of large-scale pre-trained Transformer-based language models, such as BERT, there has been a noticeable improvement in text representation. It is, however, unclear whether these improvements translate to noisy user-generated text, such as tweets. In this paper, we present an experimental survey of a wide range of well-known text representation techniques for the task of text clustering on noisy Twitter data. Our results indicate that the more advanced models do not necessarily work best on tweets and that more exploration in this area is needed."
}
