A Self-Training Approach for Short Text Clustering

Amir Hadifar; Lucas Sterckx; Thomas Demeester; Chris Develder

doi:10.18653/v1/W19-4322

Abstract

Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.
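The self-training idea in the abstract (cluster assignments used as supervision for the encoder) can be illustrated with a minimal NumPy sketch. The abstract does not give the formulas, so everything below is an assumption for illustration: a Student's t kernel for soft assignments, a sharpened target distribution, and a KL-divergence loss, in the style of common deep clustering methods. The function names, the toy embeddings `z`, and the fixed `centroids` are hypothetical; in a real system `z` would come from the encoder and the loss gradient would be backpropagated into its weights.

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Soft cluster assignments q[i, j] via a Student's t kernel (assumed choice)."""
    # Squared Euclidean distance between every embedding and every centroid.
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened targets: square q, divide by soft cluster size, renormalize rows."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q):
    """KL(P || Q) summed over all points; the self-training loss in this sketch."""
    return float(np.sum(p * np.log(p / q)))

# Toy "encoder outputs": two well-separated groups of 2-D embeddings (hypothetical data).
rng = np.random.default_rng(0)
z = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
centroids = np.array([[0.0, 0.0], [3.0, 3.0]])

q = soft_assign(z, centroids)      # current soft assignments
p = target_distribution(q)         # higher-confidence pseudo-labels
loss = kl_divergence(p, q)         # would be minimized w.r.t. encoder weights
print(q.argmax(axis=1), loss)
```

In this sketch the targets `p` act as pseudo-labels derived from the model's own assignments, which is what makes the procedure self-training; how the published method initializes centroids or schedules target updates is not stated in the abstract and is not assumed here.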

Anthology ID: W19-4322
Volume: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Month: August
Year: 2019
Address: Florence, Italy
Editors: Isabelle Augenstein, Spandana Gella, Sebastian Ruder, Katharina Kann, Burcu Can, Johannes Welbl, Alexis Conneau, Xiang Ren, Marek Rei
Venue: RepL4NLP
SIG: SIGREP
Publisher: Association for Computational Linguistics
Pages: 194–199
URL: https://aclanthology.org/W19-4322/
DOI: 10.18653/v1/W19-4322
Bibkey: hadifar-etal-2019-self
Cite (ACL): Amir Hadifar, Lucas Sterckx, Thomas Demeester, and Chris Develder. 2019. A Self-Training Approach for Short Text Clustering. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 194–199, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): A Self-Training Approach for Short Text Clustering (Hadifar et al., RepL4NLP 2019)
PDF: https://aclanthology.org/W19-4322.pdf

Export citation

BibTeX

@inproceedings{hadifar-etal-2019-self,
    title = "A Self-Training Approach for Short Text Clustering",
    author = "Hadifar, Amir  and
      Sterckx, Lucas  and
      Demeester, Thomas  and
      Develder, Chris",
    editor = "Augenstein, Isabelle  and
      Gella, Spandana  and
      Ruder, Sebastian  and
      Kann, Katharina  and
      Can, Burcu  and
      Welbl, Johannes  and
      Conneau, Alexis  and
      Ren, Xiang  and
      Rei, Marek",
    booktitle = "Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W19-4322/",
    doi = "10.18653/v1/W19-4322",
    pages = "194--199",
    abstract = "Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose, learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="hadifar-etal-2019-self">
    <titleInfo>
        <title>A Self-Training Approach for Short Text Clustering</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Amir</namePart>
        <namePart type="family">Hadifar</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Lucas</namePart>
        <namePart type="family">Sterckx</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Thomas</namePart>
        <namePart type="family">Demeester</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Chris</namePart>
        <namePart type="family">Develder</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2019-08</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Isabelle</namePart>
            <namePart type="family">Augenstein</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Spandana</namePart>
            <namePart type="family">Gella</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Sebastian</namePart>
            <namePart type="family">Ruder</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Katharina</namePart>
            <namePart type="family">Kann</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Burcu</namePart>
            <namePart type="family">Can</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Johannes</namePart>
            <namePart type="family">Welbl</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Alexis</namePart>
            <namePart type="family">Conneau</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Xiang</namePart>
            <namePart type="family">Ren</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Marek</namePart>
            <namePart type="family">Rei</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Florence, Italy</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose, learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.</abstract>
    <identifier type="citekey">hadifar-etal-2019-self</identifier>
    <identifier type="doi">10.18653/v1/W19-4322</identifier>
    <location>
        <url>https://aclanthology.org/W19-4322/</url>
    </location>
    <part>
        <date>2019-08</date>
        <extent unit="page">
            <start>194</start>
            <end>199</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T A Self-Training Approach for Short Text Clustering
%A Hadifar, Amir
%A Sterckx, Lucas
%A Demeester, Thomas
%A Develder, Chris
%Y Augenstein, Isabelle
%Y Gella, Spandana
%Y Ruder, Sebastian
%Y Kann, Katharina
%Y Can, Burcu
%Y Welbl, Johannes
%Y Conneau, Alexis
%Y Ren, Xiang
%Y Rei, Marek
%S Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
%D 2019
%8 August
%I Association for Computational Linguistics
%C Florence, Italy
%F hadifar-etal-2019-self
%X Short text clustering is a challenging problem when adopting traditional bag-of-words or TF-IDF representations, since these lead to sparse vector representations of the short texts. Low-dimensional continuous representations or embeddings can counter that sparseness problem: their high representational power is exploited in deep clustering algorithms. While deep clustering has been studied extensively in computer vision, relatively little work has focused on NLP. The method we propose, learns discriminative features from both an autoencoder and a sentence embedding, then uses assignments from a clustering algorithm as supervision to update weights of the encoder network. Experiments on three short text datasets empirically validate the effectiveness of our method.
%R 10.18653/v1/W19-4322
%U https://aclanthology.org/W19-4322/
%U https://doi.org/10.18653/v1/W19-4322
%P 194-199

Preformatted

Markdown (Informal)

[A Self-Training Approach for Short Text Clustering](https://aclanthology.org/W19-4322/) (Hadifar et al., RepL4NLP 2019)
