Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data

Arra’Di Nur Rizal; Sara Stymne

Abstract

Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.

Anthology ID: 2020.calcs-1.4
Volume: Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Month: May
Year: 2020
Address: Marseille, France
Editors: Thamar Solorio, Monojit Choudhury, Kalika Bali, Sunayana Sitaram, Amitava Das, Mona Diab
Venue: CALCS
Publisher: European Language Resources Association
Pages: 26–35
Language: English
URL: https://aclanthology.org/2020.calcs-1.4/
Bibkey: rizal-stymne-2020-evaluating
Cite (ACL): Arra’Di Nur Rizal and Sara Stymne. 2020. Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data. In Proceedings of the 4th Workshop on Computational Approaches to Code Switching, pages 26–35, Marseille, France. European Language Resources Association.
Cite (Informal): Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data (Rizal & Stymne, CALCS 2020)
PDF: https://aclanthology.org/2020.calcs-1.4.pdf

BibTeX

@inproceedings{rizal-stymne-2020-evaluating,
    title = "Evaluating Word Embeddings for {I}ndonesian{--}{E}nglish Code-Mixed Text Based on Synthetic Data",
    author = "Rizal, Arra{'}Di Nur  and
      Stymne, Sara",
    editor = "Solorio, Thamar  and
      Choudhury, Monojit  and
      Bali, Kalika  and
      Sitaram, Sunayana  and
      Das, Amitava  and
      Diab, Mona",
    booktitle = "Proceedings of the 4th Workshop on Computational Approaches to Code Switching",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.calcs-1.4/",
    pages = "26--35",
    language = "eng",
    ISBN = "979-10-95546-66-5",
    abstract = "Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian{--}English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="rizal-stymne-2020-evaluating">
    <titleInfo>
        <title>Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Arra’Di</namePart>
        <namePart type="given">Nur</namePart>
        <namePart type="family">Rizal</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Sara</namePart>
        <namePart type="family">Stymne</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2020-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <language>
        <languageTerm type="text">eng</languageTerm>
    </language>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 4th Workshop on Computational Approaches to Code Switching</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Thamar</namePart>
            <namePart type="family">Solorio</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Monojit</namePart>
            <namePart type="family">Choudhury</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Kalika</namePart>
            <namePart type="family">Bali</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Sunayana</namePart>
            <namePart type="family">Sitaram</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Amitava</namePart>
            <namePart type="family">Das</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Mona</namePart>
            <namePart type="family">Diab</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>European Language Resources Association</publisher>
            <place>
                <placeTerm type="text">Marseille, France</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-10-95546-66-5</identifier>
    </relatedItem>
    <abstract>Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.</abstract>
    <identifier type="citekey">rizal-stymne-2020-evaluating</identifier>
    <location>
        <url>https://aclanthology.org/2020.calcs-1.4/</url>
    </location>
    <part>
        <date>2020-05</date>
        <extent unit="page">
            <start>26</start>
            <end>35</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data
%A Rizal, Arra’Di Nur
%A Stymne, Sara
%Y Solorio, Thamar
%Y Choudhury, Monojit
%Y Bali, Kalika
%Y Sitaram, Sunayana
%Y Das, Amitava
%Y Diab, Mona
%S Proceedings of the 4th Workshop on Computational Approaches to Code Switching
%D 2020
%8 May
%I European Language Resources Association
%C Marseille, France
%@ 979-10-95546-66-5
%G eng
%F rizal-stymne-2020-evaluating
%X Code-mixed texts are abundant, especially in social media, and pose a problem for NLP tools, which are typically trained on monolingual corpora. In this paper, we explore and evaluate different types of word embeddings for Indonesian–English code-mixed text. We propose the use of code-mixed embeddings, i.e. embeddings trained on code-mixed text. Because large corpora of code-mixed text are required to train embeddings, we describe a method for synthesizing a code-mixed corpus, grounded in literature and a survey. Using sentiment analysis as a case study, we show that code-mixed embeddings trained on synthesized data are at least as good as cross-lingual embeddings and better than monolingual embeddings.
%U https://aclanthology.org/2020.calcs-1.4/
%P 26-35

Markdown (Informal)

[Evaluating Word Embeddings for Indonesian–English Code-Mixed Text Based on Synthetic Data](https://aclanthology.org/2020.calcs-1.4/) (Rizal & Stymne, CALCS 2020)