cEnTam: Creation and Validation of a New English-Tamil Bilingual Corpus

Sanjanasri JP, Premjith B, Vijay Krishna Menon, Soman KP

Abstract

Natural Language Processing (NLP) is the field of artificial intelligence that gives computers the ability to interpret, perceive, and extract appropriate information from human languages. Contemporary NLP is predominantly a data-driven process. It employs machine learning and statistical algorithms to learn language structures from text corpora. While NLP has been applied extensively to English, certain European languages such as Spanish and German, Chinese, and Arabic, this is not the case for many Indian languages. There are obvious advantages in creating aligned bilingual and multilingual corpora. Machine translation, cross-lingual information retrieval, content availability, and linguistic comparison are a few of the most sought-after applications of such parallel corpora. This paper explains and validates a parallel corpus we created for the English-Tamil language pair.

Anthology ID: 2020.bucc-1.10
Volume: Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Month: May
Year: 2020
Address: Marseille, France
Editors: Reinhard Rapp, Pierre Zweigenbaum, Serge Sharoff
Venue: BUCC
Publisher: European Language Resources Association
Pages: 61–64
Language: English
URL: https://aclanthology.org/2020.bucc-1.10/
Bibkey: jp-etal-2020-centam
Cite (ACL): Sanjanasri JP, Premjith B, Vijay Krishna Menon, and Soman KP. 2020. cEnTam: Creation and Validation of a New English-Tamil Bilingual Corpus. In Proceedings of the 13th Workshop on Building and Using Comparable Corpora, pages 61–64, Marseille, France. European Language Resources Association.
Cite (Informal): cEnTam: Creation and Validation of a New English-Tamil Bilingual Corpus (JP et al., BUCC 2020)
PDF: https://aclanthology.org/2020.bucc-1.10.pdf

Export citation

BibTeX

@inproceedings{jp-etal-2020-centam,
    title = "c{E}n{T}am: Creation and Validation of a New {E}nglish-{T}amil Bilingual Corpus",
    author = "JP, Sanjanasri  and
      B, Premjith  and
      Menon, Vijay Krishna  and
      KP, Soman",
    editor = "Rapp, Reinhard  and
      Zweigenbaum, Pierre  and
      Sharoff, Serge",
    booktitle = "Proceedings of the 13th Workshop on Building and Using Comparable Corpora",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.bucc-1.10/",
    pages = "61--64",
    language = "eng",
    ISBN = "979-10-95546-42-9",
    abstract = "Natural Language Processing (NLP), is the field of artificial intelligence that gives the computer the ability to interpret, perceive and extract appropriate information from human languages. Contemporary NLP is predominantly a data driven process. It employs machine learning and statistical algorithms to learn language structures from textual corpus. While application of NLP in English, certain European languages such as Spanish, German, etc. and Chinese, Arabic has been tremendous, it is not so, in many Indian languages. There are obvious advantages in creating aligned bilingual and multilingual corpora. Machine translation, cross-lingual information retrieval, content availability and linguistic comparison are a few of the most sought after applications of such parallel corpora. This paper explains and validates a parallel corpus we created for English-Tamil bilingual pair."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="jp-etal-2020-centam">
    <titleInfo>
        <title>cEnTam: Creation and Validation of a New English-Tamil Bilingual Corpus</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Sanjanasri</namePart>
        <namePart type="family">JP</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Premjith</namePart>
        <namePart type="family">B</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Vijay</namePart>
        <namePart type="given">Krishna</namePart>
        <namePart type="family">Menon</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Soman</namePart>
        <namePart type="family">KP</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2020-05</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <language>
        <languageTerm type="text">eng</languageTerm>
    </language>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the 13th Workshop on Building and Using Comparable Corpora</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Reinhard</namePart>
            <namePart type="family">Rapp</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Pierre</namePart>
            <namePart type="family">Zweigenbaum</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Serge</namePart>
            <namePart type="family">Sharoff</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>European Language Resources Association</publisher>
            <place>
                <placeTerm type="text">Marseille, France</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
        <identifier type="isbn">979-10-95546-42-9</identifier>
    </relatedItem>
    <abstract>Natural Language Processing (NLP), is the field of artificial intelligence that gives the computer the ability to interpret, perceive and extract appropriate information from human languages. Contemporary NLP is predominantly a data driven process. It employs machine learning and statistical algorithms to learn language structures from textual corpus. While application of NLP in English, certain European languages such as Spanish, German, etc. and Chinese, Arabic has been tremendous, it is not so, in many Indian languages. There are obvious advantages in creating aligned bilingual and multilingual corpora. Machine translation, cross-lingual information retrieval, content availability and linguistic comparison are a few of the most sought after applications of such parallel corpora. This paper explains and validates a parallel corpus we created for English-Tamil bilingual pair.</abstract>
    <identifier type="citekey">jp-etal-2020-centam</identifier>
    <location>
        <url>https://aclanthology.org/2020.bucc-1.10/</url>
    </location>
    <part>
        <date>2020-05</date>
        <extent unit="page">
            <start>61</start>
            <end>64</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Conference Proceedings
%T cEnTam: Creation and Validation of a New English-Tamil Bilingual Corpus
%A JP, Sanjanasri
%A B, Premjith
%A Menon, Vijay Krishna
%A KP, Soman
%Y Rapp, Reinhard
%Y Zweigenbaum, Pierre
%Y Sharoff, Serge
%S Proceedings of the 13th Workshop on Building and Using Comparable Corpora
%D 2020
%8 May
%I European Language Resources Association
%C Marseille, France
%@ 979-10-95546-42-9
%G eng
%F jp-etal-2020-centam
%X Natural Language Processing (NLP), is the field of artificial intelligence that gives the computer the ability to interpret, perceive and extract appropriate information from human languages. Contemporary NLP is predominantly a data driven process. It employs machine learning and statistical algorithms to learn language structures from textual corpus. While application of NLP in English, certain European languages such as Spanish, German, etc. and Chinese, Arabic has been tremendous, it is not so, in many Indian languages. There are obvious advantages in creating aligned bilingual and multilingual corpora. Machine translation, cross-lingual information retrieval, content availability and linguistic comparison are a few of the most sought after applications of such parallel corpora. This paper explains and validates a parallel corpus we created for English-Tamil bilingual pair.
%U https://aclanthology.org/2020.bucc-1.10/
%P 61-64

Markdown (Informal)

[cEnTam: Creation and Validation of a New English-Tamil Bilingual Corpus](https://aclanthology.org/2020.bucc-1.10/) (JP et al., BUCC 2020)