@inproceedings{krishna-etal-2017-dataset,
    title = "A Dataset for {S}anskrit Word Segmentation",
    author = "Krishna, Amrith  and
      Satuluri, Pavan Kumar  and
      Goyal, Pawan",
    editor = "Alex, Beatrice  and
      Degaetano-Ortlieb, Stefania  and
      Feldman, Anna  and
      Kazantseva, Anna  and
      Reiter, Nils  and
      Szpakowicz, Stan",
    booktitle = "Proceedings of the Joint {SIGHUM} Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature",
    month = aug,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W17-2214/",
    doi = "10.18653/v1/W17-2214",
    pages = "105--114",
    abstract = "The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="krishna-etal-2017-dataset">
    <titleInfo>
        <title>A Dataset for Sanskrit Word Segmentation</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Amrith</namePart>
        <namePart type="family">Krishna</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Pavan</namePart>
        <namePart type="given">Kumar</namePart>
        <namePart type="family">Satuluri</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Pawan</namePart>
        <namePart type="family">Goyal</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2017-08</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Beatrice</namePart>
            <namePart type="family">Alex</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Stefania</namePart>
            <namePart type="family">Degaetano-Ortlieb</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Anna</namePart>
            <namePart type="family">Feldman</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Anna</namePart>
            <namePart type="family">Kazantseva</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Nils</namePart>
            <namePart type="family">Reiter</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Stan</namePart>
            <namePart type="family">Szpakowicz</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>Association for Computational Linguistics</publisher>
            <place>
                <placeTerm type="text">Vancouver, Canada</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments.</abstract>
    <identifier type="citekey">krishna-etal-2017-dataset</identifier>
    <identifier type="doi">10.18653/v1/W17-2214</identifier>
    <location>
        <url>https://aclanthology.org/W17-2214/</url>
    </location>
    <part>
        <date>2017-08</date>
        <extent unit="page">
            <start>105</start>
            <end>114</end>
        </extent>
    </part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T A Dataset for Sanskrit Word Segmentation
%A Krishna, Amrith
%A Satuluri, Pavan Kumar
%A Goyal, Pawan
%Y Alex, Beatrice
%Y Degaetano-Ortlieb, Stefania
%Y Feldman, Anna
%Y Kazantseva, Anna
%Y Reiter, Nils
%Y Szpakowicz, Stan
%S Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
%D 2017
%8 August
%I Association for Computational Linguistics
%C Vancouver, Canada
%F krishna-etal-2017-dataset
%X The last decade saw a surge in digitisation efforts for ancient manuscripts in Sanskrit. Due to various linguistic peculiarities inherent to the language, even the preliminary tasks such as word segmentation are non-trivial in Sanskrit. Elegant models for Word Segmentation in Sanskrit are indispensable for further syntactic and semantic processing of the manuscripts. Current works in word segmentation for Sanskrit, though commendable in their novelty, often have variations in their objective and evaluation criteria. In this work, we set the record straight. We formally define the objectives and the requirements for the word segmentation task. In order to encourage research in the field and to alleviate the time and effort required in pre-processing, we release a dataset of 115,000 sentences for word segmentation. For each sentence in the dataset we include the input character sequence, ground truth segmentation, and additionally lexical and morphological information about all the phonetically possible segments for the given sentence. In this work, we also discuss the linguistic considerations made while generating the candidate space of the possible segments.
%R 10.18653/v1/W17-2214
%U https://aclanthology.org/W17-2214/
%U https://doi.org/10.18653/v1/W17-2214
%P 105-114
Markdown (Informal)
[A Dataset for Sanskrit Word Segmentation](https://aclanthology.org/W17-2214/) (Krishna et al., LaTeCH 2017)
ACL
- Amrith Krishna, Pavan Kumar Satuluri, and Pawan Goyal. 2017. A Dataset for Sanskrit Word Segmentation. In Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pages 105–114, Vancouver, Canada. Association for Computational Linguistics.