Sakriani Sakti


2021

NAIST English-to-Japanese Simultaneous Translation System for IWSLT 2021 Simultaneous Text-to-text Task
Ryo Fukuda | Yui Oka | Yasumasa Kano | Yuki Yano | Yuka Ko | Hirotaka Tokuyama | Kosuke Doi | Sakriani Sakti | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)

This paper describes NAIST’s system for the English-to-Japanese Simultaneous Text-to-text Translation Task in the IWSLT 2021 Evaluation Campaign. Our primary submission is based on wait-k neural machine translation with sequence-level knowledge distillation to encourage literal translation.

2020

Emotional Speech Corpus for Persuasive Dialogue System
Sara Asai | Koichiro Yoshino | Seitaro Shinagawa | Sakriani Sakti | Satoshi Nakamura
Proceedings of the 12th Language Resources and Evaluation Conference

Expressing emotion is known as an efficient way to persuade one’s dialogue partner to accept one’s claim or proposal. Emotional expression in speech can express the speaker’s emotion more directly than using only emotion expression in the text, which will lead to a more persuasive dialogue. In this paper, we built a speech dialogue corpus in a persuasive scenario that uses emotional expressions to build a persuasive dialogue system with emotional expressions. We extended an existing text dialogue corpus by adding variations of emotional responses to cover different combinations of broad dialogue context and a variety of emotional states by crowd-sourcing. Then, we recorded emotional speech consisting of the collected emotional expressions spoken by a voice actor. The experimental results indicate that the collected emotional expressions and their recorded speech have higher emotional expressiveness for conveying the system’s emotion to users.

Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Dorothee Beermann | Laurent Besacier | Sakriani Sakti | Claudia Soria
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Cross-Lingual Machine Speech Chain for Javanese, Sundanese, Balinese, and Bataks Speech Recognition and Synthesis
Sashi Novitasari | Andros Tjandra | Sakriani Sakti | Satoshi Nakamura
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

Even though over seven hundred ethnic languages are spoken in Indonesia, the technology available to support communication within indigenous communities, as well as with people outside the villages, remains limited. As a result, indigenous communities still face isolation due to cultural barriers, and languages continue to disappear. To accelerate communication, speech-to-speech translation (S2ST) technology is one approach that can overcome language barriers. However, S2ST systems require machine translation (MT), speech recognition (ASR), and synthesis (TTS) that rely heavily on supervised training and a broad set of language resources that can be difficult to collect from ethnic communities. Recently, a machine speech chain mechanism was proposed to enable ASR and TTS to assist each other in semi-supervised learning. The framework was initially implemented only in a monolingual setting. In this study, we focus on developing speech recognition and synthesis for these Indonesian ethnic languages: Javanese, Sundanese, Balinese, and Bataks. We first separately train ASR and TTS of standard Indonesian in supervised training. We then develop ASR and TTS of ethnic languages by utilizing Indonesian ASR and TTS in a cross-lingual machine speech chain framework with only text or only speech data, removing the need for paired speech-text data of those ethnic languages.

2018

Dialogue Scenario Collection of Persuasive Dialogue with Emotional Expressions via Crowdsourcing
Koichiro Yoshino | Yoko Ishikawa | Masahiro Mizukami | Yu Suzuki | Sakriani Sakti | Satoshi Nakamura
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Construction of English-French Multimodal Affective Conversational Corpus from TV Dramas
Sashi Novitasari | Quoc Truong Do | Sakriani Sakti | Dessi Lestari | Satoshi Nakamura
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Unsupervised Counselor Dialogue Clustering for Positive Emotion Elicitation in Neural Dialogue System
Nurul Lubis | Sakriani Sakti | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

Positive emotion elicitation seeks to improve the user’s emotional state through dialogue system interaction, where a chat-based scenario is layered with an implicit goal to address the user’s emotional needs. Standard neural dialogue system approaches still fall short in this situation as they tend to generate only short, generic responses. Learning from expert actions is critical, as these potentially differ from standard dialogue acts. In this paper, we propose using a hierarchical neural network for response generation that is conditioned on 1) expert’s action, 2) dialogue context, and 3) user emotion, encoded from user input. We construct a corpus of interactions between a counselor and 30 participants following a negative emotional exposure to learn expert actions and responses in a positive emotion elicitation scenario. Instead of relying on expensive, labor-intensive, and often ambiguous human annotations, we cluster the expert’s responses in an unsupervised manner and use the resulting labels to train the network. Our experiments and evaluation show that the proposed approach yields lower perplexity and generates a larger variety of responses.

2017

Local Monotonic Attention Mechanism for End-to-End Speech And Language Processing
Andros Tjandra | Sakriani Sakti | Satoshi Nakamura
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recently, encoder-decoder neural networks have shown impressive performance on many sequence-related tasks. The architecture commonly uses an attentional mechanism which allows the model to learn alignments between the source and the target sequence. Most attentional mechanisms used today are based on a global attention property which requires a computation of a weighted summarization of the whole input sequence generated by encoder states. However, it is computationally expensive and often produces misalignment on longer input sequences. Furthermore, it does not fit with the monotonic or left-to-right nature of several tasks, such as automatic speech recognition (ASR), grapheme-to-phoneme (G2P), etc. In this paper, we propose a novel attention mechanism that has local and monotonic properties. Various ways to control those properties are also explored. Experimental results on ASR, G2P and machine translation between two languages with similar sentence structures demonstrate that the proposed encoder-decoder model with local monotonic attention could achieve significant performance improvements and reduce the computational complexity in comparison with the one that used the standard global attention architecture.

2016

Construction of Japanese Audio-Visual Emotion Database and Its Application in Emotion Recognition
Nurul Lubis | Randy Gomez | Sakriani Sakti | Keisuke Nakamura | Koichiro Yoshino | Satoshi Nakamura | Kazuhiro Nakadai
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Emotional aspects play a vital role in making human communication a rich and dynamic experience. As we introduce more automated systems in our daily lives, it becomes increasingly important to incorporate emotion to provide as natural an interaction as possible. To achieve said incorporation, rich sets of labeled emotional data are a prerequisite. However, in Japanese, existing emotion databases are still limited to unimodal and bimodal corpora. Since emotion is not only expressed through speech, but also visually at the same time, it is essential to include multiple modalities in an observation. In this paper, we present the first audio-visual emotion corpus in Japanese, collected from 14 native speakers. The corpus contains 100 minutes of annotated and transcribed material. We performed preliminary emotion recognition experiments on the corpus and achieved an accuracy of 61.42% for five classes of emotion.

2015

An Investigation of Machine Translation Evaluation Metrics in Cross-lingual Question Answering
Kyoshiro Sugiyama | Masahiro Mizukami | Graham Neubig | Koichiro Yoshino | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the Tenth Workshop on Statistical Machine Translation

The NAIST English speech recognition system for IWSLT 2015
Michael Heck | Quoc Truong Do | Sakriani Sakti | Graham Neubig | Satoshi Nakamura
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign

Improving translation of emphasis with pause prediction in speech-to-speech translation systems
Quoc Truong Do | Sakriani Sakti | Graham Neubig | Tomoki Toda | Satoshi Nakamura
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

Syntax-based Simultaneous Translation through Prediction of Unseen Syntactic Constituents
Yusuke Oda | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Improving Pivot Translation by Remembering the Pivot
Akiva Miura | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Ckylark: A More Robust PCFG-LA Parser
Yusuke Oda | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

Semantic Parsing of Ambiguous Input through Paraphrasing and Verification
Philip Arthur | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Transactions of the Association for Computational Linguistics, Volume 3

We propose a new method for semantic parsing of ambiguous and ungrammatical input, such as search queries. We do so by building on an existing semantic parsing framework that uses synchronous context free grammars (SCFG) to jointly model the input sentence and output meaning representation. We generalize this SCFG framework to allow not one, but multiple outputs. Using this formalism, we construct a grammar that takes an ambiguous input string and jointly maps it into both a meaning representation and a natural language paraphrase that is less ambiguous than the original input. This paraphrase can be used to disambiguate the meaning representation via verification using a language model that calculates the probability of each paraphrase.

2014

Acquiring a Dictionary of Emotion-Provoking Events
Hoa Trong Vu | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

Linguistic and Acoustic Features for Automatic Identification of Autism Spectrum Disorders in Children’s Narrative
Hiroki Tanaka | Sakriani Sakti | Graham Neubig | Tomoki Toda | Satoshi Nakamura
Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Rule-based Syntactic Preprocessing for Syntax-based Machine Translation
Yuto Hatakoshi | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation

Discriminative Language Models as a Tool for Machine Translation Error Analysis
Koichi Akabe | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Reinforcement Learning of Cooperative Persuasive Dialogue Policies using Framing
Takuya Hiraoka | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

Collection of a Simultaneous Translation Corpus for Comparative Analysis
Hiroaki Shimizu | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the collection of an English-Japanese/Japanese-English simultaneous interpretation corpus. There are two main features of the corpus. The first is that professional simultaneous interpreters with different amounts of experience cooperated in the collection. By comparing data from simultaneous interpretation of each interpreter, it is possible to compare better interpretations to those that are not as good. The second is that for part of our corpus there are already translation data available. This makes it possible to compare translation data with simultaneous interpretation data. We recorded the interpretations of lectures and news, and created time-aligned transcriptions. A total of 387k words of transcribed data were collected. The corpus will be helpful to analyze differences in interpretation styles and to construct simultaneous interpretation systems.

Towards Multilingual Conversations in the Medical Domain: Development of Multilingual Medical Data and A Network-based ASR System
Sakriani Sakti | Keigo Kubo | Sho Matsumiya | Graham Neubig | Tomoki Toda | Satoshi Nakamura | Fumihiro Adachi | Ryosuke Isotani
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper outlines recent developments in multilingual medical data and a multilingual speech recognition system for network-based speech-to-speech translation in the medical domain. The overall speech-to-speech translation (S2ST) system was designed to translate spoken utterances from a given source language into a target language in order to facilitate multilingual conversations and reduce the problems caused by language barriers in medical situations. Our final system utilizes weighted finite-state transducers with n-gram language models. Currently, the system successfully covers three languages: Japanese, English, and Chinese. The difficulties involved in connecting Japanese, English and Chinese speech recognition systems through Web servers will be discussed, and the experimental results in simulated medical conversation will also be presented.

Optimizing Segmentation Strategies for Simultaneous Speech Translation
Yusuke Oda | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

Towards High-Reliability Speech Translation in the Medical Domain
Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura | Yuji Matsumoto | Ryosuke Isotani | Yukichi Ikeda
The First Workshop on Natural Language Processing for Medical and Healthcare Fields

The NAIST English speech recognition system for IWSLT 2013
Sakriani Sakti | Keigo Kubo | Graham Neubig | Tomoki Toda | Satoshi Nakamura
Proceedings of the 10th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the NAIST English speech recognition system for the IWSLT 2013 Evaluation Campaign. In particular, we participated in the ASR track of the IWSLT TED task. Last year, we participated in collaboration with Karlsruhe Institute of Technology (KIT). This year is the first time we have built a full-fledged ASR system for IWSLT developed solely by NAIST. Our final system utilizes weighted finite-state transducers with four-gram language models. The hypothesis selection is based on the principle of system combination. On the IWSLT official test set, our system introduced in this work achieves a WER of 9.1% for tst2011, 10.0% for tst2012, and 16.2% for the new tst2013.

Constructing a speech translation system using simultaneous interpretation data
Hiroaki Shimizu | Graham Neubig | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

There has been a fair amount of work on automatic speech translation systems that translate in real-time, serving as a computerized version of a simultaneous interpreter. It has been noticed in the field of translation studies that simultaneous interpreters perform a number of tricks to make the content easier to understand in real-time, including dividing their translations into small chunks, or summarizing less important content. However, the majority of previous work has not specifically considered this fact, simply using translation data (made by translators) for learning of the machine translation system. In this paper, we examine the possibilities of additionally incorporating simultaneous interpretation data (made by simultaneous interpreters) in the learning process. First we collect simultaneous interpretation data from professional simultaneous interpreters of three levels, and perform an analysis of the data. Next, we incorporate the simultaneous interpretation data in the learning of the machine translation system. As a result, the translation style of the system becomes more similar to that of a highly experienced simultaneous interpreter. We also find that according to automatic evaluation metrics, our system achieves performance similar to that of a simultaneous interpreter that has 1 year of experience.

Incremental unsupervised training for university lecture recognition
Michael Heck | Sebastian Stüker | Sakriani Sakti | Alex Waibel | Satoshi Nakamura
Proceedings of the 10th International Workshop on Spoken Language Translation: Papers

In this paper we describe our work on unsupervised adaptation of the acoustic model of our simultaneous lecture translation system. We trained a speaker independent acoustic model, with which we produce automatic transcriptions of new lectures in order to improve the system for a specific lecturer. We compare our results against a model that was trained in a supervised way on an exact manual transcription. We examine four different ways of processing the decoder outputs of the automatic transcription with respect to the treatment of pronunciation variants and noise words. We will show that, instead of fixing the latter information in the transcriptions, it is advantageous to let the Viterbi algorithm decide during training which pronunciations to use and where to insert which noise words. Further, we utilize word level posterior probabilities obtained during decoding by weighting and thresholding the words of a transcription.

2012

The NAIST machine translation system for IWSLT2012
Graham Neubig | Kevin Duh | Masaya Ogushi | Takamoto Kano | Tetsuo Kiso | Sakriani Sakti | Tomoki Toda | Satoshi Nakamura
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the NAIST statistical machine translation system for the IWSLT2012 Evaluation Campaign. We participated in all TED Talk tasks, for a total of 11 language-pairs. For all tasks, we use the Moses phrase-based decoder and its experiment management system as a common base for building translation systems. The focus of our work is on performing a comprehensive comparison of a multitude of existing techniques for the TED task, exploring issues such as out-of-domain data filtering, minimum Bayes risk decoding, MERT vs. PRO tuning, word alignment combination, and morphology.

The KIT-NAIST (contrastive) English ASR system for IWSLT 2012
Michael Heck | Keigo Kubo | Matthias Sperber | Sakriani Sakti | Sebastian Stüker | Christian Saam | Kevin Kilgour | Christian Mohr | Graham Neubig | Tomoki Toda | Satoshi Nakamura | Alex Waibel
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the KIT-NAIST (Contrastive) English speech recognition system for the IWSLT 2012 Evaluation Campaign. In particular, we participated in the ASR track of the IWSLT TED task. The system was developed by Karlsruhe Institute of Technology (KIT) and Nara Institute of Science and Technology (NAIST) teams in collaboration within the interACT project. We employ single system decoding with fully continuous and semi-continuous models, as well as a three-stage, multipass system combination framework built with the Janus Recognition Toolkit. On the IWSLT 2010 test set our single system introduced in this work achieves a WER of 17.6%, and our final combination achieves a WER of 14.4%.

A method for translation of paralinguistic information
Takatomo Kano | Sakriani Sakti | Shinnosuke Takamichi | Graham Neubig | Tomoki Toda | Satoshi Nakamura
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers

This paper is concerned with speech-to-speech translation that is sensitive to paralinguistic information. From the many different possible paralinguistic features to handle, in this paper we chose duration and power as a first step, proposing a method that can translate these features from input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training a regression model that maps source language duration and power information into the target language. We evaluate the proposed method on a digit translation task and show that paralinguistic information in input speech appears in output speech, and that this information can be used by target language speakers to detect emphasis.

2009

Network-based speech-to-speech translation
Chiori Hori | Sakriani Sakti | Michael Paul | Noriyuki Kimura | Yutaka Ashikari | Ryosuke Isotani | Eiichiro Sumita | Satoshi Nakamura
Proceedings of the 6th International Workshop on Spoken Language Translation: Papers

This demo shows the network-based speech-to-speech translation system. The system was designed to perform real-time, location-free, multi-party translation between speakers of different languages. The spoken language modules, automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS), are connected through Web servers that can be accessed via client applications worldwide. In this demo, we will show the multiparty speech-to-speech translation of Japanese, Chinese, Indonesian, Vietnamese, and English, provided by the NICT server. These speech-to-speech modules have been developed by NICT as a part of the A-STAR (Asian Speech Translation Advanced Research) consortium project.

2008

Development of Indonesian Large Vocabulary Continuous Speech Recognition System within A-STAR Project
Sakriani Sakti | Eka Kelana | Hammam Riza | Shinsuke Sakai | Konstantin Markov | Satoshi Nakamura
Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST)