Transcribing Vocal Communications of Domestic Shiba Inu Dogs

How animals communicate, and whether they have languages, is a persistent curiosity of human beings. However, the study of animal communication has largely been restricted to data from field recordings or controlled environments, which are expensive to collect and limited in scale and variety. In this paper, we take domestic Shiba Inu dogs as an example and extract their vocal communications from a large number of YouTube videos of Shiba Inu dogs. We classify these clips into different scenarios and locations, and further transcribe the audio into phonetically symbolic scripts through a systematic process. We discover consistent phonetic symbols among their expressions, which indicates that Shiba Inu dogs can have systematic verbal communication patterns. This reusable framework produces the first-of-its-kind Shiba Inu vocal communication dataset, which will be valuable to future research in both zoology and linguistics.


Introduction
It has long been an interesting interdisciplinary scientific challenge to understand the languages of animals (Hockett, 1959; Radick, 2007; Von Glasersfeld, 1974). Dogs, who are arguably the best friends of humans, have drawn particular attention. Learning what dogs want to express has broad and profound significance, such as in understanding biological evolution (Pongrácz, 2017), in applying their languages to information technology, or sometimes simply in satisfying our curiosity.
Vocal expressions of dogs, being their chief means of communication, have been studied previously. Here we define vocal expressions as all the sounds that a dog can make vocally, including bark, whine, whimper, howl, huff, growl, yelp, and yip. It has been shown that dogs can recognize scenes and express their understanding of the outer world, as well as their inner states, through their voices (Molnár et al., 2008; Hantke et al., 2018). Previous works have two main limitations. On the one hand, previous research treats this task as a simple classification problem: an audio segment containing barks is sent directly into a model to obtain a particular label such as an emotion (happy or sad). Although these results show that dogs have consistent sound patterns for different purposes, they provide little insight into whether dogs have structural languages; the potential linguistic patterns behind dogs' vocal expressions are largely ignored. On the other hand, previous datasets were collected by recording the voices of dogs in controlled environments. Such a methodology is costly in practice, and the data thus produced is limited in size and variety (as we will show later in Table 1). This makes it hard to infer latent linguistic patterns, and the patterns and semantic meanings of environments not covered by these datasets cannot be investigated either.

Figure 1: We aim at matching the barks of dogs with their semantic meanings. In our approach, the barks of dogs are transcribed into symbols.
Even though it is still highly debatable whether animals, or dogs in this case, have languages at all, in this paper we present a pipeline that treats dog sounds as a kind of language, similar to human languages. During this process (Figure 1), the specific patterns found in their vocal expressions imply that their barking sounds can carry corresponding semantic meanings, just as humans use fixed sound patterns to signify. In this paper, we present a dataset of phonetically symbolic transcripts of Shiba Inu dog barks called ShibaScript, which ameliorates some of the aforementioned challenges. We pick Shiba Inu as the subject because it is a popular breed around the world and there are a large number of Shiba Inu videos on the web. Meanwhile, we provide a preliminary phonetic analysis of this dataset. We believe this work is a first step toward investigating whether dogs have a sound-actuated language just as humans have speech.
ShibaScript contains barks from 16 different Shiba Inu dogs, along with corresponding transcripts with timestamps of their barks, among which consistent sound patterns are found. These 16 dogs come from 16 families who post their dogs' videos on YouTube. The dataset has a total length of over 4 hours of pure dog sound production, 4469 sentences, and 7761 words, with 9 distinct syllables appearing in the transcripts. Note that, due to the ever-evolving nature of social media, the dataset-construction methodology we propose in this paper can be applied to YouTube continuously and yields a dataset that grows in size and variety. We believe this dataset will help future research on canine communication, as well as any general audience interested in learning what dogs want to express.
Our contributions lie in three aspects: 1. we introduce a reusable framework for transcribing animal voices from social media such as YouTube; the framework is the first to assign phonetic symbols to dog barks and to describe dogs' vocal communication in a formal way; 2. we release a novel Shiba Inu voice transcription dataset, the first of its kind in the CL community; 3. we present preliminary statistical findings from this dataset: 9 consistent phonetic symbols are discovered, with phonemes, words, and sentences all present, and the consistent sound patterns found across these dogs reveal that dogs may have structural vocal communication patterns.

Approach
We now describe the method of constructing ShibaScript. To collect clean Shiba Inu barks and endow them with corresponding transcripts, a six-step process is used. These steps, in sequence, are: getting videos related to Shiba Inu dogs, extracting barks as "sentences", removing barks with noise, extracting barks as "words", separating syllables, and clustering to assign appropriate phonemes based on acoustic features.

Collecting Data
In this work, we aim at investigating the language patterns of Shiba Inu dogs. Previous works that endeavor to understand dog language patterns (Ide et al., 2021; Ehsani et al., 2018; Molnár et al., 2008; Hantke et al., 2018) conduct experiments on datasets (Table 1) that have limited sizes and scenes. Their usual approach is to take several dogs and record their barks when the dogs are put into the context of different events and various kinds of places. The disadvantages of this method are three-fold. First, the number of dogs is limited by the budget and practical conditions of these experiments. Second, such an approach can only include several "typical scenarios", and it is almost impossible to cover all of the situations that dogs might experience in their daily lives. Third, a field study like this is costly in terms of people, equipment, and time, so it is hard to transfer the research to other animal species.
To solve these problems, we make use of the abundant resources available on online social media. Each year, millions of videos are uploaded to YouTube, the largest video-sharing site in the world. These include large numbers of Shiba Inu videos of different scenes, uploaded by those who keep them. There are even people who set up an account specifically for their dogs and upload hundreds or thousands of videos. Collecting data from such Shiba Inu enthusiasts can substantially enlarge the number of dogs, cover more scenes, and reduce the cost. Most importantly, researchers can adapt this methodology to other dog breeds or even other animal species, which means this approach is highly reusable.

Name | Type | # of Dogs | Scenes | Activities | Size
Full Dataset (Ide et al., 2021) | video, audio, sensor | - | simulated disaster sites | - | 2825s
DECADE (Ehsani et al., 2018) | video, audio, sensor | 1 | indoors and outdoors | - | 4864s
Unknown (Molnár et al., 2008) | audio | 14 | mostly indoors, street | - | 6,646 barks
EmoDog (Hantke et al., 2018) | audio | 12 | 7 fixed types | - | 9,447s
ShibaScript | audio, link | 16 | 37* | 44* | 14,702s

*: The number of scenes and activities in ShibaScript is not fixed and can be expanded as the dataset is continuously collected.

Table 1: Dog-voice data sources used previously. Existing datasets were collected by manual recording. The first two contain videos of various lengths, while the latter two contain a certain amount of pure barks with pauses.
We select 16 users who have uploaded plentiful Shiba Inu videos under relatively good recording conditions. These videos form our raw data.

Extracting Sentences
What we care about, and label transcripts for, are the moments when dogs make any vocal expression. Similar to human speech, it is possible to define a sentence in the sound system of dog expressions, as follows: in a sentence, dogs bark continuously at the granularity of seconds. Barks here denote the sounds dogs generate through vocal cord vibration.
In the videos we obtain from different YouTube users, there are many irrelevant and silent frames in which the focus of the video is not on dogs or the dog in the current frame is not barking.
To extract the video clips containing vocal expressions of dogs, we use PANNs (Kong et al., 2020), a pretrained large sound event detection model covering as many as 527 sound classes, which outputs audio tagging results as well as events' onset and offset timestamps. Frames tagged with "bark" among the top 10 results are considered to contain barks. We manually labeled 300 samples and compared them with the PANNs output, observing a precision of 0.92.
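The frame-filtering step above can be sketched as follows. This is a minimal illustration, not the authors' code: we assume framewise class probabilities are already available (e.g. from a sound event detection model such as PANNs), and the array names, label list, and helper functions are our own.

```python
import numpy as np

def bark_frames(framewise_probs, labels, top_k=10, target="Bark"):
    """Mark frames whose top-k predicted classes include the target label.

    framewise_probs: (num_frames, num_classes) array of class probabilities,
    e.g. the framewise output of a 527-class sound event detection model.
    """
    target_idx = labels.index(target)
    # Indices of the top-k classes per frame (order within the top-k is irrelevant).
    topk = np.argsort(framewise_probs, axis=1)[:, -top_k:]
    return (topk == target_idx).any(axis=1)

def frames_to_segments(mask):
    """Group consecutive True frames into (start, end) frame-index pairs."""
    segments, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        elif not m and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments
```

The contiguous bark segments returned by `frames_to_segments` correspond to the candidate "sentences" extracted in this step.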

Removing Noises
In constructing a dataset, there is an apparent advantage to recording the audio of dogs in person: the background noise and the conditions of the recording device can be better controlled. In this work, since we pursue better coverage and use resources from public social media, the problem of noise in the audio samples is inevitable.
To generate the scripts and statistical results more accurately, we have tried our best to produce clean dog bark samples in two ways: first, in Section 2.1, we selected users who uploaded videos with less noise and better recording conditions; second, we use the following approach to substantially remove the noise from our data.
From manual sampling, we find that the majority of the noise comes from either background music that the user edited into the video or humans talking while the dog was barking. To remove this kind of noise, we again make use of the PANNs results. Frames tagged with "speech" or "music" among the top 10 results are considered noisy frames, and sentences that contain noisy frames are filtered out.
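The sentence-level noise filter can be sketched in the same style as the bark detector. Again, this is a sketch under our own naming assumptions; the tagger call itself is elided and only the filtering rule is shown.

```python
import numpy as np

def is_clean_sentence(framewise_probs, labels, top_k=10,
                      noise_labels=("Speech", "Music")):
    """Return True if no frame of the sentence has a noise class in its top-k tags.

    framewise_probs: (num_frames, num_classes) class probabilities for one
    sentence, as produced by an audio tagger over the sentence's frames.
    """
    noise_idx = [labels.index(l) for l in noise_labels]
    # Top-k class indices per frame; a single noisy frame disqualifies the sentence.
    topk = np.argsort(framewise_probs, axis=1)[:, -top_k:]
    return not np.isin(topk, noise_idx).any()
```

Sentences failing this check are dropped, trading some recall for cleaner transcripts.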

Extracting Words
In the vocal expressions of dogs, there are mainly long pauses and short pauses. A long audio sample can be divided into several sentences with long pauses in between; a sentence can be further divided into several words with short pauses in between, as in Figure 2. We can define "words" in dog expressions statistically: in a word, dogs bark continuously at a sub-second granularity. As mentioned in Section 2.2, the pretrained model PANNs (Kong et al., 2020) performs well on the task of sound event detection. Besides the small-grid pauses, there may also be some noise that failed to be filtered in the previous step. To eliminate such small-grid pauses and noise, here we directly detect the "barking" event within the sentences and do the word-level splitting based on it. In Hershey et al. (2021), the authors picked out a subset of audios from the original AudioSet (Gemmeke et al., 2017) and assigned "strong" labels to them (about 0.1 s resolution). The strongly labeled subset of AudioSet results in improved model performance.
We first trained a model initialized from PANNs for sound event detection on the strongly labeled subset of AudioSet. Then, to extract words from sentences, we annotated strong labels for the "barking" event on 246 sentences with a total length of 715 seconds using the phonetic analysis tool Praat (Boersma and Van Heuven, 2001), and finetuned the pretrained model. As shown in Figure 3, the finetuned model is used to detect the "barking" event, and based on the onset and offset of the event, we extract words from sentences and eliminate the short pauses.
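The onset/offset-based word extraction can be sketched as follows, assuming the finetuned model yields a framewise "barking" probability curve. The threshold, hop size, and merging rule are illustrative assumptions, not values from the paper.

```python
import numpy as np

def extract_words(framewise_bark_prob, frame_hop_s, threshold=0.5, min_gap_s=0.0):
    """Turn a framewise "barking" probability curve into word (onset, offset) times.

    framewise_bark_prob: 1-D array, one probability per frame from the SED model;
    frame_hop_s: seconds per frame hop.
    """
    active = framewise_bark_prob >= threshold
    words, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            words.append((start * frame_hop_s, i * frame_hop_s))
            start = None
    if start is not None:
        words.append((start * frame_hop_s, len(active) * frame_hop_s))
    # Optionally merge words separated by pauses shorter than min_gap_s,
    # so that spurious single-frame dropouts do not split one word in two.
    merged = []
    for seg in words:
        if merged and seg[0] - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

Each returned (onset, offset) pair is one "word"; the gaps between pairs are the short pauses that get dropped from the transcript.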

Separating Syllables
In human speech, the minimal unit is the phoneme, which builds syllables and words, from which sentences are formed according to grammatical rules. We retain this setting in exploring dog language and define their barking sounds from the minimal unit, phonetic symbols (Rohrmeier et al., 2015). However, as dogs have different articulatory anatomy from humans, the sounds can be vastly different. We attempt to label dog sound excerpts with the International Phonetic Alphabet (IPA).
Räsänen et al. (2018) show that syllabification is possible even when no prior linguistic knowledge exists: speech can be segmented into syllable-like units by using sonority to locate the edges of syllables (Figure 4). Considering that the dog voices at hand come without any known language patterns, we adopt this method to separate the syllables within a word.
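The sonority-based splitting idea can be sketched with a crude energy envelope as the sonority proxy: envelope peaks act as syllable nuclei, and the valley between two consecutive nuclei marks a syllable edge. The window size and threshold are our own illustrative assumptions, and a real implementation would follow the oscillator-based method of Räsänen et al. (2018) rather than this simplification.

```python
import numpy as np

def envelope(signal, win=256):
    """Crude sonority proxy: short-time energy with a sliding window."""
    energy = signal.astype(float) ** 2
    kernel = np.ones(win) / win
    return np.convolve(energy, kernel, mode="same")

def syllable_boundaries(env, peak_thresh):
    """Split at the minimum of the envelope between consecutive peaks.

    Local maxima above peak_thresh are taken as syllable nuclei; the valley
    between two nuclei marks the edge between two syllables.
    """
    peaks = [i for i in range(1, len(env) - 1)
             if env[i] >= peak_thresh and env[i] >= env[i - 1] and env[i] > env[i + 1]]
    bounds = []
    for a, b in zip(peaks, peaks[1:]):
        bounds.append(a + int(np.argmin(env[a:b])))
    return bounds
```

A word with two envelope peaks thus yields one boundary and two syllable segments, matching the four-syllable example of Figure 4 when four nuclei are present.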

Clustering and Phonemes Assignment
Given all these syllables, and the assumption that dogs have a distinct system of syllables, we can perform clustering and matching to find a shared alphabet for Shiba Inu dogs. As the 16 dogs differ in sex, age, and physical condition, we conduct Spectral Clustering (Von Luxburg, 2007) on the syllables of each dog separately. The feature we use is the filterbank (Strang and Nguyen, 1996). Generally, we set the number of clusters according to the number of videos of each dog, from 10 to 20 (the more videos, the more clusters). The clustering results after dimensionality reduction can be seen in Figure 5. After clustering, we find that, compared to human languages, dogs have fewer phonetic categories, which is understandable because humans have a more complex vocal system. Aggregating all the clustering results, we refer to the IPA for illustration and find nine consistent syllables (Table 2). With the syllable dictionary in place, we can then recover word transcripts with short pauses, sentence transcripts with long pauses, and full audio transcripts with pauses. A typical symbolic transcript of one audio sentence is shown in Figure 6.
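The per-dog clustering step can be sketched with scikit-learn's SpectralClustering. The feature extraction is elided: we assume each syllable has already been reduced to a fixed-length vector (e.g. mean filterbank energies), and the specific parameters below are illustrative defaults rather than the paper's settings.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_syllables(features, n_clusters):
    """Cluster fixed-length syllable feature vectors into candidate
    phonetic categories for one dog.

    features: (num_syllables, feature_dim) array, one row per syllable.
    """
    model = SpectralClustering(
        n_clusters=n_clusters,
        affinity="rbf",        # Gaussian similarity between syllable vectors
        assign_labels="kmeans",
        random_state=0,
    )
    return model.fit_predict(features)
```

Running this once per dog, with n_clusters chosen per the number of videos (10 to 20), gives the per-dog clusters that are then matched across dogs against the IPA.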

Data Scale
With the hierarchical structure of audios, sentences, words, and syllables, we have given each Shiba Inu bark a symbolic transcript. The distributions at each tier are shown in Table 3. As the videos are obtained from the open public platform YouTube, they contain a large excess of unlabeled fragments in which the dog does not bark or there is noise such as human speech and background music. What concerns us more are the barking fragments, that is, the "sentences"; we can take the total length of sentences as the effective length of our dataset. At the same time, because we obtain our data from YouTube, the dataset can grow over time as more users upload videos.

Data Variety
Shiba Inu is a very common and lively breed of dog; many people like them and keep them as pets. Those owners live with their dogs and record their daily lives in videos. As the dataset ShibaScript is transcribed from audio extracted from everyday videos on YouTube, the dogs may appear in a variety of common and even uncommon scenes rather than a limited set, and they may be engaged in many activities. Therefore, ShibaScript covers a very diverse set of scenes and activities: 37 different scenes and 44 different activities. What's more, unlike other datasets that record audio in fixed scenes or manually, the scenes and activities covered by ShibaScript can be expanded as the dataset is continuously collected.

Figure 6: The script of the sentence in the introduction, containing the id of this sentence, the source audio id, the time of this sentence in the audio, and the 5 words and their information in this sentence. Each word in the "transcript" is split by ";".
Figure 7 shows the scenes and activities covered by ShibaScript. There is a subset of activities that appears in the vast majority of users' videos. For pet dogs, daily activities such as walking, running, and sleeping are essential and common, and their owners often record them, so these activities are covered by most users. The same holds for the statistics of scenes: common scenes of daily life such as "quilt", "road", "bedroom", and "dog bowl" appear in the vast majority of users' videos. Benefiting from the large number of videos used to transcribe the dataset, ShibaScript covers the vast majority of everyday scenes and activities.
Besides, some activities and scenes appear only rarely in the statistics; these are shown as "others" in Figure 7. There are two possible reasons why an activity or scene appears infrequently. First, it may be related to the personal circumstances of the user. For example, a dog has to wear a cone collar to prevent it from licking a wound, so the activity "wear a cone collar" appears only when the dog has had surgery, which is not a common event. Second, users have different shooting habits, and a user may only record videos in certain scenes or activities. For example, some users only take indoor videos, so some outdoor activities and scenes like "dig sand" and "beach" cannot be covered in their videos, even if the dog actually participated in them. These activities and scenes with personal characteristics greatly expand the diversity of ShibaScript, allowing it to cover some non-daily activities and scenes. Benefiting from the wide range of dogs, we can investigate a universal sound pattern, as the barks are extracted while the dogs perform various activities in different scenes.

Analysis
We present preliminary statistical findings from ShibaScript, including lexical analysis and transcribing accuracy evaluation.

Lexical Analysis
During transcription, there are in total 11 types of tokens, of which 9 are phonetic symbols (Table 2); the other two are the short pauses between words and the long pauses between sentences. As in human speech, the lengths of these tokens carry ample information, so the exact token lengths are kept in ShibaScript for concrete analysis. Because the long pauses are largely determined by the scene at the time, they are not analyzed numerically here.
The mean and variance of each token length can be seen in Table 4. We find that almost every phonetic symbol has a similar length of around 0.35 s. The exceptions are the phonetic symbol [u:], a prolonged sound with an average length of 0.45 s, and the phonetic symbol [k], a relatively short-lived symbol with an average length of only 0.24 s.
Considering the monogram statistics (Figure 8) of ShibaScript, we find that the most frequent symbol is [en], which occurs 3478 times; the next two are [au] and [a]. After analyzing the monograms, we turn to the relationships between symbols, i.e., the bigrams (Figure 9) of ShibaScript. Several bigrams appear extremely frequently, which suggests that they may be associated with common semantic meanings; we will investigate this in future work. Due to space constraints, the detailed bigram information is shown in Section B.
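The monogram and bigram counts above can be reproduced with a few lines once the transcripts are available as symbol sequences. This is a sketch under our own assumptions: we count n-grams within each transcribed word, and the data layout (a list of symbol lists) is illustrative, not the released dataset format.

```python
from collections import Counter

def ngram_counts(transcripts, n=2):
    """Count n-grams of phonetic symbols over a list of transcripts,
    where each transcript is a sequence of symbols, e.g. ["en", "au", "en"].
    """
    counts = Counter()
    for symbols in transcripts:
        for i in range(len(symbols) - n + 1):
            counts[tuple(symbols[i:i + n])] += 1
    return counts
```

With n=1 this yields the monogram distribution of Figure 8; with n=2 it yields the bigram distribution of Figure 9.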

Accuracy of Transcription
In this paper, we discover a consistent phonetic pattern of Shiba Inu dogs and assign a vocal dictionary of 9 symbols, a first-step trial in this area. To evaluate both the phonetic symbol set and the overall accuracy of our transcription, we conducted an evaluation on these two aspects. The evaluation metric is the 5-level Mean Opinion Score (Viswanathan and Viswanathan, 2005). Three raters give scores to either one syllable or one word according to Table 5.

Score | Description
5 | The label exactly matches the sound.
4 | Some difference exists between the label and the sound; humans sometimes find it hard to distinguish.
3 | A difference exists between the label and the sound; humans can tell it immediately.
2 | Although the label is obviously wrong, there is some similarity between the label and the sound.
1 | The label is totally wrong.

Table 5: The evaluation metric for rating, similar to the MOS metric in speech synthesis evaluation.

Phonetic Symbol Accuracy Evaluation
For each syllable category, we randomly select 50 syllables. The rating result is shown in Figure 10. The Fleiss' kappa (Kılıç, 2015) between the three annotators is 0.609.
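The inter-annotator agreement reported here, Fleiss' kappa, can be computed directly from the rating counts. Below is a small self-contained implementation for an items-by-categories count matrix; the variable names are ours, not the paper's.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    ratings[i, j] = number of raters who assigned item i to category j;
    every row must sum to the same number of raters n.
    """
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.sum(axis=1)[0]  # raters per item
    # Per-item agreement P_i and its mean P_bar.
    p_i = (np.sum(ratings ** 2, axis=1) - n) / (n * (n - 1))
    p_bar = p_i.mean()
    # Chance agreement P_e from the marginal category proportions.
    p_j = ratings.sum(axis=0) / ratings.sum()
    p_e = np.sum(p_j ** 2)
    return (p_bar - p_e) / (1 - p_e)
```

Here each rated syllable (or word) is one item and the five MOS levels are the categories, with n = 3 raters per item.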

Word Accuracy Evaluation
For the word accuracy evaluation, we randomly select 30 words for each dog and ask the same raters who scored the phonetic symbols to score them. The rating result is shown in Figure 11. The Fleiss' kappa between the three annotators is 0.516.

Related Work
Early works on understanding animal communication never reached the point of establishing direct connections between animals' vocal or literal expressions and their meanings. In these works, researchers attempted to interpret animals in a certain aspect through classification. Among animals, dogs are popular research subjects. Considering their vocal expressions, this research can be divided into mainly three kinds: activity understanding (Ide et al., 2021; Ehsani et al., 2018; Molnár et al., 2008), emotion understanding (Hantke et al., 2018; Paladini, 2020), and individual understanding (Larranaga et al., 2015). This situation has two causes: first, we are short of ample datasets related to the expressions of dogs, and second, we have never mastered, or have seldom investigated, the language patterns of dogs.
In some datasets related to visual information about dogs (Parkhi et al., 2012; Iwashita et al., 2014; Abu-El-Haija et al., 2016), abundant data was collected from the Internet, which reduced cost and made the data extensible. In contrast, previous vocal datasets depended on manual recordings, which limits the contexts covered and is expensive.
Given this, one idea is to utilize data from the Internet when collecting vocal data, provided we design a systematic process to extract the useful fragments.
Meanwhile, previous research adopted a straightforward classification method and thus lacked sufficient investigation into the potential sound patterns of dogs. Since lexical analysis (Yule, 2022) is the fundamental step in language processing, another idea is to set up an "alphabet" of our own for dogs and transcribe dog barks into readable tokens for further research.

Conclusion
In this work, we introduce an unprecedented approach for transcribing the vocal communications of Shiba Inu dogs and release a corresponding dataset, ShibaScript. Compared to former approaches, it saves considerable cost and makes the dataset extensible, and the approach transfers easily to other animals. Most importantly, the method is a first step toward investigating the vocal patterns of dogs, bringing inspiration to the field of animal understanding.
We also present a preliminary statistical evaluation and analysis of ShibaScript. The evaluation shows that our symbol assignments in the transcripts are consistent, and the analysis reveals some interesting findings related to the lexical distribution. For future work, we can further research the semantic meanings of dog vocal expressions, because we have retained the corresponding videos.

Limitations
Dataset Noise As the audio is obtained from videos on YouTube, the quality of the videos affects the quality of the final transcript. For example, inferior recording equipment may degrade the sound; although we performed noise removal to preserve quality, residual background noise still causes some losses in the transcription process.

Relationship Between Transcripts and Scenes
In this work we obtain transcripts of Shiba Inu dogs, and we also find that the dataset covers a variety of activities and scenes. There may be an interesting relationship between the dog vocal units and the environment, including the scene and activity. However, we did not quantitatively analyze this relationship; considerably more work will be needed to discover semantic information in dog barks.

Phoneme Labeling Accuracy In Section 2.6 we cluster the syllables and assign phonetic symbols to them, and in Section 4.2.1 we evaluate the result by MOS. As can be seen in Figure 10, the accuracy score is not very high, which we aim to improve in future work.

Ethics Statement
This paper makes use only of open video data from YouTube. During transcription we focus only on the dogs' barking and make no use of the users' personal information, so the released dataset ShibaScript does not contain any personal information and hence does not breach anyone's privacy.

A Clustering Visualization
The full results of clustering can be seen in Figure 12.

B Bigram Statistical Result
Because of space restrictions, we do not show the detailed results in the main paper. The complete result is in Table 6.

C Activities and Scenes Covered by ShibaScript
44 activities and 37 scenes are covered by ShibaScript. The full statistics are in Table 7.

Figure 2: The result of sentence-level and word-level splitting of a complete audio sample.

Figure 3: The SED model predicts whether the event "barking" exists in each frame. Words are extracted from the sentences by the onset and the offset.

Figure 4: Here a word is separated into four syllables based on sonority. The complete transcript of this dog is shown in Figure 6.

Figure 5: 2-D visualization of the spectral clustering of one dog's data using t-SNE. The complete clustering of all dogs can be found in Section A.

Figure 7: The activities and scenes covered by ShibaScript. The area of each patch represents the number of users whose videos cover the activity or scene.

Figure 8: The occurrences of each monogram. The blue bars show the occurrences of each monogram across the whole of ShibaScript; the green lines show the number of dogs producing the symbol, from 1 to 16.

Figure 9: The occurrences of each bigram. The blue bars show the occurrences of each bigram across the whole of ShibaScript; the green lines show the number of dogs producing the bigram, from 1 to 16.

Figure 12: Visualization of the spectral clustering of all 16 dogs after t-SNE. Dog IDs increase from left to right and top to bottom. Phonetic symbols are assigned to the different clusters.

Table 2: The nine types of syllables of Shiba Inu dogs and their descriptions. Every description is a clickable hyperlink to an actual sound sample.

Table 3: The basic statistical information of ShibaScript.

Table 6: The frequency and coverage of the 16 dogs' bigrams. Here Freq. stands for the frequency of a given bigram, and Co. for the number of dogs that have produced it.
All scenes: bedroom, living room, dog bowl, bed, quilt, cage, by the window, under the bed, dining room, bathroom, other animals, stairs, hospital, in the arm, by the fire, cat tree, heating pad, sofa, carpet, door, lawn, beach, sea, woods, field path, road, hill, shrine, shore, cabin, stream, garden, snow, terrace, sightseeing bus, mirror, on the ice, vacuum, other dogs
All activities: open boxes, bath, eat, walk, run, bark, sleep, pick sth up, roll, lick, stretch, play with toys, play with dogs, sneeze, sniff, walk with a wheelchair, be held, be petted, listen to music, play with people, die, wears a muzzle, wade in water, be medicated, bow, bask, watch fireworks, play with cats, dig sand, climb the mountain, be vacuumed, sprawl, dig the snow, has its teeth be brushed, hum in the sleep, squat, cut nails, wear a cone collar, surf, wag the tail, blow, pee, be massaged, has its fur be brushed

Table 7:
The full statistics of the scenes and activities appearing for each user. The order of the items in the "Scene" and "Activities" columns is not statistically significant.