Language Modeling with Functional Head Constraint for Code Switching Speech Recognition

In this paper, we propose novel structured language modeling methods for code mixing speech recognition by incorporating a well-known syntactic constraint on code switching, namely the Functional Head Constraint (FHC). Code mixing data is not abundantly available for training language models. Our proposed methods alleviate this core problem for code mixing speech recognition by using bilingual data to train a structured language model with a syntactic constraint. Linguists and bilingual speakers have found that code switches do not happen between a functional head and its complements. We propose to learn the code mixing language model from bilingual data with this constraint in a weighted finite state transducer (WFST) framework. The constrained code switch language model is obtained by first expanding the search network with a translation model, and then using parsing to restrict paths to those permissible under the constraint. We implement and compare two approaches: lattice parsing enables a sequential coupling, whereas partial parsing enables a tight coupling between parsing and filtering. We tested our system on a lecture speech dataset with 16% embedded second language, and on a lunch conversation dataset with 20% embedded language. Our language models with lattice parsing and partial parsing reduce word error rates relative to a baseline mixed language model by 3.8% and 3.9% on average on the first and second tasks respectively. It outperforms the


Introduction
In multilingual communities, it is common for people to mix two or more languages in their speech. A single sentence spoken by bilingual speakers often contains the main, matrix language and an embedded second language. Linguists call this phenomenon "code switching". It is increasingly important for automatic speech recognition (ASR) systems to recognize code switching speech, as it occurs in scenarios such as meeting and interview speech, lecture speech, and conversational speech. Code switching is common among bilingual speakers of Spanish-English, Hindi-English, Chinese-English, and Arabic-English, among others. In China, lectures, meetings and conversations with technical content are frequently peppered with English terms even though the general population is not considered bilingual in Chinese and English. Unlike the thousands and tens of thousands of hours of monolingual data available to train, for example, voice search engines, the transcribed code switch data necessary for training language models is hard to come by. Code switch language modeling is therefore an even harder problem than acoustic modeling.
One approach to code switch speech recognition is to explicitly recognize the code switch points by language identification first, using phonetic or acoustic information, before applying speech recognizers for the matrix and embedded languages (Chan et al., 2004; Shia et al., 2004; Lyu and Lyu, 2008). This approach is extremely error-prone, as language identification is necessary at each frame of the speech and any error is propagated to the second, speech recognition stage, leading to fatal and irrecoverable errors.
Meanwhile, there are two general approaches to the problem of insufficient training data for language modeling. In the first approach, two language models are trained separately on the matrix and the embedded language and then interpolated (Vu et al., 2012; Chan et al., 2006). However, an interpolated language model effectively allows code switching at all word boundaries without much of a constraint. The second approach is to adapt the matrix language model with a small amount of code switch data (Tsai et al., 2010; Yeh et al., 2010; Bhuvanagiri and Kopparapu, 2010; Cao et al., 2010). The effectiveness of adaptation is also limited, as the positions of code switching points are not generalizable from the limited data.
Significant progress in speech recognition has been made by using deep neural networks for acoustic modeling and language modeling. However, the improvement thus gained on code switch speech recognition remains very small. We propose that syntactic constraints on the code switching phenomenon can help improve performance and model accuracy. Previous work using part-of-speech tags (Zhang et al., 2008; Vu et al., 2012) and our previous work using syntactic constraints (Li and Fung, 2012, 2013) have made progress in this area. Part-of-speech is relatively weak in predicting code switching points. It is generally accepted by linguists that code switching follows the so-called Functional Head Constraint, where the words at the nodes of a syntactic subtree must follow the language of the headword. If the headword is in the matrix language, then none of its complements can switch to the embedded language.
In this work, we propose two ways to incorporate the Functional Head Constraint into speech recognition and compare them. The first applies the knowledge sources in a sequential order: the acoustic model and a monolingual language model are used first to produce an intermediate lattice, then a second pass chooses the best result using the syntactic constraints. The second uses tight coupling: we propose using a structured language model (Chelba and Jelinek, 2000) to build the syntactic structure incrementally.
Following our previous work, we suggest incorporating the acoustic model, the monolingual language model and a translation model into a WFST framework. Using a translation model allows us to learn, with context information, what happens when one language switches to another. We motivate and describe this WFST framework for code switching speech recognition in the next section. The Functional Head Constraint is described in Section 3. The proposed code switch language models and their coupling with speech recognition are described in Section 4. The experimental setup and results are presented in Section 5. Finally, we conclude in Section 6.

Code Switch Language Modeling in a WFST Framework
As code switch text data is scarce, we do not have enough data to train the language model for code switch speech recognition. We propose instead to combine a language model trained on the matrix language with a translation model to obtain a code switch language model. We propose to integrate a bilingual acoustic model (Li et al., 2011) and the code switch language model in a weighted finite state transducer framework as follows. Suppose $X$ denotes the observed code switch speech vector and $w_1^J$ denotes a word sequence in the matrix language. The hypothesis transcript $\hat{v}_1^I$ is as follows:

$$\hat{v}_1^I = \arg\max_{v_1^I} P(v_1^I \mid X) = \arg\max_{v_1^I} P(X \mid v_1^I)\, P(v_1^I)$$

where $P(X \mid v_1^I)$ is the acoustic model and $P(v_1^I)$ is the language model in the mixed language.
Our code switch language model is obtained from a translation model $P(v_1^I \mid w_1^J)$ from the matrix language to the mixed language, and the language model in the matrix language $P(w_1^J)$. Instead of word-to-word translation, the transduction of the context-dependent lexicon transfer is constrained by previous words: we assume that the transduction depends on the previous $n$ words.
There are C-level and H-level search networks in the WFST framework. The C-level search network is composed of the universal phone model $P$, the context model $C$, the lexicon $L$, and the grammar $G$. The H-level search network is composed of the state model $H$, the phoneme model $P$, the context model $C$, the lexicon $L$, and the grammar $G$. The C-level search network requires less memory than the H-level search network. We propose to use a weighted finite state transducer framework incorporating the bilingual acoustic model $P$, the context model $C$, the lexicon $L$, and the code switching language model $G_{CS}$ into a C-level search network for mixed language speech recognition. The recognition output is in the mixed language after projection $\pi(G_{CS})$.
In the WFST implementation, the code switch language model $G_{CS}$ is obtained by composing the matrix language model $G$ with the translation model $T$, where $P_l(\tilde{v}_l \mid w_l)$ is the probability of $w_l$ being translated into $\tilde{v}_l$. In order to make use of text data in the matrix language to recognize speech in the mixed language, the translation model $P(v_1^I \mid w_1^J)$ transduces the language model in the matrix language to the mixed language.
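The idea of pushing a matrix language model through a translation model can be illustrated without any WFST machinery. The sketch below is a deliberately simplified stand-in, not the transducer composition used in this work: it assumes a unigram matrix language model and a word-level translation table, and marginalizes out the matrix word. The function name `mixed_unigram` and all probability values are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy models; the values are illustrative only.
matrix_lm = {"理論": 0.6, "東西": 0.4}            # P(w), unigram over matrix words
translation = {                                   # P(v | w), matrix word -> mixed word
    "理論": {"理論": 0.7, "theory": 0.3},
    "東西": {"東西": 0.9, "something": 0.1},
}

def mixed_unigram(matrix_lm, translation):
    """Marginalise out the matrix word: P(v) = sum_w P(v | w) * P(w)."""
    mixed = defaultdict(float)
    for w, p_w in matrix_lm.items():
        for v, p_v_given_w in translation[w].items():
            mixed[v] += p_v_given_w * p_w
    return dict(mixed)

mixed_lm = mixed_unigram(matrix_lm, translation)
for word, prob in sorted(mixed_lm.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{prob:.2f}")
```

The resulting distribution covers both matrix and embedded words even though no code switch text was used to estimate it, which is the property the translation model is meant to provide.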
The word-to-phrase segmentation model extracts a table of phrases $\{\tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_K\}$ for the transcript in the embedded language and $\{w_1, w_2, \ldots, w_K\}$ for the transcript in the matrix language, based on word-to-word alignments trained in both directions with GIZA++ (Och and Ney, 2003). The chunk segmentation model segments a phrase sequence $w_1^K$ into $L$ phrases $\{c_1, c_2, \ldots, c_L\}$ using a segmentation weighted finite-state transducer. Assuming that each chunk $c_l$ is code-switched to the embedded language independently of the other chunks, the chunk-to-chunk transduction model gives the probability of a chunk being code switched to the embedded language, trained on parallel data. The reconstruction model generates word sequences from chunk sequences and operates in the opposite direction to the segmentation model.
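The independence assumption in the chunk-to-chunk transduction model can be made concrete with a small enumeration. The following is a minimal sketch, assuming the segmentation has already been done and that each chunk has at most one embedded-language counterpart; the function `switch_chunks`, the table layout and the probabilities are hypothetical.

```python
from itertools import product

def switch_chunks(chunks, switch_table):
    """Enumerate mixed-language outputs when each chunk switches independently.

    chunks: list of matrix-language chunks.
    switch_table: chunk -> (embedded_chunk, p_switch); chunks missing from
    the table are assumed never to switch.
    Returns a list of (output_chunks, probability) pairs.
    """
    options = []
    for chunk in chunks:
        if chunk in switch_table:
            embedded, p_switch = switch_table[chunk]
            options.append([(chunk, 1.0 - p_switch), (embedded, p_switch)])
        else:
            options.append([(chunk, 1.0)])
    results = []
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p          # independence: multiply per-chunk probabilities
        results.append(([c for c, _ in combo], prob))
    return results

# Hypothetical example: only the second chunk has an embedded counterpart.
outputs = switch_chunks(["這個", "理論"], {"理論": ("theory", 0.3)})
for chunks_out, prob in outputs:
    print(" ".join(chunks_out), round(prob, 2))
```

Because every chunk is treated independently, the number of outputs grows exponentially with the number of switchable chunks, which is one reason a syntactic constraint is useful for pruning.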

Functional Head Constraint
Many linguists (Abney, 1986; Belazi et al., 1994; Bhatt, 1994) have discovered the so-called Functional Head Constraint in code switching. They have found that code switches between a functional head (a complementizer, a determiner, an inflection, etc.) and its complement (sentence, noun phrase, verb phrase) do not happen in natural speech. In addition, the Functional Head Constraint is language independent.
In this work, we propose to investigate and incorporate the Functional Head Constraint into code switching language modeling in a WFST framework. Figure 1 shows an example of the Functional Head Constraint. Functional heads are the roots of the subtrees and complements are part of the subtrees. Actual words are the leaf nodes. According to the Functional Head Constraint, the leaf nodes of a subtree must all be in either the matrix language or the embedded language, following the language of the functional head. For instance, the third word "東西/something" is the head of the constituent "非常/very 重要的/important 東西/something". It is therefore not permissible to code switch within this constituent: the language of the constituent is constrained to be the same as the language of the headword. In the following sections, we describe the integration of the Functional Head Constraint and the language model.
We have found this constraint to be empirically sound as we look into our collected code mixing speech and language data. The only violation of the constraint comes from rare cases of borrowed words such as brand names with no translation in the local, matrix language. Borrowed words are used even by monolingual speakers so they are in general part of the matrix language lexicon and require little, if any, special treatment in speech recognition.
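As an illustration of how the constraint can be checked on a parsed hypothesis, the sketch below walks a tree and verifies that every constituent marked as headed by a functional head has all of its leaves in the language of its headword. The `Node` class, the `functional` flag, the placeholder leaf label and the language tags are assumptions made for this example, not the representation used in our system.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    word: Optional[str] = None      # set on leaves only
    lang: Optional[str] = None      # language tag of a leaf, e.g. "zh" or "en"
    children: List["Node"] = field(default_factory=list)
    head: int = 0                   # index of the head child
    functional: bool = False        # True if the constraint applies to this node

def leaves(node):
    if not node.children:
        return [node]
    return [lf for child in node.children for lf in leaves(child)]

def head_leaf(node):
    # Follow head children down to the headword.
    while node.children:
        node = node.children[node.head]
    return node

def satisfies_fhc(node):
    """True if no constrained constituent mixes languages."""
    if not node.children:
        return True
    if node.functional:
        lang = head_leaf(node).lang
        if any(lf.lang != lang for lf in leaves(node)):
            return False
    return all(satisfies_fhc(child) for child in node.children)

def leaf(word, lang):
    return Node(label="X", word=word, lang=lang)   # "X" is a placeholder tag

# The constituent from the example in the text, with its head as third word.
zh_phrase = Node("NP", head=2, functional=True, children=[
    leaf("非常", "zh"), leaf("重要的", "zh"), leaf("東西", "zh")])
mixed_phrase = Node("NP", head=2, functional=True, children=[
    leaf("非常", "zh"), leaf("important", "en"), leaf("東西", "zh")])

print(satisfies_fhc(zh_phrase), satisfies_fhc(mixed_phrase))
```

Borrowed words, which the text treats as part of the matrix lexicon, would simply carry the matrix language tag in such a representation.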

Code Switching Language Modeling with Functional Head Constraint
We propose two approaches to language modeling with the Functional Head Constraint: 1) lattice parsing with sequential coupling (Chappelier et al., 1999); 2) partial parsing with tight coupling (Chappelier et al., 1999). The two approaches are described in the following sections.

Sequential-coupling by Lattice-based Parsing
In this first approach, the acoustic models, the code switch language model and the syntactic constraint are incorporated in a sequential order to progressively constrain the search. The acoustic models and the matrix language model are used first to produce an intermediate output. The intermediate output is a lattice in which word sequences are compactly represented. Lattice-based parsing is then used to expand the word lattice generated by the first decoding step according to the Functional Head Constraint.
There are two reasons to use a word lattice instead of an N-best list. First, a word lattice contains more hypotheses than an N-best list. Second, if an N-best list were extracted after the first decoding step, it would already contain various kinds of errors attributable to the first-pass language model, and a second pass over the N-best list could not use the language model with the Functional Head Constraint to correct them. In order to obtain a computationally feasible number of hypotheses without bias towards the language model of the first decoding step, a word lattice is used as the intermediate output of the first decoding step.
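The compactness argument can be illustrated by counting the distinct hypotheses stored in a small lattice. This is a minimal sketch, assuming the lattice is a directed acyclic graph whose node identifiers are integers ordered by depth; the function `count_paths` and the toy arcs are hypothetical.

```python
def count_paths(arcs, start, end):
    """Count distinct start-to-end paths in a lattice.

    arcs: list of (src, dst, word) with src < dst, i.e. node ids are
    assumed to be topologically ordered.
    """
    counts = {start: 1}
    # Sorting by source node guarantees that all arcs entering a node
    # have been processed before any arc leaving it.
    for src, dst, _ in sorted(arcs):
        counts[dst] = counts.get(dst, 0) + counts.get(src, 0)
    return counts.get(end, 0)

# Hypothetical lattice: three positions with two competing words each.
arcs = [
    (0, 1, "this"), (0, 1, "these"),
    (1, 2, "EM"), (1, 2, "yam"),
    (2, 3, "theory"), (2, 3, "theories"),
]
print(len(arcs), "arcs encode", count_paths(arcs, 0, 3), "hypotheses")
```

Six arcs here stand for eight word sequences, and the gap widens exponentially with sentence length, whereas an N-best list stores each hypothesis explicitly.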
A Probabilistic Context-Free Grammar (PCFG) parser is trained on Penn Treebank data. The PCFG parser is generalized to take the lattice generated by the recognizer as its input. Figure 2 illustrates a word lattice, which is a compact representation of the hypothesis transcriptions of an input sentence. All the nodes of the word lattice are ordered by increasing depth.
A CYK table is obtained by associating the arcs with their start and end states in the lattice instead of their sentence positions, and all the cells in the table corresponding to the arcs are initialized (Chappelier et al., 1999). Each cell $C_{k,j}$ of the table is filled with an n-tuple of the non-terminal $A$, the length $k$ and the starting position $j$ of the word sequence $w_j \ldots w_{j+k}$ if there exists a PCFG rule $A \rightarrow w_j \ldots w_{j+k}$, where $A$ is a non-terminal that parses the word sequence $w_j \ldots w_{j+k}$. In order to take all hypothesis transcriptions of the word lattice into account, multiple word sequences of the same length and starting point are initialized in the same cell. Figure 2 maps the word lattice of the example to the table, where the starting node label of the arc is the column index and the length of the arc is the row index.
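The initialization step can be sketched as follows. The code assumes lattice nodes are consecutive integers ordered by depth, so that the length of an arc is the difference between its end and start node; acoustic scores on the arcs are ignored. The function `init_table`, the lexical rules and their probabilities are hypothetical.

```python
from collections import defaultdict

def init_table(arcs, lexical_rules):
    """Initialise the CYK table from lattice arcs.

    arcs: list of (start_node, end_node, word).
    lexical_rules: word -> list of (nonterminal, probability).
    Returns a dict mapping (length, start) to a list of
    (nonterminal, length, start, word, probability) tuples.
    """
    table = defaultdict(list)
    for start, end, word in arcs:
        length = end - start                     # row index of the cell
        for nonterminal, prob in lexical_rules.get(word, []):
            table[(length, start)].append(
                (nonterminal, length, start, word, prob))
    return table

# Hypothetical lattice: two competing words on 0->1 and one arc spanning 0->2.
arcs = [(0, 1, "this"), (0, 1, "these"), (1, 2, "theory"), (0, 2, "history")]
lexical_rules = {
    "this": [("DT", 0.6)], "these": [("DT", 0.4)],
    "theory": [("NN", 0.5)], "history": [("NN", 0.5)],
}
table = init_table(arcs, lexical_rules)
for cell, entries in sorted(table.items()):
    print(cell, entries)
```

Competing words with the same start node and length end up in the same cell, which is how all lattice hypotheses stay available to the parser.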
The sequential coupling by lattice parsing consists of the standard cell-filling step and the self-filling step. First, the cells $C_{k,j}$ and $C_{i-k,j+k}$ are combined to produce a new interpretation for cell $C_{i,j}$. Then, in order to handle unary context-free productions $A \rightarrow B$ and update the cells after the standard cell-filling, an n-tuple of $A$, $i$ and $j$ is added for each n-tuple of the non-terminal $B$, the length $i$ and the start $j$ in the cell $C_{i,j}$. The parse trees associated with the input lattice are extracted from the table starting from the non-terminal label of the top cell. After a parse tree is obtained, we recursively enumerate all its subtrees. Each subtree is able to code-switch to the embedded language with a translation probability $P_l(\tilde{v}_l \mid w_l)$. The lattice parsing operation consists of an encoding of a given word sequence along with a parse tree $(W, T)$ and a sequence of elementary model actions. In order to obtain a correct probability assignment $P(W, T)$, one simply assigns proper conditional probabilities to each transition of the weighted finite-state machine.
[Figure text: Hypotheses: "EM", "EM theory.", "this EM theory.", "is this EM theory.", "something is this EM theory." (not permissible)]
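The two filling steps can be sketched on a table that has already been initialized from the lattice arcs. This is a simplified illustration under stated assumptions: each cell keeps only the best probability per non-terminal (Viterbi style), back-pointers for tree extraction are omitted, and the grammar rules and probabilities are hypothetical.

```python
def unary_closure(cell, unary_rules):
    """Self-filling: for each B in the cell and rule A -> B, add A."""
    changed = True
    while changed:
        changed = False
        for (a, b), p in unary_rules.items():
            if b in cell and p * cell[b] > cell.get(a, 0.0):
                cell[a] = p * cell[b]
                changed = True

def fill_table(table, n_nodes, binary_rules, unary_rules):
    """Fill cells bottom-up by span length.

    table: dict (length, start) -> {nonterminal: best probability}.
    n_nodes: number of lattice nodes, numbered 0 .. n_nodes - 1.
    binary_rules: (A, B, C) -> P(A -> B C); unary_rules: (A, B) -> P(A -> B).
    """
    for length in range(1, n_nodes):
        for start in range(0, n_nodes - length):
            cell = table.setdefault((length, start), {})
            # Standard cell-filling: combine C[k, j] with C[i - k, j + k].
            for k in range(1, length):
                left = table.get((k, start), {})
                right = table.get((length - k, start + k), {})
                for (a, b, c), p in binary_rules.items():
                    if b in left and c in right:
                        cand = p * left[b] * right[c]
                        if cand > cell.get(a, 0.0):
                            cell[a] = cand
            unary_closure(cell, unary_rules)
    return table

# Hypothetical two-word lattice and toy grammar.
table = {(1, 0): {"DT": 1.0}, (1, 1): {"NN": 0.5}}
binary_rules = {("NP", "DT", "NN"): 0.8}
unary_rules = {("S", "NP"): 0.1}
fill_table(table, 3, binary_rules, unary_rules)
print(table[(2, 0)])
```

A full implementation would also store, for each entry, which child cells produced it, so that parse trees can be read off from the top cell as described above.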

The probability $P(W, T)$ of a parse $T$ of a word sequence $W$ can be calculated as the product of the probabilities of its subtrees, where $W_k = w_0 \ldots w_k$ is the first $k$ words in the sentence and $(W_k, T_k)$ is the word-and-parse $k$-prefix. The probability of the n-tuple of the non-terminal $A$, the length $i$ and the starting position $j$ is the probability of the subtree corresponding to $A$ parsing the sequence $w_j \ldots w_{j+i-1}$. The probability of a partial parse is the product of the probabilities of the subtree parses it is made of. The probability of an n-tuple is the maximum over the probabilities of its possible parsing paths.
The N most probable parses are obtained during lattice parsing.
The probability of a sentence is computed by adding on the probability of each new context-free rule in the sentence.

Tight-coupling by Incremental Parsing
To integrate the acoustic models, the language model and the syntactic constraint in time-synchronous decoding, an incremental operation is used in this approach. The final word-level probability assigned by our model is calculated using the acoustic models, the matrix language model, the structured language model and the translation model. The structured language model uses the probabilistic parameterization of a shift-reduce parser (Chelba and Jelinek, 2000). The tightly coupled language model consists of three transducers: the word predictor, the tagger and the constructor. As shown in Figure 3, $W_k = w_0 \ldots w_k$ is the first $k$ words of the sentence, and $T_k$ contains only those binary subtrees whose leaves are completely included in $W_k$, excluding $w_0 = \langle s \rangle$. Single words along with their POS tags can be regarded as root-only trees. The exposed head $h_k$ is a pair of the headword of the constituent $W_k$ and the non-terminal label. The exposed heads of single words are pairs of the words and their POS tags.
Given the word-and-parse $(k-1)$-prefix $W_{k-1}T_{k-1}$, the new word $w_k$ is predicted by the word-predictor $P(w_k \mid W_{k-1}T_{k-1})$. Taking the word-and-parse $(k-1)$-prefix and the next word as input, the tagger $P(t_k \mid w_k, W_{k-1}T_{k-1})$ gives the POS tag $t_k$ of the word $w_k$. The constructor $P(p_i^k \mid W_k T_k)$ assigns a non-terminal label to the constituent $W_{k+1}$. The headword of the newly built constituent is inherited from either the headword of the constituent $W_k$ or the next word $w_{k+1}$.
The probability $P(W, T)$ of a word sequence $W$ and a complete parse $T$ can be calculated as:

$$P(W, T) = \prod_{k=1}^{n+1} \Big[ P(w_k \mid W_{k-1}T_{k-1}) \cdot P(t_k \mid w_k, W_{k-1}T_{k-1}) \cdot \prod_{i=1}^{N_k} P(p_i^k \mid W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \Big]$$

where the product runs over the $n$ words of the sentence and the end-of-sentence marker; $w_k$ is the word predicted by the word-predictor; $t_k$ is the POS tag of the word $w_k$ predicted by the tagger; $W_{k-1}T_{k-1}$ is the word-parse $(k-1)$-prefix; $T_{k-1}^k$ is the incremental parse structure that generates $T_k = T_{k-1} \| T_{k-1}^k$ when attached to $T_{k-1}$, that is, the parse structure built on top of $T_{k-1}$ and the newly predicted word $w_k$; the $\|$ notation stands for concatenation; $N_k - 1$ is the number of operations the constructor executes at position $k$ of the input string before passing control to the word-predictor (the $N_k$-th operation at position $k$ is the null transition); $N_k$ is a function of $T$; and $p_i^k$ denotes the $i$-th constructor action carried out at position $k$ in the word string.
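The factorisation into word-predictor, tagger and constructor terms can be evaluated in the log domain once the three component probabilities are available for every position. The sketch below assumes these probabilities have already been looked up; the function `log_prob_word_and_parse`, the dictionary layout and the numbers are hypothetical.

```python
import math

def log_prob_word_and_parse(steps):
    """Sum the log-probabilities of all model actions for one (W, T).

    steps: one dict per position k with
      'word':    P(w_k | W_{k-1} T_{k-1})           (word-predictor)
      'tag':     P(t_k | w_k, W_{k-1} T_{k-1})      (tagger)
      'actions': list of constructor probabilities for i = 1 .. N_k,
                 the last one being the null transition.
    """
    total = 0.0
    for step in steps:
        total += math.log(step["word"]) + math.log(step["tag"])
        total += sum(math.log(p) for p in step["actions"])
    return total

# Hypothetical two-position example.
steps = [
    {"word": 0.5, "tag": 0.8, "actions": [1.0]},
    {"word": 0.25, "tag": 0.5, "actions": [0.5, 1.0]},
]
log_p = log_prob_word_and_parse(steps)
print(math.exp(log_p))
```

Working with log-probabilities avoids numerical underflow for long sentences and matches the additive weights used on transducer arcs.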
The probability models of the word-predictor, tagger and constructor are initialized from the UPenn Treebank with headword percolation and binarization. The headwords are percolated using a context-free approach based on rules that predict the position of the headword of a constituent. The approach consists of three steps. First, a parse tree is decomposed into phrase constituents. Then the headword position of each constituent is identified. Finally, the position is filled in with the actual word, percolated up recursively from the leaves of the tree.
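Rule-based headword percolation can be sketched as a bottom-up recursion over the tree. The head rules below are hypothetical placeholders rather than the rule set used in this work, and the tree encoding (a leaf is `(tag, word)`, an internal node is `(label, [children])`) is an assumption for the example.

```python
# Hypothetical head rules: label -> (search direction, child labels by priority).
HEAD_RULES = {
    "NP": ("right", ["NN", "NP"]),
    "VP": ("left", ["VB", "VP"]),
    "S": ("left", ["VP", "NP"]),
}

def percolate(tree):
    """Return (label, headword, children) with headwords filled in bottom-up."""
    label, rest = tree
    if isinstance(rest, str):                 # leaf: the word is its own head
        return (label, rest, [])
    kids = [percolate(child) for child in rest]
    direction, priorities = HEAD_RULES.get(label, ("left", []))
    order = kids if direction == "left" else list(reversed(kids))
    for wanted in priorities:                 # first matching rule wins
        for kid in order:
            if kid[0] == wanted:
                return (label, kid[1], kids)
    return (label, order[0][1], kids)         # fallback: first child in order

tree = ("S", [
    ("NP", [("DT", "the"), ("NN", "theory")]),
    ("VP", [("VB", "works")]),
])
annotated = percolate(tree)
print(annotated[1], annotated[2][0][1])
```

Each constituent inherits the headword of exactly one child, which is what later allows the language of a whole constituent to be tied to the language of its head.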
Instead of UPenn Treebank-style trees, we use more convenient binary branching trees. The parse trees are binarized using a rule-based approach.
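One simple way to binarize a tree is to fold children from the right under intermediate nodes. The sketch below uses this right-branching scheme purely as an illustration; the actual binarization rules are not specified here, and the primed intermediate label and the tags in the example are hypothetical.

```python
def binarize(tree):
    """Right-branching binarisation.

    tree: (tag, word) for a leaf or (label, [children]) otherwise.
    Nodes with one or two children are kept unchanged.
    """
    label, rest = tree
    if isinstance(rest, str):
        return tree
    kids = [binarize(child) for child in rest]
    while len(kids) > 2:
        # Fold the two rightmost children under an intermediate node.
        folded = (label + "'", [kids[-2], kids[-1]])
        kids = kids[:-2] + [folded]
    return (label, kids)

def max_branching(tree):
    """Largest number of children of any node (used to check the result)."""
    label, rest = tree
    if isinstance(rest, str):
        return 0
    return max([len(rest)] + [max_branching(child) for child in rest])

flat = ("NP", [("RB", "非常"), ("JJ", "重要的"), ("NN", "東西")])
binary = binarize(flat)
print(binary)
```

Binary trees keep the constructor's action set small, since every reduction combines exactly two adjacent constituents.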
The probability models of the word-predictor, tagger and constructor are trained in a maximum likelihood manner. The possible POS tag assignments, binary branching parses, non-terminal labels and headword annotations for a given sentence are hidden. We re-estimate them using the EM algorithm.
Instead of generating only the complete parse, all parses for all the subsequences of the sentence are produced. The headwords of the subtrees are code switched to the embedded language with a translation probability P l (ṽ l |w l ) as well as the leaves.

Decoding by Translation
With either lattice parsing or partial parsing, two-pass decoding is needed to recognize code switch speech. A computationally feasible first pass generates an intermediate result so that the language model with the Functional Head Constraint can be used in the second pass. The first decoding pass is composed of the transducer of the universal phoneme model $P$, the transducer $C$ from context-dependent phones to context-independent phones, the lexicon transducer $L$, which maps context-independent phone sequences to word strings, and the transducer of the language model $G$. A T3 decoder is used in the first pass.
Instead of an N-best list, a word lattice is used as the intermediate output of the first decoding step.
The language model $G_{CS}$ of the transducer in the second pass is improved from $G$ by composing it with the translation model $P_l(\tilde{v}_l \mid w_l)$. Finally, the recognition transducer is optimized by determinization and minimization operations.

Experimental Setup
The bilingual acoustic model used for our mixed language ASR is trained on 160 hours of speech from GALE Phase 1 Chinese broadcast conversation, 40 hours of speech from GALE Phase 1 English broadcast conversation, and 3 hours of in-house nonnative English data. The acoustic features used in our experiments consist of 39 components (13 MFCC, 13 ΔMFCC, 13 ΔΔMFCC using cepstral mean normalization), which are analyzed at a 10 msec frame rate with a 25 msec window size. The acoustic models used throughout our paper are state-clustered cross-word tri-phone HMMs with 16 Gaussian mixture output densities per state. The phone set consists of 21 standard Mandarin initials, 37 Mandarin finals, 6 zero initials and 6 extended English phones. The pronunciation dictionary is obtained by modifying Mandarin and English dictionaries using this phone set. The acoustic models are reconstructed
For the language models, transcriptions of 18 hours of Data 1 are used to train a baseline mixed language model for the lecture speech domain. 250,000 sentences from Chinese speech conference papers, PowerPoint slides and web data are used to train a baseline Chinese matrix language model for the lecture speech domain (LM 1). Transcriptions of 2 hours of Data 2 are used to train the baseline mixed language model for the lunch conversation domain. 250,000 sentences of the GALE Phase 1 Chinese conversational speech transcriptions are used to train a Chinese matrix language model (LM 2). 250,000 sentences of GALE Phase 1 English conversational speech transcriptions are used to train the English embedded language model (LM 3). To train the bilingual translation model, the Chinese GALE Phase 1 conversational speech transcriptions are used to generate a bilingual corpus using machine translation. For comparison, an interpolated language model for the lunch conversation domain is trained by interpolating LM 2 with LM 3.
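Interpolating a matrix and an embedded language model can be illustrated at the unigram level. The sketch below assumes plain linear interpolation with a single weight; the interpolation scheme and weight actually used for the baseline are not stated here, so the function `interpolate`, the weight `lam` and the toy distributions are hypothetical.

```python
def interpolate(p_matrix, p_embedded, lam=0.5):
    """Linear interpolation: lam * P_matrix(w) + (1 - lam) * P_embedded(w)."""
    vocab = set(p_matrix) | set(p_embedded)
    return {
        w: lam * p_matrix.get(w, 0.0) + (1.0 - lam) * p_embedded.get(w, 0.0)
        for w in vocab
    }

# Hypothetical unigram models standing in for LM 2 and LM 3.
lm2 = {"我們": 0.7, "東西": 0.3}
lm3 = {"we": 0.6, "something": 0.4}
mixed = interpolate(lm2, lm3, lam=0.8)
for word in sorted(mixed, key=mixed.get, reverse=True):
    print(word, round(mixed[word], 2))
```

Every word of either language receives non-zero probability in every context, which is why an interpolated model places no restriction on where a switch may occur.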
Also for comparison, an adapted language model for lecture speech is trained from LM 1 and transcriptions of 18 hours of Data 1. An adapted language model for conversation is trained from LM 2 and 2 hours of Data 2. The size of the vocabulary for recognition is 20k words. The perplexity of the baseline language model trained on the code switching speech transcriptions is 236 on the lecture speech test set and 279 on the conversation speech test set.
Table 1 reports the precision, recall and F-measure of code switching points in the recognition results of the baseline and our proposed language models. Our proposed code switching language models with the Functional Head Constraint improve both precision and recall of code switching point detection on the code switching lecture speech and lunch conversation 4.48%. Our method with tight coupling increases the F-measure by 9.38% relative on the lecture speech and by 6.90% relative on the lunch conversation compared to the baseline adapted language model.
Table 2 shows the word error rates (WERs) of experiments on the code switching lecture speech and Table 3 shows the WERs on the code switching lunch conversations. Our proposed code switching language model with the Functional Head Constraint by sequential coupling reduces the WER of the baseline mixed language model by 3.72% relative on Test 1, and by 5.85% on Test 2. Our method with tight coupling also reduces WER by 2.51% relative compared to the baseline language model on Test 1, and by 4.57% on Test 2. We use the speech recognition scoring toolkit (SCTK) developed by the National Institute of Standards and Technology to compute the significance levels, based on a two-proportion z-test comparing the recognition results of our proposed approach and the baseline. All the WER reductions are statistically significant.
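The quantities reported in this section (relative change, F-measure and a two-proportion z statistic) follow standard textbook definitions, sketched below. This is not the SCTK implementation; the function names are hypothetical and the error counts passed to the z-test are invented purely for illustration. The F-measures 0.67 and 0.70 are the values quoted in the conclusion.

```python
import math

def relative_change(baseline, system):
    """Relative change of `system` with respect to `baseline`."""
    return (system - baseline) / baseline

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)

def two_proportion_z(errors_a, n_a, errors_b, n_b):
    """z statistic for the difference of two error proportions (pooled)."""
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / std_err

# F-measure going from 0.67 to 0.70 is a relative gain of about 4.48%.
print(f"{100 * relative_change(0.67, 0.70):.2f}%")

# Hypothetical word error counts: 300 vs 250 errors out of 1000 words each.
z = two_proportion_z(300, 1000, 250, 1000)
print(round(z, 2))
```

A relative WER reduction is the same computation with the sign flipped, that is, `-relative_change(baseline_wer, system_wer)`.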
For reference, we also compare the performance of using the Functional Head Constraint to that of using the inversion constraint in (Li and Fung, 2012, 2013) and find that the present model reduces WER by 0.85% on Test 2 but gives no improvement on Test 1. We hypothesize that since Test 1 has mostly Chinese words, the proposed method is not as advantageous compared to our previous work. A future direction is to improve the lattice parser, as we believe this will lead to further improvement in the final result of our proposed method.

Conclusion
In this paper, we propose using lattice parsing and partial parsing to incorporate a well-known syntactic constraint for code mixing speech, namely the Functional Head Constraint, into a continuous speech recognition system. Under the Functional Head Constraint, a code switch cannot occur between a functional head and its complements. Since code mixing speech data is scarce, we propose to instead learn the code mixing language model from bilingual data with this constraint. The constrained code switching language model is obtained by first expanding the search network with a translation model, and then using parsing to restrict paths to those permissible under the constraint. Lattice parsing enables a sequential coupling of parsing followed by constraint filtering, whereas partial parsing enables a tight coupling between parsing and filtering. A WFST-based decoder then combines a bilingual acoustic model and the proposed code switch language model in an integrated approach. Lattice-based parsing and partial parsing are used to provide the syntactic structure of the matrix language. Matrix words at the leaf nodes of the syntax tree are permitted to switch to the embedded language if the switch does not violate the Functional Head Constraint. This reduces the permissible search paths from those expanded by the bilingual language model. We tested our system on a lecture speech dataset with 16% embedded second language, and on a lunch conversation dataset with 20% embedded second language. Our language models with lattice parsing and partial parsing reduce word error rates relative to a baseline mixed language model by 3.72% to 3.89% in the first task, and by 5.85% to 5.97% in the second task. Relative to an interpolated language model, word error rates are reduced by 3.69% to 3.74% in the first task and by 5.46% to 5.77% in the second task. WER reductions relative to an adapted language model are 2.51% to 2.63% and 4.47% to 4.74% in the two tasks respectively.
The F-measure for code switch point detection is improved from 0.64 by the interpolated model to 0.68, and from 0.67 by the adapted model to 0.70 by our method. Our proposed approach avoids making early decisions on code-switch boundaries and is therefore more robust. Our approach also avoids the bottleneck of code switch data scarcity by using bilingual data with syntactic structure. Moreover, our method reduces word error rates for both the matrix and the embedded language.