Recognizing Reduplicated Forms: Finite-State Buffered Machines

Total reduplication is common in natural language phonology and morphology. However, when formalized as copying of reduplicants of unbounded size, unrestricted total reduplication requires computational power beyond context-free, while other phonological and morphological patterns are regular, or even sub-regular. Thus, existing language classes that characterize reduplicated strings inevitably include typologically unattested context-free patterns, such as reversals. This paper extends regular languages to incorporate reduplication by introducing a new computational device: finite-state buffered machines (FSBMs). We give their mathematical definition and discuss some closure properties of the corresponding set of languages. As a result, the class of regular languages and languages derived from them through a copying mechanism is characterized. As suggested by previous literature, this class of languages should approach the characterization of natural language word sets.


The Puzzle of (Total) Reduplication
Formal language theory (FLT) provides computational mechanisms characterizing different classes of abstract languages based on their inherent structures. Following FLT in the study of human languages, in principle, researchers would expect a hierarchy of grammar formalisms that matches empirical findings: more complex languages in such a hierarchy are supposed to be 1) less common in natural language typology; and 2) harder for learners to learn.
The classical Chomsky Hierarchy (CH) puts formal languages into four levels with increasing complexity: regular, context-free, context-sensitive, recursively enumerable (Chomsky, 1956;Jäger and Rogers, 2012). Does the CH notion of formal complexity have the desired empirical correlates? Several findings suggest that those four levels do not align with natural languages precisely, some leading to major refinements on the CH. First, the unbounded crossing dependencies in Swiss-German case marking (Shieber, 1985) facilitated attempts to characterize mildly context-sensitive languages (MCS), which extend context-free languages (CFLs) but still preserve some useful properties of CFLs (e.g., Joshi, 1985;Seki et al., 1991;Stabler, 1997). Secondly, it is generally accepted that phonology is regular (e.g. Johnson, 1972;Kaplan and Kay, 1994). However, being regular is argued to be an unrestrictive property for phonological well-formed strings: for example, a language whose words are sensitive to an even or odd number of certain sounds is unattested (Heinz, 2018). With strong typological evidence, the sub-regular hierarchy was further developed, which continues to be an active area of research (e.g., McNaughton and Papert, 1971;Simon, 1975;Heinz, 2007;Heinz et al., 2011;Chandlee, 2014;Graf, 2017).
In this paper, we analyze another mismatch between existing well-known language classes and empirical findings: reduplication, which involves copying operations on certain base forms (Inkelas and Zoll, 2005). The reduplicated phonological strings are either of total identity (total reduplication) or of partial identity (partial reduplication) to the base forms. Table 1 provides examples showing the difference between total reduplication and partial reduplication: in Dyirbal, the pluralization of nominals is realized by fully copying the singular stems, while in Agta examples, plural forms only copy the first CVC sequence of the corresponding singular forms (Healey, 1960;Marantz, 1982).
Reduplication is common cross-linguistically. According to Rubino (2013) and Dolatian and Heinz (2020), 313 out of 368 natural languages exhibit productive reduplication, of which 35 languages have only total reduplication, but not partial reduplication. By comparison, it is widely recognized that context-free string reversals are rare in phonology and morphology (Marantz, 1982) and appear to be confined to language games (Bagemihl, 1989). Unrestricted total reduplication, or unbounded copying, can be abstracted as $L_{ww} = \{ww \mid w \in \Sigma^*\}$, a well-known non-context-free language (Culy, 1985; Hopcroft and Ullman, 1979).¹ Its non-context-freeness comes from the incurred crossing dependencies among symbols, similar to Swiss-German case marking constructions. By contrast, the typologically rare string reversals $ww^R$ exhibit nesting dependencies, which are context-free (see Fig. 1 for an illustration).
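The contrast between crossing and nesting dependencies can be made concrete with a small sketch. The function names below are illustrative only and do not come from the paper; the sketch simply lists which positions must match in $ww$ and in $ww^R$ for $|w| = 3$ and checks how the resulting arcs relate to each other.

```python
from itertools import combinations

def copy_dependencies(n):
    """Arcs (i, j) linking position i of the first w to its match in ww."""
    return [(i, i + n) for i in range(n)]

def reversal_dependencies(n):
    """Arcs (i, j) linking position i of w to its mirror position in ww^R."""
    return [(i, 2 * n - 1 - i) for i in range(n)]

def crosses(p, q):
    """Two arcs cross when they overlap without one containing the other."""
    (a, b), (c, d) = sorted([p, q])
    return a < c < b < d

def nests(p, q):
    """Two arcs nest when one lies strictly inside the other."""
    (a, b), (c, d) = sorted([p, q])
    return a < c and d < b

copy_arcs = copy_dependencies(3)        # [(0, 3), (1, 4), (2, 5)]
mirror_arcs = reversal_dependencies(3)  # [(0, 5), (1, 4), (2, 3)]
all_cross = all(crosses(p, q) for p, q in combinations(copy_arcs, 2))
all_nest = all(nests(p, q) for p, q in combinations(mirror_arcs, 2))
```

Under these definitions every pair of arcs in the copy case crosses and every pair in the reversal case nests, which is the structural difference the paragraph above describes.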
Given that most phonological and morphological patterns are regular, how can one fit in reduplicated strings without including reversals? Gazdar and Pullum (1985, 278) made the remark that "We do not know whether there exists an independent characterization of the class of languages that includes the regular sets and languages derivable from them through reduplication, or what the time complexity of that class might be, but it currently looks as if this class might be relevant to the characterization of NL word-sets."
1 Total reduplication does not immediately guarantee unboundedness. When the set of bases is finite, i.e., $\{ww \mid w \in L\}$ with $L$ finite, total reduplication can be squeezed into languages described by 1-way finite state machines (Chandlee, 2017), though doing so eventually leads to state explosion (Roark and Sproat, 2007; Dolatian and Heinz, 2020). Computationally, only total reduplication with an infinite number of potential reduplicants is true unbounded copying. With careful treatment, unbounded copying, externalizing a primitive copying operation, can be justified as a model of reduplication in natural languages. More in-depth discussion of 1) bounded versus unbounded copying and 2) copying as a primitive operation can be found in Clark and Yoshinaka (2014), Chandlee (2017), and Dolatian and Heinz (2020).
Motivated by Gazdar and Pullum (1985), this article aims to give a formal characterization of regular with copying languages. Specifically, it examines what minimal changes can be brought to regular languages to include stringsets with two adjacent copies, while excluding some typologically unattested context-free patterns, such as reversals, shown in Fig. 2. One possible way to probe such a language class is by adding copying to the set of operations whose closure defines regular languages. Instead, the approach we take in this paper is to add reduplication to finite state automata (FSAs), which compute regular languages.
Various attempts have followed this vein.² One example is the finite state registered machines (FSRAs) of Cohen-Sygal and Wintner (2006), which have finitely many registers as memory and are therefore limited to modeling bounded copying. The state-of-the-art finite state machinery that computes unbounded copying elegantly and adequately is 2-way finite state transducers (2-way FSTs), which capture reduplication as a string-to-string mapping ($w \to ww$) (Dolatian and Heinz, 2018a,b, 2019, 2020). To avoid the mirror image function ($w \to ww^R$), Dolatian and Heinz (2020) further developed subclasses of 2-way FSTs which cannot output anything during right-to-left passes over the input (cf. rotating transducers: Baschenis et al., 2017).
It should be noted that the issue addressed by 2-way FSTs is a different one: reduplication is modeled as a function ($w \to ww$), while this paper focuses on a set of languages containing identical substrings ($ww$). The stringset question is non-trivial and well-motivated, both formally and theoretically. First, since the studied 2-way FSTs are not readily invertible, how to get the inverse relation $ww \to w$ remains an open question, as acknowledged in Dolatian and Heinz (2020). Although this paper does not directly address this morphological analysis problem, recognizing which strings are reduplicated and belong to $L_{ww}$ or any other copying language may be an important first step.³ As for the theoretical aspects, there are some attested forms of meaning-free reduplication in natural languages. Zuraw (2002) proposes aggressive reduplication in phonology: speakers are sensitive to phonological similarity between substrings within words, and reduplication-like structures are attributed to those words. It is still arguable whether those meaning-free reduplicative patterns of unbounded strings are generated via a morphological function or not. Overall, it is desirable to have models that help to detect substring identity within surface strings when those sub-strings are in the regular set.
3 Thanks to the reviewer for bringing this point up.
FSBMs are two-taped finite state automata, sensitive to copying activities within strings, hence able to detect identity between sub-strings. This paper is organized as follows: Section 2 provides a definition of FSBMs with examples. Then, to better understand the copying mechanism, complete-path FSBMs, which recognize exactly the same set of languages as general FSBMs, are highlighted. Section 3 examines the computational and mathematical properties of the set of languages recognized by complete-path FSBMs. Section 4 concludes with discussion and directions for future research.
2 Finite State Buffered Machine

Definitions
FSBMs are two-taped automata with finite-state core control. One tape stores the input, as in normal FSAs; the other serves as an unbounded memory buffer, storing reduplicants temporarily for future identity checking. Intuitively, FSBMs are an extension of FSRAs equipped with unbounded memory. In theory, an FSBM with a bounded buffer would be as expressive as an FSRA and could therefore be converted to an FSA. The buffer interacts with the input in restricted ways: 1) the buffer is queue-like; 2) the buffer needs to work on the same alphabet as the input, unlike the stack in a pushdown automaton (PDA), for example; 3) once one symbol is removed from the buffer, everything else must also be wiped off before the buffer is available for other symbol addition. These restrictions together ensure the machine does not generate string reversals or other non-reduplicative non-regular patterns.
There are three possible modes for an FSBM $M$ when processing an input: 1) in normal (N) mode, $M$ reads symbols and transits between states, functioning as a normal FSA; 2) in buffering (B) mode, besides consuming symbols from the input and taking transitions among states, it adds a copy of the just-read symbols to the queue-like buffer, until it exits buffering (B) mode; 3) after exiting buffering (B) mode, $M$ enters emptying (E) mode, in which $M$ matches the stored symbols in the buffer against input symbols. When all buffered symbols have been matched, $M$ switches back to normal (N) mode for another round of computation. Under the current augmentation, FSBMs can only capture local reduplication with two adjacent, completely identical copies. They cannot handle non-local reduplication or multiple reduplication.
Specifying G and H states allows an FSBM to control what portions of a string are copied. To avoid complications, G and H are defined to be disjoint. In addition, states in H identify certain special transitions. Transitions between two H states check input-memory identity and consume symbols in both the input and the buffer. By contrast, transitions with at least one state not in H can be viewed as normal FSA transitions. In all, there are effectively two types of transitions in $\delta$.

Definition 2. A configuration of an FSBM is a tuple $(u, q, v, t)$, where $u$ is the input string, $v$ is the string in the buffer, $q$ is the current state, and $t$ is the current mode the machine is in.
Definition 3. Given an FSBM $M$ and $x \in (\Sigma \cup \{\varepsilon\})$, $u, w, v \in \Sigma^*$, we define that a configuration $D_1$ yields a configuration $D_2$ in $M$ ($D_1 \vdash_M D_2$) as the smallest relation such that:⁴
• For every transition $(q_1, x, q_2)$ with at least one state of

Examples
In all illustrations, G states are drawn with diamonds and H states are drawn with squares. The special transitions between H states are dashed. $L_{ww}$ is the simplest representation of unbounded copying, but this language is somewhat structurally dull. For the rest of the illustration, we focus on the FSBM $M_2$ in Figure 4. $M_2$ recognizes the non-context-free language $\{a^i b^j a^i b^j \mid i, j \geq 1\}$. This language can be viewed as total reduplication added to the regular language $\{a^i b^j \mid i, j \geq 1\}$ (recognized by the FSA $M_0$ in Figure 5).
State $q_1$ is an initial state and, more importantly, a G state, forcing $M_2$ to enter B mode before it takes any arcs and transits to other states. Then, $M_2$ in B mode always keeps a copy of consumed input symbols until it proceeds to $q_4$, an H state. State $q_4$ requires $M_2$ to stop buffering and switch to E mode in order to check for string identity. Using the special transitions between H states (in this case, the $a$ and $b$ loops on state $q_4$), $M_2$ checks whether the stored symbols in the buffer match the remaining input. If so, after emptying all symbols from the buffer, $M_2$ with a blank buffer can switch to N mode. It eventually ends at state $q_4$, a legal final state. Figure 6 gives a complete run of $M_2$ on the string "abbabb". Figure 7 shows that $M_2$ rejects the string "ababb", which is not totally reduplicated, since a final configuration cannot be reached.
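The mode changes described above can be sketched as a small nondeterministic simulator. This is a minimal sketch under stated assumptions, not the paper's formal definition: the dictionary encoding of a machine, the name `fsbm_accepts`, and the empty string as the ε label are conventions of this sketch, and the arcs of `M2` are a hypothetical reconstruction from the prose, since Figure 4 is not reproduced here. Acceptance is assumed to require exhausted input, a final state, and N mode.

```python
from collections import deque

EPS = ""  # label for epsilon arcs (a convention of this sketch)

def fsbm_accepts(m, word):
    """Breadth-first search over configurations (position, state, buffer, mode)."""
    word = tuple(word)
    start = [(0, q, (), "N") for q in m["initial"]]
    seen, queue = set(start), deque(start)
    while queue:
        i, q, buf, mode = queue.popleft()
        # Assumed acceptance: input consumed, final state, N mode, no pending G.
        if mode == "N" and i == len(word) and q in m["final"] and q not in m["G"]:
            return True
        successors = []
        if mode == "N" and q in m["G"]:
            successors.append((i, q, (), "B"))    # a G state forces buffering
        elif mode == "B" and q in m["H"]:
            successors.append((i, q, buf, "E"))   # an H state stops buffering
        elif mode == "E" and not buf:
            successors.append((i, q, (), "N"))    # buffer emptied: back to normal
        else:
            for p, x, r in m["arcs"]:
                if p != q:
                    continue
                reads = x != EPS
                if reads and (i == len(word) or word[i] != x):
                    continue
                j = i + 1 if reads else i
                special = p in m["H"] and r in m["H"]  # arc between two H states
                if mode == "E":
                    # Special arcs match the input symbol against the buffer front.
                    if special and (not reads or buf[0] == x):
                        successors.append((j, r, buf[1:] if reads else buf, "E"))
                elif not special:
                    if mode == "N":
                        successors.append((j, r, (), "N"))
                    elif r not in m["G"]:  # moving to a G state in B mode is ill-defined
                        successors.append((j, r, buf + ((x,) if reads else ()), "B"))
        for c in successors:
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return False

# Hypothetical arcs for a machine recognizing {a^i b^j a^i b^j | i, j >= 1}.
M2 = {
    "initial": {"q1"}, "final": {"q4"}, "G": {"q1"}, "H": {"q4"},
    "arcs": [("q1", "a", "q2"), ("q2", "a", "q2"), ("q2", "b", "q3"),
             ("q3", "b", "q3"), ("q3", EPS, "q4"),
             ("q4", "a", "q4"), ("q4", "b", "q4")],
}
```

With this encoding, `fsbm_accepts(M2, "abbabb")` succeeds and `fsbm_accepts(M2, "ababb")` fails, mirroring the two runs discussed above. The search terminates because the buffer is always a contiguous piece of the input, so only finitely many configurations exist.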
Example 3. Partial reduplication. Assuming $\Sigma = \{b, t, k, ng, l, i, a\}$, the FSBM $M_3$ in Figure 8 serves as a model of the two Agta CVC-reduplicated plurals in Table 1.
Given that the initial state $q_1$ is in G, $M_3$ has to enter B mode before it takes any transitions. In B mode, $M_3$ transits to a plain state $q_2$, consuming an input consonant and keeping it in the buffer. Similarly, $M_3$ transits to a plain state $q_3$ and then to $q_4$. When $M_3$ first reaches $q_4$, the buffer contains a CVC sequence. $q_4$, an H state, forces $M_3$ to stop buffering and enter E mode. Using the special transitions between H states (in this case, the loops on $q_4$), $M_3$ matches the CVC in the buffer with the remaining input. Then, $M_3$ with a blank buffer can switch to N mode at $q_4$. $M_3$ in N mode loses access to the loops on $q_4$, as they are available only in E mode. It transits to $q_5$ and processes the rest of the input using the normal loops on $q_5$. A successful run should end at $q_5$, the only final state. Figure 9 gives a complete run of $M_3$ on the string "taktakki".

Complete-path FSBMs
As shown in the definitions and the examples above, an FSBM is supposed to end in N mode to process an input. There are two possible scenarios for a run to meet this requirement: either never entering B mode or undergoing full cycles of N, B, E, N mode changes. The corresponding languages reflect either no copying (functioning as plain FSAs) or full copying, respectively.
In any specific run, it is the states that inform an FSBM $M$ of its modality. The first time $M$ reaches a G state, it has to enter B mode, and it keeps buffering while it transits between plain states. The first time it reaches an H state, $M$ is supposed to enter E mode and transit only between H states in E mode. Hence, to go through full cycles of mode changes, once $M$ reaches a G state and switches to B mode, it has to encounter some H state later to be put in E mode. To allow us to reason about only the "useful" arrangements of G and H states, we impose an ordering requirement on G and H states along a path in a machine and define a complete path.
Definition 5. A path $s$ from an initial state to a final state in a machine is said to be complete if
1. for each H state in $s$, there is always a preceding G state;
2. once a G state is in $s$, $s$ must contain at least one H state following that G state;
3. in between a G state and the first H state following it are only plain states.
Schematically, with P representing the non-G, non-H plain states and I, F representing initial and final states respectively, the regular expression denoting the state information in a path $s$ should be of the form $I(P^* G P^* H H^* P^* \mid P^*)^* F$.
No H states. When a G state does not have any reachable H state following it, there is no complete run, since $M$ always stays in B mode.
No H states in between two G states. When a G state $q_0$ has to transit to another G state $q_0'$ before any H state, $M$ cannot go to $q_0'$, for $M$ would enter B mode at $q_0$, while transiting to another G state in B mode is ill-defined.
H states first. When $M$ has to follow a path containing two consecutive H states before any G state, the run fails in the end, because the transitions between two H states can only be used in E mode. However, it is impossible to enter E mode without first entering B mode, which is enforced by some G state.
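The schematic regular expression for complete paths can be checked mechanically. The sketch below is only an illustration: it assumes a path is encoded as one letter per visited state ('G', 'H', or 'P' for plain), with 'I' and 'F' added as boundary markers, and the function name `is_complete` is hypothetical.

```python
import re

# The schematic form I(P*GP*HH*P*|P*)*F from the text, as a Python pattern.
COMPLETE_PATH = re.compile(r"I(P*GP*HH*P*|P*)*F")

def is_complete(state_types):
    """Check a path given as a string over {'G', 'H', 'P'}, one letter per state."""
    return COMPLETE_PATH.fullmatch("I" + state_types + "F") is not None

examples = {
    "GPPHHH": is_complete("GPPHHH"),  # buffer, then empty: complete
    "PPP": is_complete("PPP"),        # no copying at all: complete
    "GPP": is_complete("GPP"),        # no H state after G: incomplete
    "GPGH": is_complete("GPGH"),      # second G before any H: incomplete
    "PHH": is_complete("PHH"),        # H states first: incomplete
}
```

The last three entries correspond to the three incomplete arrangements listed above; each fails to match the pattern.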
Figure 9 (fragment): final steps of the run of $M_3$ on "taktakki".
Step | Used Arc | State Info | Configuration
 | $(q_4, k, q_4)$ | $q_4 \in H$ | $(ki, q_4, \varepsilon, E)$
Normal triggered by $q_4$ and empty buffer
10. | N/A | | $(ki, q_4, \varepsilon, N)$
11. | $(q_4, \varepsilon, q_5)$ | | $(ki, q_5, \varepsilon, N)$
12. | $(q_5, k, q_5)$ | | $(i, q_5, \varepsilon, N)$
13. | $(q_5, i, q_5)$ | $q_5 \in F$ | $(\varepsilon, q_5, \varepsilon, N)$
It should be emphasized that $M$ in N mode can pass through one (and only one) H state to another plain state. For instance, the language of the FSBM $M_4$ in Figure 10 is equivalent to the language recognized by the FSA in Figure 11. $M_4$ remains an incomplete FSBM because it does not have any G state preceding the H states $q_2$ and $q_4$.
The languages recognized by complete-path FSBMs are precisely the languages recognized by general FSBMs. One key observation is that the language recognized by a machine is the union of the languages along all of its possible paths. The validity of the statement then builds on the different incomplete arrangements of G and H states along a path: they either recognize the empty-set language or show equivalence to finite state machines. Therefore, the language along an incomplete path of the machine is still in the regular set. Only a complete path containing at least one well-arranged G . . . HH* sequence uses the copying power and extends the regular languages. Therefore, in the next section, we focus on complete-path FSBMs.

Some closure properties of FSBMs
In this section, we show some closure properties of complete-path FSBM-recognizable languages and their linguistic relevance. Section 3.1 discusses their closure under intersection with regular languages; Section 3.2 shows that they are closed under homomorphism; Section 3.3 briefly mentions union, concatenation, and Kleene star. These operations are of special interest because they are the regular operations defining regular expressions (Sipser, 2013, 64). That complete-path FSBMs are closed under regular operations leads to the conjecture that the set of languages recognized by the new automata is equivalent to the set of languages denoted by a version of regular expressions with copying added.
Noticeably, given FSBMs are FSAs with a copying mechanism, the proof ideas in this section are similar to the corresponding proofs for FSAs, which can be found in Hopcroft and Ullman (1979) and Sipser (2013).

Intersection with FSAs
Theorem 1. If $L_1$ is a complete-path FSBM-recognizable language and $L_2$ is a regular language, then $L_1 \cap L_2$ is a complete-path FSBM-recognizable language.
In essence, FSAs can be viewed as FSBMs without copying: an FSA can be converted to an FSBM with an empty G set, an empty H set, and trivially no special transitions between H states.
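One way to picture Theorem 1 is the usual product construction, adapted so that G and H membership is inherited from the FSBM component. The paper's own proof is not shown in this section, so the sketch below is an assumption modeled on the standard FSA argument. The simulator, the machine encoding, the arcs of `M2` (Figure 4 is not reproduced here), and the DFA `MOD4` are all hypothetical illustrations.

```python
from collections import deque

EPS = ""  # label for epsilon arcs (a convention of this sketch)

def fsbm_accepts(m, word):
    """Search configurations (position, state, buffer, mode) of an FSBM."""
    word = tuple(word)
    start = [(0, q, (), "N") for q in m["initial"]]
    seen, queue = set(start), deque(start)
    while queue:
        i, q, buf, mode = queue.popleft()
        if mode == "N" and i == len(word) and q in m["final"] and q not in m["G"]:
            return True
        nxt = []
        if mode == "N" and q in m["G"]:
            nxt.append((i, q, (), "B"))
        elif mode == "B" and q in m["H"]:
            nxt.append((i, q, buf, "E"))
        elif mode == "E" and not buf:
            nxt.append((i, q, (), "N"))
        else:
            for p, x, r in m["arcs"]:
                reads = x != EPS
                if p != q or (reads and (i == len(word) or word[i] != x)):
                    continue
                j = i + 1 if reads else i
                special = p in m["H"] and r in m["H"]
                if mode == "E":
                    if special and (not reads or buf[0] == x):
                        nxt.append((j, r, buf[1:] if reads else buf, "E"))
                elif not special:
                    if mode == "N":
                        nxt.append((j, r, (), "N"))
                    elif r not in m["G"]:
                        nxt.append((j, r, buf + ((x,) if reads else ()), "B"))
        for c in nxt:
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return False

def intersect_with_dfa(m, dfa):
    """Product machine: states are pairs, G/H status comes from the FSBM state."""
    arcs = []
    for q, x, r in m["arcs"]:
        for p in dfa["states"]:
            p2 = p if x == EPS else dfa["delta"][(p, x)]  # DFA waits on epsilon
            arcs.append(((q, p), x, (r, p2)))
    return {
        "initial": {(q, dfa["start"]) for q in m["initial"]},
        "final": {(q, p) for q in m["final"] for p in dfa["final"]},
        "G": {(q, p) for q in m["G"] for p in dfa["states"]},
        "H": {(q, p) for q in m["H"] for p in dfa["states"]},
        "arcs": arcs,
    }

# Hypothetical FSBM for {a^i b^j a^i b^j | i, j >= 1}.
M2 = {
    "initial": {"q1"}, "final": {"q4"}, "G": {"q1"}, "H": {"q4"},
    "arcs": [("q1", "a", "q2"), ("q2", "a", "q2"), ("q2", "b", "q3"),
             ("q3", "b", "q3"), ("q3", EPS, "q4"),
             ("q4", "a", "q4"), ("q4", "b", "q4")],
}
# Hypothetical complete DFA: the number of a's is divisible by 4.
MOD4 = {
    "states": {0, 1, 2, 3}, "start": 0, "final": {0},
    "delta": {(p, s): ((p + 1) % 4 if s == "a" else p)
              for p in range(4) for s in "ab"},
}
PRODUCT = intersect_with_dfa(M2, MOD4)
```

Here `PRODUCT` accepts "aabaab" (four a's) but rejects "abab" (two a's), even though `M2` alone accepts both. Because the G and H sets are copied from the FSBM component, a complete path in `M2` stays complete in the product.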
That FSBM-recognizable languages are closed under intersection with regular languages is of great relevance to phonological theory. Assume a natural language X imposes backness vowel harmony, which can be modeled by an FSA $M_{VH}$. In addition, this language also requires phonological strings of certain forms to be reduplicated, which can be modeled by an FSBM $M_{RED}$. One can then construct another FSBM $M_{RED+VH}$ to enforce both backness vowel harmony and the total identity of sub-strings in those forms. Not limited to harmony systems, phonotactics other than identity of sub-strings are regular (Heinz, 2018), indicating that almost all phonological markedness constraints can be modeled by FSAs. When FSBMs intersect with FSAs computing those phonotactic restrictions, the resulting formalism is still an FSBM and not a grammar with higher computational power. Thus, FSBMs can model natural language phonotactics, including the recognition of surface sub-string identity.
Figure 12: An FSBM $M_5$ on the alphabet $\{C, V\}$ such that $L(M_5) = h(L(M_3))$ with $M_3$ in Figure 8
Homomorphism and inverse alphabetic homomorphism
Definition 7. A (string) homomorphism is a function mapping one alphabet to strings of another alphabet, written $h : \Sigma \to \Delta^*$. We can extend $h$ to operate on strings over $\Sigma^*$ such that 1) $h(\varepsilon_\Sigma) = \varepsilon_\Delta$; 2) $\forall a \in \Sigma$, $h(a) \in \Delta^*$; 3) for $w = a_1 a_2 \ldots a_n \in \Sigma^*$, $h(w) = h(a_1) h(a_2) \ldots h(a_n)$ where each $a_i \in \Sigma$. An alphabetic homomorphism $h_0$ is a special homomorphism with $h_0 : \Sigma \to \Delta$.
Definition 8. Given a homomorphism h:
Theorem 2. The set of complete-path FSBM-recognizable languages is closed under homomorphisms.
Theorem 2 can be proved by constructing a new machine $M_h$ based on $M$. The informal intuition goes as follows: relabel the old arcs with the mapped strings and add states to split the arcs so that there is only one symbol or $\varepsilon$ on each arc in $M_h$. When there are multiple symbols on normal arcs, the newly added states can only be plain non-G, non-H states. For multiple symbols on the special arcs between two H states, the newly added states must be H states. Again, under this construction, complete paths in $M$ lead to newly constructed complete paths in $M_h$.
The fact that complete-path FSBMs guarantee closure under homomorphism allows theorists to perform analyses at certain levels of abstraction over symbol representations. Consider two alphabets $\Sigma = \{b, t, k, ng, l, i, a\}$ and $\Delta = \{C, V\}$ with a homomorphism $h$ mapping every consonant ($b, t, k, ng, l$) to $C$ and every vowel ($i, a$) to $V$. As illustrated by $M_3$ on alphabet $\Sigma$ (Figure 8) and $M_5$ on alphabet $\Delta$ (Figure 12), an FSBM-definable pattern on $\Sigma$ is mapped to another FSBM-definable pattern on $\Delta$.
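As a small illustration of this homomorphism on strings, the sketch below maps the Agta form "taktakki" to its CV skeleton. Words are represented as lists of symbols so that the digraph "ng" counts as a single symbol; the function names are illustrative and not taken from the paper.

```python
CONSONANTS = {"b", "t", "k", "ng", "l"}
VOWELS = {"i", "a"}

def h(symbol):
    """Alphabetic homomorphism from the segment alphabet to {C, V}."""
    if symbol in CONSONANTS:
        return "C"
    if symbol in VOWELS:
        return "V"
    raise ValueError(f"symbol not in the alphabet: {symbol!r}")

def extend(hom, word):
    """h(a_1 ... a_n) = h(a_1) ... h(a_n); the empty word maps to itself."""
    return "".join(hom(a) for a in word)

taktakki = ["t", "a", "k", "t", "a", "k", "k", "i"]
skeleton = extend(h, taktakki)  # "CVCCVCCV"
```

The image keeps the copied CVC prefix visible as "CVC" followed by "CVC", which is the abstraction level at which a machine over $\{C, V\}$ operates.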
We conjecture that the set of languages recognized by complete-path FSBMs is not closed under inverse alphabetic homomorphisms and thus not under inverse homomorphisms. Consider the complete-path FSBM-recognizable language $L = \{a^i b^j a^i b^j \mid i, j \geq 1\}$ (cf. Figure 4). Consider an alphabetic homomorphism $h : \{0, 1, 2\} \to \{a, b\}^*$ such that $h(0) = a$, $h(1) = a$ and $h(2) = b$. Then, $h^{-1}(L) = \{(0|1)^i 2^j (0|1)^i 2^j \mid i, j \geq 1\}$ seems to be challenging for FSBMs. Finite state machines cannot handle the incurred crossing dependencies, while the augmented copying mechanism only contributes to recognizing identical copies, but not general cases of symbol correspondence.⁵
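To see concretely why the inverse image is awkward for a device that only matches identical copies, the following sketch enumerates the preimages of "abab" under this $h$. The helper names are hypothetical; the point is only that some preimages are not of the form $ww$ even though their images are.

```python
from itertools import product

H_MAP = {"0": "a", "1": "a", "2": "b"}

def h(word):
    """Apply the alphabetic homomorphism symbol by symbol."""
    return "".join(H_MAP[c] for c in word)

def in_L(s):
    """Membership in {a^i b^j a^i b^j | i, j >= 1}."""
    if len(s) == 0 or len(s) % 2:
        return False
    w = s[: len(s) // 2]
    i = len(w) - len(w.lstrip("a"))  # number of leading a's in w
    j = len(w) - i
    return i >= 1 and j >= 1 and w == "a" * i + "b" * j and s == w + w

def preimages(s):
    """All strings over {0, 1, 2} that h maps to s."""
    choices = [[c for c, v in H_MAP.items() if v == x] for x in s]
    return ["".join(p) for p in product(*choices)]

pre = preimages("abab")
non_copies = [u for u in pre if u[: len(u) // 2] != u[len(u) // 2:]]
```

Here `pre` contains four strings, and two of them ("0212" and "1202") are in $h^{-1}(L)$ without consisting of two identical halves, so identity checking against a buffer does not directly apply to them.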

Other closure properties
Union. Assume there are complete-path FSBMs $M_1$ and $M_2$ such that $L(M_1) = L_1$ and $L(M_2) = L_2$; then $L_1 \cup L_2$ is a complete-path FSBM-recognizable language. One can construct a new machine $M$ that accepts an input $w$ if either $M_1$ or $M_2$ accepts $w$. The construction of $M$ keeps $M_1$ and $M_2$ unchanged, but adds a new plain state $q_0$. Now, $q_0$ becomes the only initial state, branching into the previous initial states of $M_1$ and $M_2$ with $\varepsilon$-arcs. In this way, the new machine nondeterministically guesses whether $M_1$ or $M_2$ accepts the input. If one of them accepts $w$, $M$ will accept $w$, too.
Concatenation. Assume there are complete-path FSBMs $M_1$ and $M_2$ such that $L(M_1) = L_1$ and $L(M_2) = L_2$; then there is a complete-path FSBM $M$ that recognizes $L_1 \cdot L_2$ by normal concatenation of the two automata. The new machine adds a new plain state $q_0$ and makes $q_0$ the only initial state, branching into the previous initial states of $M_1$ with $\varepsilon$-arcs. The final states of $M_2$ are the only final states of $M$. In addition, the new machine adds $\varepsilon$-arcs from every old final state of $M_1$ to every initial state of $M_2$. A path in the resulting machine is guaranteed to be complete because it is essentially the concatenation of two complete paths.
Kleene Star. Assume there is a complete-path FSBM $M_1$ such that $L(M_1) = L_1$; then $L_1^*$ is a complete-path FSBM-recognizable language. The new automaton $M$ is similar to $M_1$ but has a new initial state $q_0$. $q_0$ is also a final state, branching into the old initial states of $M_1$. In this way, $M$ accepts the empty string $\varepsilon$. $q_0$ is neither a G state nor an H state. Moreover, to make sure $M$ can jump back to an initial state after it hits a final state, $\varepsilon$-arcs from every final state to every old initial state are added.

Discussion and conclusion
In summary, this paper provides a new computational device to compute unrestricted total reduplication on any regular language, including the simplest copying language $L_{ww}$, where $w$ can be any arbitrary string over an alphabet. As a result, it introduces a new class of languages incomparable to CFLs. This class of languages allows unbounded copying without generating non-reduplicative non-regular patterns: we hypothesize that context-free string reversals are excluded since the buffer is queue-like. Meanwhile, the MCS Swiss-German cross-serial dependencies, abstracted as $\{a^i b^j c^i d^j \mid i, j \geq 1\}$, are also excluded, since the buffer works on the same alphabet as the input tape and only matches identical sub-strings.
Following the sub-classes of 2-way FSTs in Dolatian and Heinz (2018a,b, 2019, 2020), which successfully capture unbounded copying as functions while excluding the mirror image mapping, complete-path FSBMs successfully capture the totally reduplicated stringsets while excluding string reversals. A comparison between the languages characterized in this paper and the images of the functions in Dolatian and Heinz (2020) should be carried out to build the connection. Moreover, one natural next step is to extend FSBMs as acceptors to finite state buffered transducers (FSBTs). Our intuition is that FSBTs would be helpful in handling the morphological analysis question ($ww \to w$), a not-yet-solved problem for the 2-way FSTs that Dolatian and Heinz (2020) study. After reading the first $w$ in the input and buffering this chunk of the string in memory, the transducer can output $\varepsilon$ for each matched symbol when transiting among H states.
Another potential area of research is applying this new machinery to Primitive Optimality Theory (Eisner, 1997; Albro, 1998). Albro (2000, 2005) used weighted finite state machines to model constraints while representing the set of candidates by Multiple Context Free Grammars to enforce base-reduplicant correspondence (McCarthy and Prince, 1995). Parallel to Albro's approach, given that complete-path FSBMs are intersectable with FSAs, it is possible to computationally implement the reduplicative identity requirement with complete-path FSBMs without using the full power of mildly context-sensitive formalisms. To achieve this goal, future work should consider developing an efficient algorithm that intersects complete-path FSBMs with weighted FSAs.
The present paper is a first step toward recognizing reduplicated forms with models and techniques that are adequate yet more restrictive than MCS formalisms. The current approach has some limitations with respect to the whole typology of reduplication. Complete-path FSBMs can only capture local reduplication with two adjacent identical copies. As for non-local reduplication, the modification should be straightforward: the machines need to allow a filled buffer in N mode (or in another newly defined memory-holding mode) and match strings only when needed. As for multiple reduplication, complete-path FSBMs can easily be modified to include multiple copies of the same base form ($\{w^n \mid w \in \Sigma^*, n \in \mathbb{N}\}$) but cannot easily be modified to recognize the non-semilinear language containing copies of the copy ($\{w^{2^n} \mid w \in \Sigma^*, n \in \mathbb{N}\}$). The computational nature of multiple reduplication remains an open question. Last but not least, as a reviewer points out, recognizing non-identical copies can be achieved by storing or emptying not exactly the same input symbols, but mapped symbols according to some function $f$. Under this modification, the new automata would recognize $\{a^n b^n \mid n \in \mathbb{N}\}$ with $f(a) = b$ but still exclude string reversals. In all, detailed investigations on how to modify complete-path FSBMs should be the next step to complete the typology.