NEW FRONTIERS BEYOND CONTEXT-FREENESS: DI-GRAMMARS AND DI-AUTOMATA

Peter Staudacher
Institut für Allgemeine und Indogermanische Sprachwissenschaft, Universität Regensburg, Postfach 397, 8400 Regensburg 1, Germany

Abstract

A new class of formal languages will be defined, the Distributed Index languages (DI-languages). The grammar formalism generating the new class, the DI-grammars, covers unbounded dependencies in a rather natural way. The place of the DI-languages in the Chomsky hierarchy will be determined: like Aho's indexed languages, the DI-languages represent a proper subclass of Type 1 (the context-sensitive languages) and properly include Type 2 (the context-free languages), but the DI-class is neither a subclass nor a superclass of Aho's indexed class. It will be shown that, apart from DI-grammars, DI-languages can equivalently be characterized by a special type of automata, the DI-automata. Finally, the time complexity of the recognition problem for an interesting subclass of DI-grammars will approximately be determined.

1 Introduction

It is common practice to parse nested Wh-dependencies, like the classical example of Rizzi (1982) in (1),

(1) Tuo fratello, [a cui]1 mi domando [che storie]2 abbiano raccontato t2 t1, era molto preoccupato
("Your brother, [to whom]1 I wonder [which stories]2 they told t2 t1, was very troubled")

using a stack mechanism. Under the binary branching hypothesis, the relevant structure of (1), augmented by wh-stacks, is as in (2):

(2) [a cui]1 mi domando (push [t1]) [che storie]2 abbiano (push: [t2,t1]) V2[t2,t1], where V2[t2,t1] branches into V1[t2] and PP[t1], V1[t2] branches into V[] (raccontato) and NP[t2] (pop t2), and PP[t1] dominates t1 (pop t1).

Up to now it has been unclear how far beyond context-freeness the generative power of a Type 2 grammar formalism is extended if such a stack mechanism is grafted onto it (assuming, of course, that an upper bound for the size of the stack cannot be motivated). Fernando Pereira's concept of Extraposition Grammar (XG), introduced in his influential paper (Pereira, 1981; 1983; cf. Stabler, 1987) in order to delimit the new territory, can be shown to be inadequate for this purpose, since it is provable that the class of languages generable by XGs coincides with Type 0 (i.e. XGs have the power of Turing machines), whereas the increase in power brought by the stack mechanism is not even enough to generate all Type 1 languages (see below).

(2) illustrates an additional point: the stack [t2,t1] belonging to V2 has to be divided into the substacks [t2] and [t1], which are then inherited by the daughters V1 and PP, for the PP-index t1 is not discharged from the top of the V2-stack [t2,t1]. Generalizing to stacks of unlimited size, the partition of a stack among the inheriting subconstituents K1 and K2 of a constituent K0 is as in (3):

(3) K0[t1,...,tj,tj+1,...,tk] branches into K1[t1,...,tj] and K2[tj+1,...,tk].

If the generalization in (3) is tenable, the extension of context-free grammars discussed by Gazdar in (Gazdar, 1988) (Vijay-Shanker and Weir, 1991, call the resulting formalism "linear indexed grammar" (LIG)), in which stacks are exclusively passed to a single daughter, as in (3.1), is too weak.

(3.1) a) K0[t1,...,tk] branches into K1[t1,...,tk] and K2[];  b) K0[t1,...,tk] branches into K1[] and K2[t1,...,tk].

Stack transmission by distribution, however, as in (3), suggests the definition of a new class of grammars properly containing the context-free class.
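To make the contrast between (3) and (3.1) concrete, here is a minimal Python sketch (the function names are illustrative, not from the paper): it enumerates the possible distributions of a mother's index stack over two daughters, and, for comparison, the single-daughter transmission of (3.1).

```python
# Minimal sketch (hypothetical names): distributing a mother's index stack
# [t1, ..., tk] over daughters K1 and K2, as in (3), versus passing the whole
# stack to a single daughter, as in the LIG-style schema (3.1).

def distribute(stack):
    """Yield every split of the mother's stack between two daughters."""
    for j in range(len(stack) + 1):
        yield stack[:j], stack[j:]          # K1 gets t1..tj, K2 gets tj+1..tk

def single_daughter(stack):
    """LIG-style transmission: one daughter inherits the complete stack."""
    return stack, []                        # K1 gets everything, K2 nothing

if __name__ == "__main__":
    mother = ["t1", "t2"]
    print(list(distribute(mother)))
    # [([], ['t1', 't2']), (['t1'], ['t2']), (['t1', 't2'], [])]
    print(single_daughter(mother))          # (['t1', 't2'], [])
```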
2 DI-Grammars and DI-Languages

A DI-grammar is a 5-tuple G = (N,T,F,P,S), where N, T, S are as usual, F is an alphabet of indices, and P is a set of rules of the following forms:

1) (a) A → α   (b) A → αBfβ   (c) Af → α   (A, B ∈ N; α, β ∈ (N∪T)*; f ∈ F)

The relation "=>" or "directly derives" is defined as follows:

2) α => β if either
 i) α = δ A index γ, where δ, γ ∈ (NF* ∪ T)*, index ∈ F*, A ∈ N, and A → B1B2...Bn is a rule of form 1)(a), and β = δ B1 index1 B2 index2 ... Bn indexn γ; or
 ii) α = δ A index γ, where δ, γ ∈ (NF* ∪ T)*, index ∈ F*, A ∈ N, and A → B1...Bk f...Bn is a rule of form 1)(b) with f ∈ F, and β = δ B1 index1 ... Bk f indexk ... Bn indexn γ; or
 iii) α = δ A f index γ, where δ, γ ∈ (NF* ∪ T)*, index ∈ F*, A ∈ N, and Af → B1B2...Bn is a rule of form 1)(c) with f ∈ F, and β = δ B1 index1 B2 index2 ... Bn indexn γ;
(*) where in each case index = index1 index2 ... indexn, and indexi = ε (i.e. the empty word) for Bi ∈ T.

The reflexive and transitive closure *=> of => is defined as usual. Replacing (*) by "indexi = index for Bi ∈ N, indexi = ε for Bi ∈ T" changes the above definition into a definition of Aho's well-known indexed grammars. How index percolation differs in indexed and DI-grammars is illustrated in (4):

(4) Index percolation (i) in Aho's indexed grammars: M f1f2f3f4 branches into L f1f2f3f4 and R f1f2f3f4 (index multiplication); (ii) in DI-grammars: M f1f2f3f4 branches into L f1f2 and R f3f4 (index distribution).

The region in the Chomsky hierarchy occupied by the class of DI-languages is indicated in (5):

(5) (diagram) Within Type 1 (context-sensitive), Aho's indexed languages and the DI-languages form two mutually incomparable classes whose intersection properly includes Type 2 (context-free); the diagram locates L2 among the indexed languages that are not DI-languages, L3 among the DI-languages that are not indexed, and L4 inside Type 1 but outside both classes,

where
(5.1) L1 = {a^n b^n c^n ; n ≥ 1}
(5.2) L2 = {a^k ; k = 2^n, 0 < n}
(5.3) L3 = {w1 w2 ... wn z1 wn z2 w(n-1) ... zn w1 z(n+1) m(wn) m(w(n-1)) ... m(w2) m(w1) ; n ≥ 1 & wi ∈ {a,b}+ (1 ≤ i ≤ n) & z1 z2 ... zn z(n+1) ∈ D1},
 where m(y) is the mirror image of y and D1 is the Dyck language generated by the following CFG Gk (D1 = L(Gk)): Gk = ({S}, {[,]}, Rk, S) with Rk = {S → [S], S → SS, S → ε};
(5.4) L4 = {a^k ; k = n^n, n ≥ 1} (L4 is not an indexed language, see Takeshi Hayashi (1973)).

By definition (see above), the intersection of the class of indexed languages and the class of DI-languages includes the context-free (cfr.) languages. The inclusion is proper, since the (non-cfr.) language L1 is generated by G1 = ({S,A,B}, {a,b,c}, {f,g}, R1, S), where R1 = {S → aAfc, A → aAgc, A → B, Bg → bB, Bf → b}, and G1 obviously is both a DI-grammar and an indexed grammar. Like cfr. languages and unlike indexed languages, DI-languages have the constant growth property (i.e. for every DI-grammar G there exists a k ∈ N such that for every w ∈ L(G) with |w| > k there exists a sequence w1 (= w), w2, w3, ... (wi ∈ L(G)) such that |wn| < |w(n+1)| < (n+1)·|w| for every member wn of the sequence). Hence L2, and a fortiori L4, is not a DI-language. But L2 is an indexed language, since it is generated by the indexed grammar G2 = ({S,A,D}, {a}, {f,g}, R2, S), where R2 = {S → Ag, A → Af, A → D, Df → DD, Dg → a}. L3 is a DI-language, since it is generated by the DI-grammar G3 = ({S,M,Z}, {a,b,[,]}, {f,g}, R3, S), where R3 = {S → aSfa, S → bSgb, S → M, M → [M], M → MM, M → Z, Zf → Za, Zf → a, Zg → Zb, Zg → b}. E.g. abb[b[ab]]bba (∈ L3) is derived as follows:

S => aSfa => abSgfba => abbSggfbba => abbMggfbba => abb[Mggf]bba => abb[MgMgf]bba (here the index "ggf" has been distributed) => abb[ZgMgf]bba => abb[bMgf]bba => abb[b[Mgf]]bba => abb[b[Zgf]]bba => abb[b[Zfb]]bba => abb[b[ab]]bba.

2.1 DI-Grammars and Indexed Grammars

Considering the well-known generative strength of indexed grammars, it is by no means obvious that L3 is not an indexed language.
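As an illustration of the derives relation 2), the following Python sketch (the helpers rewrite and derive_anbncn are hypothetical names, not from the paper) replays a leftmost derivation of a^n b^n c^n in G1, representing a sentential form as a list of (symbol, index string) pairs. Since every rule of G1 has at most one nonterminal on its right-hand side, the index distribution step is trivial here; rules with several nonterminal daughters would require enumerating splits as in the earlier sketch.

```python
# Sketch (hypothetical names) of DI-derivation steps for the grammar
# G1 = ({S,A,B}, {a,b,c}, {f,g}, R1, S) above.  Terminals always carry the
# empty index; the single nonterminal daughter inherits the remaining stack.

def derive_anbncn(n):
    """Leftmost derivation of a^n b^n c^n in G1 (n >= 1), tracking index stacks."""
    assert n >= 1
    form = [("S", "")]                       # start symbol with empty index

    def rewrite(lhs_sym, lhs_top, rhs):
        """Rewrite the leftmost occurrence of lhs_sym; rhs lists (symbol, pushed index)."""
        for i, (sym, idx) in enumerate(form):
            if sym == lhs_sym:
                assert idx.startswith(lhs_top)
                rest = idx[len(lhs_top):]    # index material below the consumed top
                new = [(s, p + rest) if s.isupper() else (s, "") for s, p in rhs]
                return form[:i] + new + form[i + 1:]
        raise ValueError("no %s in sentential form" % lhs_sym)

    form = rewrite("S", "", [("a", ""), ("A", "f"), ("c", "")])      # S  -> aAfc
    for _ in range(n - 1):
        form = rewrite("A", "", [("a", ""), ("A", "g"), ("c", "")])  # A  -> aAgc
    form = rewrite("A", "", [("B", "")])                             # A  -> B
    for _ in range(n - 1):
        form = rewrite("B", "g", [("b", ""), ("B", "")])             # Bg -> bB
    form = rewrite("B", "f", [("b", "")])                            # Bf -> b
    return "".join(sym for sym, _ in form)

print(derive_anbncn(3))    # prints aaabbbccc
```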
In view of the complexity of the proof that L3 is not indexed, only some important points can be indicated here, referring to the three main parts of every word x ∈ L3 as x_l, [x_m] and x_r, as illustrated in the example (6):

(6) x = ababbabbbabbbb[[abbbb[[abbb]abb]]ab]bbbbabbbabbaba, where x_l = w1w2w3w4 = ababbabbbabbbb, [x_m] = [[abbbb[[abbb]abb]]ab], x_r = m(w4)m(w3)m(w2)m(w1) = bbbbabbbabbaba, with w1 = ab, w2 = abb, w3 = abbb, w4 = abbbb.

Assume that there is an indexed grammar GI = (N,T,F,P,S) such that L3 = L(GI).

1. Since GI cannot be context-free, it follows from the intercalation (or "pumping") lemma for indexed grammars proved by Takeshi Hayashi in (Hayashi, 1973) that there exists for GI an integer k such that for any x ∈ L3 with |x| > k a derivation can be found with the following properties:

S =*=> z A fη z' =*=> z s1 A fμfη s1' z' =*=> z s1 r1 A fμfη r1' s1' z' =*=> z s1 r1 B fμfη r1' s1' z' =*=> z s1 r1 t1 B fη t1' r1' s1' z' =*=> x,  (zz', r1r1' ∈ T*; s1 t1 t1' s1' ∈ T+; f ∈ F; μ, η ∈ F*)

By intercalating subderivations which can effectively be constructed, this derivation can be extended as follows:

S =*=> z A fη z' =*=> z s1 A fμfη s1' z' =*=> z s1...sn A (fμ)^n fη sn'...s1' z' =*=> z s1...sn rn B (fμ)^n fη rn' sn'...s1' z' =*=> z s1...sn rn tn...t1 B fη t1'...tn' rn' sn'...s1' z' =*=> z s1...sn rn tn...t1 w t1'...tn' rn' sn'...s1' z'

The interdependently extendible parts of x, namely s1...sn, tn...t1, t1'...tn', rn rn', and sn'...s1', cannot all be subwords of the central component [x_m] of x (or all be subwords of the peripheral components x_l x_r); otherwise [x_m] (or x_l x_r) could be increased and decreased independently of the peripheral components x_l and x_r (or of [x_m], respectively), contradicting the assumption that x ∈ L3. Rather, the structure of x necessitates that s1...sn and sn'...s1' be subwords of x_l x_r and that the "pumped" index (fμ)^n be discharged in deriving the central component [x_m]. Thus we know that for every l > 0 there exist an index μ ∈ F+, an x ∈ L3, and a subword [x_m'] of the central part [x_m] of x such that |[x_m']| > l and Mμ =*=> [x_m'] (M = B or the nonterminal of a descendant of A(fμ)^n fη). To simplify our exposition we write [x_m] instead of [x_m'] and have

(7) Mμ =*=> [x_m],

with the structure of x_l and x_r being encoded and stored in the index μ.

2. The balanced parentheses of [x_m] cannot be encoded in the index μ in (7) in such a manner that [x_m] is a homomorphic image of μ. For the set I = {μ'; S =*=> x_l Mμ' x_r =*=> x_l [x_m] x_r ∈ L3} of all indices which satisfy (7) is regular (of Type 3), but because of the Dyck structure of [x_m], LM = {[x_m]; x_l [x_m] x_r ∈ L3} is not regular but essentially context-free, i.e. of Type 2.

3. In the derivation underlying (7), essential use of branching rules of the type A → B1B2...Bk (k ≥ 2) has to be made, in the sense that the effect of these rules cannot be simulated by linear rules. Otherwise the central part [x_m] could only have linear and not trans-linear Dyck structure. Without branching rules the required trans-linear parenthetical structure could only be generated by the use of additional index-introducing rules in (7), in order to "store" and coordinate parentheses, which, however, would destroy the dependence of [x_m] on x_l and x_r.

4. For every n ≥ 1, L3 contains words w of the form

(8) w1...wk z1 wk z2 w(k-1) ... zk w1 z(k+1) m(wk)...m(w1),  where k = 2^n, wi ∈ {a,b}+ (1 ≤ i ≤ k), and the Dyck word z1...z(k+1) begins and ends with bracket runs of length n+1,

for which Mμ =*=> [x_m] holds as in (7), with μ = χ ft σ for some χ, σ ∈ F* and ft ∈ F, Mμ =*=> ut Wt[χt ft σ] vt, and Wt[χt ft σ] =+=> wt for a block wt of (8). The path Pt from Mμ to wt contains n B-nodes, since k = 2^n in (8). For every B-node Bj (0 ≤ j ≤ n), Bj[χj ft σ] =*=> Lj[χj ft σ] Rj[χj ft σ] and Lj[χj ft σ] =*=> u(j+1) B(j+1)[χ(j+1) ft σ] v(j+1) (Bj, B(j+1), Lj, Rj ∈ N; χj, χ(j+1), σ ∈ F*; ft ∈ F; u(j+1), v(j+1) ∈ {a,b,[,]}*). Every path Pj branching off from Pt at Bj[χj ft σ] leads to a word wj derived exclusively by discharging indices situated in μ below (or to the right of) ft.
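The decomposition of a word of L3 into x_l, [x_m] and x_r that the argument exploits can be made concrete by assembling words directly from definition (5.3); the helper below (the name l3_word is hypothetical) does this for given pieces w1,...,wn and a chosen split z1,...,z(n+1) of a Dyck word.

```python
# Sketch (hypothetical name): assemble a word of L3 from definition (5.3).
# Given w1..wn over {a,b}+ and a Dyck word split as z1,...,z_{n+1}:
#     x_l = w1 w2 ... wn
#   [x_m] = z1 wn z2 w_{n-1} ... zn w1 z_{n+1}
#     x_r = m(wn) m(w_{n-1}) ... m(w1)        (m = mirror image)
# The caller is assumed to supply zs whose concatenation is in D1.

def l3_word(ws, zs):
    assert len(zs) == len(ws) + 1 and all(w and set(w) <= {"a", "b"} for w in ws)
    x_l = "".join(ws)
    x_m = "".join(z + w for z, w in zip(zs, reversed(ws))) + zs[-1]
    x_r = "".join(w[::-1] for w in reversed(ws))
    return x_l + x_m + x_r

# The G3 derivation example above: n = 2, w1 = "ab", w2 = "b", z1 z2 z3 = "[", "[", "]]"
print(l3_word(["ab", "b"], ["[", "[", "]]"]))   # abb[b[ab]]bba
```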
Consequently, ft has to be deleted on every such path Pj before the appropriate indices become accessible; i.e. we get for every j with 0 ≤ j ≤ n:

Rj[χj ft σ] =*=> yj Cj[ft σ] zj  (Bj, Rj, Cj ∈ N; χj, σ ∈ F*; ft ∈ F)

Thus, for n > |N| in (8) (|N| being the cardinality of the nonterminal alphabet N of GI, and ignoring, as before, the constant amount of parenthesis-storing in the nonterminals), because of |{Cj; 0 ≤ j ≤ n}| ≤ |N| there are p ≠ q such that Cp = Cq = C and

Mμ =*=> up Rp[χp ft σ] vq Rq[χq ft σ] y,
Rp[χp ft σ] =*=> yp C[ft σ] zp =+=> yp z zp,
Rq[χq ft σ] =*=> yq C[ft σ] zq =+=> yq z zq
(μ = χ ft σ; χ, χp, χq, σ ∈ F*; ft ∈ F; M, Rp, Rq, C ∈ N; up, vq, y, yp, yq, zp, zq ∈ T*; z ∈ T+),

where z ∈ {z1 w1 ... zr wr z(r+1); wi ∈ {a,b}+ & z1...z(r+1) ∈ D1} (D1 the Dyck language from (5.3)). That is, GI generates words w' = x_l'[x_m']x_r' whose central part contains a duplication (of "z", in [x_m'] = y1 z y2 z y3) without correspondence in x_l' or x_r', thus contradicting the general form of the words of L3. Hence L3 is not indexed.

2.2 DI-Grammars and Linear Indexed Grammars
(Thanks to the anonymous referees for suggestions for this section and the next one.)

As already mentioned above, Gazdar in (Gazdar, 1988) introduced and discussed a grammar formalism, afterwards (e.g. in (Weir and Joshi, 1988)) called linear indexed grammars (LIGs), using index stacks in which only one nonterminal on the right-hand side of a rule can inherit the stack from the left-hand side; i.e. the rules of a LIG G = (N,T,F,P,S), with N, T, F, S as above, are of the form

i. A[..] → A1[]...Ai[..]...An[]
ii. A[..] → A1[]...Ai[f..]...An[]
iii. A[f..] → A1[]...Ai[..]...An[]
iv. A[] → a

where A1,...,An ∈ N, f ∈ F, and a ∈ T∪{ε}. The "derives" relation => is defined as follows:

αA[f1..fn]β => α A1[]...Ai[f1..fn]...An[] β, if A[..] → A1[]...Ai[..]...An[] ∈ P
αA[f1..fn]β => α A1[]...Ai[f f1..fn]...An[] β, if A[..] → A1[]...Ai[f..]...An[] ∈ P
αA[f f1..fn]β => α A1[]...Ai[f1..fn]...An[] β, if A[f..] → A1[]...Ai[..]...An[] ∈ P
αA[]β => α a β, if A[] → a ∈ P

=*=> is the reflexive and transitive closure of =>, and L(G) = {w; w ∈ T* & S[] =*=> w}.

Gazdar has shown that LIGs are a (proper) subclass of indexed grammars. Joshi, Vijay-Shanker, and Weir (Joshi, Vijay-Shanker, and Weir, 1989; Weir and Joshi, 1988) have shown that LIGs, Combinatory Categorial Grammars (CCGs), Tree Adjoining Grammars (TAGs), and Head Grammars (HGs) are weakly equivalent. Thus, if an inclusion relation can be shown to hold between DI-languages (DILs) and LILs, it simultaneously holds between the DIL-class and all members of the family. To simulate the restriction on stack transmission in a LIG G1 = (N1, T, F1, P1, S1), the following construction of a DI-grammar Gd suggests itself:
Let Gd = (N, T, F, P, S), where N − {S} = {X'; X ∈ N1}, F = {f'; f ∈ F1} ∪ {#}, and

P = {S → S1'#}
 ∪ {A' → A1'#...Ai'...An'#; A[..] → A1[]...Ai[..]...An[] ∈ P1}
 ∪ {A' → A1'#...Ai'f'...An'#; A[..] → A1[]...Ai[f..]...An[] ∈ P1}
 ∪ {A'f' → A1'#...Ai'...An'#; A[f..] → A1[]...Ai[..]...An[] ∈ P1}
 ∪ {A'# → a; A[] → a ∈ P1}.

It follows by induction on the number of derivation steps that for X' ∈ N, X ∈ N1, μ' ∈ F*, μ ∈ F1*, and w ∈ T*

(10) X'μ'# =*Gd=> w if and only if X[μ] =*G1=> w, where X' = h(X) and μ' = h(μ)

(h is the homomorphism from (N1∪F1)* into (N∪F)* with h(Z) = Z'). For the nontrivial part of the induction, note that A'#μ' cannot be terminated in Gd. Together with S => S1'#, (10) yields L(G1) = L(Gd). The inclusion of the LIG-class in the DI-class is proper, since L3 above is not a LIG-language; or, to give a simpler example: Lw = {a^n a1^n1 a2^n1 b1^n2 b2^n2 b^n | n = n1 + n2} is, according to (Vijay-Shanker, Weir and Joshi, 1987), not in TAL, hence not in LIL. But Lw (an indexed language) is generated by the DI-grammar Gw = ({S,A,B}, {a,b,a1,a2,b1,b2}, {f}, {S → aSfb, S → AB, Af → a1Aa2, Bf → b1Bb2, Af → a1a2, Bf → b1b2, A → ε, B → ε}, S).
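The construction of Gd from a LIG G1 is a purely rule-by-rule translation; the following Python sketch spells it out. The tuple encoding of LIG rules and the helper names (translate, prime) are assumptions of this illustration, not notation from the paper.

```python
# Sketch of the LIG -> DI-grammar translation of section 2.2.  A LIG rule is
# represented as a tuple (hypothetical encoding):
#   ("copy", A, rhs, i)       for  A[..]  -> A1[] .. Ai[..]  .. An[]
#   ("push", A, rhs, i, f)    for  A[..]  -> A1[] .. Ai[f..] .. An[]
#   ("pop",  A, rhs, i, f)    for  A[f..] -> A1[] .. Ai[..]  .. An[]
#   ("term", A, a)            for  A[]    -> a
# Translated DI rules are (lhs nonterminal, lhs index, rhs) with rhs a list of
# (symbol, pushed index); "#" seals the stacks of the non-designated daughters.

def prime(x):
    return x + "'"

def translate(lig_rules, start):
    di = [(("S", ""), [(prime(start), "#")])]            # S -> S1'#
    for r in lig_rules:
        if r[0] == "term":                               # A[] -> a  becomes  A'# -> a
            _, a_sym, a_term = r
            di.append(((prime(a_sym), "#"), [(a_term, "")]))
            continue
        kind, a_sym, rhs, i = r[0], r[1], r[2], r[3]
        f = r[4] if len(r) > 4 else None
        new_rhs = []
        for j, b_sym in enumerate(rhs):
            if j != i:
                new_rhs.append((prime(b_sym), "#"))      # sealed daughter
            elif kind == "push":
                new_rhs.append((prime(b_sym), prime(f))) # Ai'f' inherits the stack
            else:
                new_rhs.append((prime(b_sym), ""))       # Ai' inherits the stack
        lhs_index = prime(f) if kind == "pop" else ""    # A'f' -> ...  for pop rules
        di.append(((prime(a_sym), lhs_index), new_rhs))
    return di

# e.g. the LIG rule A[..] -> B[] C[f..] translates to A' -> B'# C'f'
print(translate([("push", "A", ["B", "C"], 1, "f")], "A"))
```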
2.3 Generalized Composition and Combinatory Categorial Grammars

The relation of DI-grammars to Steedman's Combinatory Categorial Grammars with Generalized Composition (GC-CCG for short) in the sense of (Weir and Joshi, 1988) is not so easy to determine. If for each n ≥ 1 composition rules of the form

(x/y) (...(y|1 z1)|2 ... |n zn) → (...(x|1 z1)|2 ... |n zn)  and  (...(y|1 z1)|2 ... |n zn) (x\y) → (...(x|1 z1)|2 ... |n zn)

are permitted (each |i being / or \), the generative power of the resulting grammars is known to be stronger than that of TAGs (Weir and Joshi, 1988). Now, the GC-CCG given by a lexicon f with f(ε) = {#}, f(a) = {A, X\A}, f(b) = {B, Y\B}, f([) = {K}, and suitable category sets for a1, b1 and ], generates a language Lc which, when intersected with the regular set {a,b}+{[,],a1,b1}+{a,b}+, yields a language Lp which, for reasons similar to those given for L3, is not even an indexed language. But Lp does not seem to be a DI-language either. Hence, since indexed languages and DI-languages are closed under intersection with regular sets, Lc is neither an indexed nor (so it appears) a DI-language. The problem with a comparison of DI-grammars and GC-CCGs is that, in spite of all appearances, the combination of generalized forward and backward composition can neither directly simulate nor be simulated by index distribution, or at least so it seems.

3 DI-Automata

An alternative way of characterizing the DI-languages is by means of the DI-automata defined below. DI-automata (dia) bear a remote resemblance to Aho's nested stack automata (nsa). They can best be viewed as pushdown automata (pda) with additional power: they can not only read and write on top of their pushdown store, but can also travel down the stack and (recursively) create new embedded substacks (which can be left only after deletion). dia's and nsa's differ in the following respects:

1. A dia is only allowed to begin to travel down the stack, i.e. to enter the stack-reading mode, if a tape symbol A on top of the stack has been deleted and stored in a special stack-reading state qA, and the stack-reading mode has to be terminated as soon as the first index symbol f from above is being scanned, in which case the index symbol concerned is deleted and an embedded stack is created, provided the transition function gives permission. Thus, every occurrence of an index symbol on the stack can be "consumed" only once, and only in combination with a "matching" non-index symbol. An nsa, on the other hand, embeds new stacks behind tape symbols which are preserved and can thus be used for further stack embeddings. This provides for part of the stack multiplication effect.

2. Moving through the stack in the stack-reading mode, a dia is not allowed to pass or skip an index symbol. Moreover, no scanning of the input and no change of state is permitted in this mode. An nsa, however, is allowed both to scan its input and to change its state in the stack-reading mode, which, together with the license to pass tape symbols repeatedly, provides for another part of the stack multiplication effect.

3. Unlike an nsa, a dia needs two tape alphabets, since only "index symbols" can be replaced by new stacks; moreover, it requires two sets of states in order to distinguish the pushdown mode from the stack-reading mode.

Formally, a DI-automaton is a 10-tuple D = (q, QΓ, T, Γ, I, δ, Z0, $, ¢, #), where q is the control state for the pushdown mode, QΓ = {qA; A ∈ Γ} is a finite set of stack-reading states, T a finite set of input symbols, Γ a finite set of storage symbols, I a finite set of index symbols with I ∩ Γ = ∅, Z0 ∈ Γ the initial storage symbol, $ the top-of-stack marker on the storage tape, ¢ the bottom-of-embedded-stack marker on the storage tape, and # the bottom marker of the storage tape, where $, ¢, # ∉ Γ ∪ T ∪ I. Let Dir = {−1, 0, 1} (for "1 step upwards", "stay", "1 step downwards", respectively), E = {0, 1} ("halt input tape", "shift input tape", respectively), T' = T ∪ {#}, Γ' = Γ ∪ {¢}. δ is a mapping

1) in the pushdown mode: from {q} × T' × $Γ into finite subsets of {q} × E × $Γ((Γ∪I)*);
2) in the stack-reading mode, for every A ∈ Γ:
 (a) from {qA} × T' × Γ into subsets of {qA} × {0} × {1} (for walking down the stack),
 (b) from {q} × T' × ${A} into subsets of {qA} × {0} × {$} (for initiating the stack-reading mode),
 (c) from {q} × T' × Γ' into subsets of {q} × {0} × {−1} (for climbing up the stack);
3) in the stack-creation mode: from QΓ × T' × I into finite subsets of {q} × {0} × $Γ((Γ∪I)*)¢, and from QΓ × T' × $I into finite subsets of {q} × {0} × $$Γ((Γ∪I)*)¢ (for replacing index symbols by new stacks, preserving the top-of-stack marker $);
4) in the stack-destruction mode: from {q} × T' × {$¢} into subsets of {q} × {0}.

As in the case of Aho's nested stack automaton, a configuration of a DI-automaton D is a quadruple (p, a1...an#, i, X1...^Xj...Xm), where
1. p ∈ {q} ∪ QΓ is the current state of D;
2. a1...an is the input string and # the input endmarker;
3. i (1 ≤ i ≤ n+1) is the position of the input head;
4. X1 = $A with A ∈ Γ, Xm = #, X2...X(m−1) ∈ (Γ ∪ I ∪ {$,¢})*; Xj is the stack symbol currently being read by the storage-tape head. If m = 1, then Xm = $#.

As usual, a relation |−D representing a move of the automaton is defined over the set of configurations:

(i) (q, a1...an#, i, α$^AYβ) |−D (q, a1...an#, i+d, α$^Z1...ZkYβ), if (q, d, $Z1...Zk) ∈ δ(q, ai, $A);
(ii) (p, a1...an#, i, X1...^Xj...Xm) |−D (qA, a1...an#, i, X1...Xj^X(j+1)...Xm), if (qA, 0, 1) ∈ δ(p, ai, Xj), where either Xj = $A and p = q, or Xj ≠ $A (A ∈ Γ) and p = qA;
(iii) (qA, a1...an#, i, X1...^Xj...Xm) |−D (q, a1...an#, i, X1...X(j−1) θ$^A1...Ak¢ X(j+1)...Xm), if (q, 0, $A1...Ak¢) ∈ δ(qA, ai, Xj), where Xj ∈ I and θ = ε, or Xj = $F (F ∈ I) and θ = $;
(iv) (q, a1...an#, i, X1...X(j−1)$^¢X(j+1)...Xm) |−D (q, a1...an#, i, X1...^X(j−1)X(j+1)...Xm), if (q, 0) ∈ δ(q, ai, $¢).

|−D* is the reflexive and transitive closure of |−D.
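The storage-tape surgery behind moves (iii) and (iv) can be pictured with a few lines of Python. In the sketch below, the field names and the decision to keep the markers $, ¢, # as separate tape cells are assumptions of this illustration, not part of the formal definition; it simply applies the two moves to a toy configuration.

```python
# Sketch (hypothetical names): DI-automaton configurations as plain tuples, plus
# the tape surgery of move (iii) (replace an index symbol by an embedded stack
# $ A1...Ak ¢) and move (iv) (delete an embedded stack emptied down to $¢).

from typing import List, Tuple

Config = Tuple[str, str, int, List[str], int]   # (state, input#, input pos, tape, head pos)

def create_embedded_stack(conf: Config, new_top: List[str]) -> Config:
    """Move (iii): the index symbol under the head is replaced by $ new_top ¢;
    afterwards the automaton is back in the pushdown state q at the new stack."""
    state, inp, i, tape, j = conf
    assert tape[j] not in ("$", "¢", "#")        # an index symbol is expected here
    tape = tape[:j] + ["$"] + new_top + ["¢"] + tape[j + 1:]
    return ("q", inp, i, tape, j)

def delete_embedded_stack(conf: Config) -> Config:
    """Move (iv): an embedded stack that has been emptied down to $¢ is removed."""
    state, inp, i, tape, j = conf
    assert tape[j] == "$" and tape[j + 1] == "¢"
    return ("q", inp, i, tape[:j] + tape[j + 2:], j)

# toy example: in state qZ the head has reached the index symbol f; simulate
# Zf -> Za by embedding the stack $Za¢ in its place
c = ("qZ", "ab#", 1, ["$", "f", "S", "#"], 1)
c = create_embedded_stack(c, ["Z", "a"])
print(c[3])    # ['$', '$', 'Z', 'a', '¢', 'S', '#']
d = ("q", "ab#", 2, ["$", "$", "¢", "S", "#"], 1)
print(delete_embedded_stack(d)[3])    # ['$', 'S', '#']
```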
N(D), the language accepted by empty stack by D, is defined as follows:

N(D) = {w; w ∈ T* & (q, w#, 1, $^Z0#) |−D* (q, w#, |w|+1, $^#)}.

To illustrate, the DI-automaton DI3 accepting L3 by empty stack is specified:

DI3 = (q (state for the pda mode), (QΓ =) {qA; A ∈ Γ} (states for the stack-reading mode), (T =) {a,b,[,]} (input alphabet), (Γ =) {S,M,Z,a,b,[,]} (tape symbols for the pda mode), (I =) {f,g} (tape symbols representing indices), δ, S, $, ¢, #), where for every x ∈ T:

δ(q,x,$S) = {(q,0,$aSfa), (q,0,$bSgb), (q,0,$M)}  (for the G3 rules S → aSfa, S → bSgb, S → M)
δ(q,x,$M) = {(q,0,$[M]), (q,0,$MM), (q,0,$Z)}  (for M → [M], M → MM, M → Z)
δ(q,x,$x) = {(q,1,$)}  (i.e.: if the input symbol x equals the "predicted" terminal symbol x on top of the stack, then shift the input tape one step ("1") and delete the successful prediction, replacing $x by $)
δ(q,x,$Z) contains (qZ,0,$)  (i.e.: change into the stack-reading mode in order to find the indices belonging to the nonterminal Z)
δ(qZ,x,$Y) and δ(qZ,x,Y) contain (qZ,0,1) for every x ∈ T, Y ∈ Γ  (i.e.: seek the first index symbol belonging to Z inside the stack)
δ(qZ,x,$f) = {(q,0,$$Za¢), (q,0,$$a¢)}, δ(qZ,x,$g) = {(q,0,$$Zb¢), (q,0,$$b¢)},
δ(qZ,x,f) = {(q,0,$Za¢), (q,0,$a¢)}, δ(qZ,x,g) = {(q,0,$Zb¢), (q,0,$b¢)}  (i.e.: simulate the index rules Zf → Za, Zf → a, Zg → Zb, Zg → b by the creation of embedded stacks)
δ(q,x,$¢) = {(q,0)}  (i.e.: delete an empty substack)
δ(q,x,Y) = {(q,0,−1)} for x ∈ T, Y ∈ Γ  (i.e.: move back to the top of the (sub)stack).
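Read as a finite table, the transition function of DI3 is just a map from (state, scanned storage symbol) to a set of actions; the fragment below (the dict encoding is hypothetical, and the input symbol x is omitted from the keys because these entries do not depend on it) lists a few entries to show how the rules of G3 reappear as stack actions.

```python
# A fragment of DI3's transition function as a Python dict (hypothetical encoding:
# keys are (state, storage symbol seen), values are sets of actions).

delta = {
    ("q", "$S"): {("q", 0, "$aSfa"), ("q", 0, "$bSgb"), ("q", 0, "$M")},  # S-rules of G3
    ("q", "$M"): {("q", 0, "$[M]"), ("q", 0, "$MM"), ("q", 0, "$Z")},     # M-rules of G3
    ("q", "$Z"): {("qZ", 0, "$")},                 # enter stack-reading mode, remembering Z
    ("qZ", "$f"): {("q", 0, "$$Za¢"), ("q", 0, "$$a¢")},   # simulate Zf -> Za and Zf -> a
    ("qZ", "$g"): {("q", 0, "$$Zb¢"), ("q", 0, "$$b¢")},   # simulate Zg -> Zb and Zg -> b
    ("q", "$¢"): {("q", 0)},                       # delete an emptied embedded stack
}

assert ("q", 0, "$aSfa") in delta[("q", "$S")]
```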
The following theorem expresses the equivalence of DI-grammars and DI-automata:

(11) DI-THEOREM: L is a DI-language (i.e. L is generated by a DI-grammar) if and only if L is accepted by a DI-automaton.

Proof sketch:

I. ("only if"; to facilitate a comparison, this part closely follows Aho's corresponding proof for indexed grammars and nsa's, Theorem 5.1 in (Aho, 1969).) If L is a DI-language, then there exists a DI-grammar G = (N,T,F,P,S) with L(G) = L. For every DI-grammar an equivalent DI-grammar in a normal form can be constructed in which each rule has the form A → BC, A → a, A → Bf or Af → B, with A ∈ N; B, C ∈ (N−{S}); a ∈ T; f ∈ F; and ε ∈ L(G) only if S → ε is in P. (The proof is completely analogous to the corresponding one for indexed grammars in (Aho, 1968) and is therefore omitted.) Thus we can assume without loss of generality that G is in normal form. A DI-automaton D such that N(D) = L(G) is constructed as follows: let D = (q, QΓ, T, Γ, I, δ, Z0, $, ¢, #), with Γ = N ∪ T, QΓ = {qA; A ∈ Γ}, I = F, Z0 = S, where δ is constructed in the following manner for all a ∈ T:

1. (q,0,$BC) ∈ δ(q,a,$A), if A → BC ∈ P
2. (q,0,$Bf) ∈ δ(q,a,$A), if A → Bf ∈ P
3. (q,0,$b) ∈ δ(q,a,$A), if A → b ∈ P (b ∈ T)
4. (q,1,$) ∈ δ(q,a,$a)
5. (qA,0,$) ∈ δ(q,a,$A), for all A ∈ N
6. (qA,0,1) ∈ δ(qA,a,B), for all A ∈ N and all B ∈ Γ
7. i. (q,0,$B¢) ∈ δ(qA,a,f) and ii. (q,0,$$B¢) ∈ δ(qA,a,$f), for all A ∈ N with Af → B ∈ P
8. (q,0) ∈ δ(q,a,$¢)
9. (q,0,−1) ∈ δ(q,a,B), for all B ∈ Γ∪{¢}
10. (q,0,$) ∈ δ(q,#,$S) if and only if S → ε is in P.

LEMMA 1.1. If (i) Af1...fk =n=> a1...am is a valid leftmost derivation in G with k ≥ 0, n ≥ 1 and A ∈ N, then for m ≥ 1, Z, β1,...,βk ∈ (N∪{¢})*, α ∈ (N∪{$,¢})*, μ ∈ (N∪F∪{¢})*:

(ii) (q, a1...am#, 1, α$^A Z β1f1...βkfk μ#) |−D* (q, a1...am#, m+1, α$^Z β1...βk μ#).

Proof by induction on n (i.e. the number of derivation steps). If n = 1, then (i) is of the form A => a, where a ∈ T and k = 0, since only a rule of the form A → a can be applied, because of the normal form of G and since in DI-grammars (unlike in indexed grammars) unconsumed indices cannot be swallowed up by terminals. Because of the construction of δ, (ii) is then of the form

(q, a#, 1, α$^A Z β1f1...βkfk μ#) = (q, a#, 1, α$^A Z μ#) |−D (q, a#, 1, α$^a Z μ#) |−D (q, a#, 2, α$^Z μ#) = (q, a#, 2, α$^Z β1...βk μ#).

Suppose Lemma 1.1 is true for all n' < n. A leftmost derivation Af1...fk =n=> a1...am can have one of the following three forms, according as A is expanded in the first step:

1) Af1...fi fi+1...fk => Bf1...fi Cfi+1...fk =n1=> a1...ai Cfi+1...fk =n2=> a1...ai ai+1...am, with n1, n2 < n (first step by a rule A → BC);
2) Af1...fk => Bff1...fk =n−1=> a1...am (first step by a rule A → Bf);
3) Af1f2...fk => Bf2...fk =n−1=> a1...am (first step by a rule Af1 → B).

For case 3, for instance, the following transition sequence of D is valid:

(q, a1...am#, 1, α$^A Z β1f1...βkfk μ#)
|−D (qA, a1...am#, 1, α$^Z β1f1...βkfk μ#)
|−D* (qA, a1...am#, 1, α$Zβ1^f1...βkfk μ#)
|−D (q, a1...am#, 1, α$Zβ1$^B¢ β2f2...βkfk μ#)
|−D* (q, a1...am#, m+1, α$Zβ1$^¢ β2...βk μ#)
|−D (q, a1...am#, m+1, α$Zσ^X β2...βk μ#)
|−D* (q, a1...am#, m+1, α$^Z β1β2...βk μ#), where σX = β1.

LEMMA 1.2. If, for Z, β1,...,βk ∈ (N∪{¢})*, α ∈ (N∪{$,¢})*, m ≥ 1, μ ∈ (N∪F∪{¢})*, (q, a1...am#, 1, α$^A Z β1f1...βkfk μ#) |−D* (q, a1...am#, m+1, α$^Z β1...βk μ#), then Af1...fk =*=> a1...am.

The proof (by induction on n) is similar to the proof of Lemma 1.1 and is therefore omitted.

II. ("if") If L is accepted by a DI-automaton D = (q, QΓ, T, Γ, I, δ, Z0, $, ¢, #), then we can assume without loss of generality a) that D writes at most two symbols onto a stack in either the pushdown mode or the stack-creation mode (it follows from the DI-automaton definition that the first of the two symbols cannot be an index symbol from I), and b) that T and I are disjoint. A DI-grammar G with L(G) = N(D) = L can be constructed as follows: let G = (N, T, F, P, S) with N = Γ, F = I, S = Z0. P contains for all a ∈ T, A, B, C ∈ N, and f ∈ F the productions (where d_a = a if d = 1, else d_a = ε):

1. A → d_a BC, if (q,d,$BC) ∈ δ(q,a,$A)
2. A → d_a Bf, if (q,d,$Bf) ∈ δ(q,a,$A)
3. A → d_a B, if (q,d,$B) ∈ δ(q,a,$A)
4. A → d_a, if (q,d,$) ∈ δ(q,a,$A)
5. Af → BC, if (q,0,$BC¢) ∈ δ(qA,a,f) or (q,0,$$BC¢) ∈ δ(qA,a,$f)
6. Af → B, if (q,0,$B¢) ∈ δ(qA,a,f) or (q,0,$$B¢) ∈ δ(qA,a,$f)

For all n ≥ 1, m ≥ 1, β1...βk ∈ (N∪{¢})*, α ∈ (N∪{$,¢})*, f1, f2,...,fk ∈ F, A ∈ N, μ ∈ (N∪F∪{¢})*, and a1...am ∈ T*, the statements II.1 and II.2 hold, where

II.1: If (q, a1...am#, 1, α$^A β1f1...βkfk μ#) |−D^n (q, a1...am#, m+1, α$^β1...βk μ#), then Af1...fk =*=> a1...am.

The proofs by induction of II.1 and II.2 (unlike the proofs of the corresponding lemmata for nsa's and indexed grammars, see (Aho, 1969)) are as elementary as the one given above for Lemma 1.1 and are omitted.

The DI-automaton concept can be used to show the inclusion of the class of DI-languages in the class of context-sensitive languages. The proof is structurally very similar to the one given by Aho (Aho, 1968) for the inclusion of the indexed class in the context-sensitive class: for every DI-automaton A, an equivalent DI-automaton A' can be constructed which accepts its input w if and only if A accepts w and which in addition uses a stack whose length is bounded by a linear function of the length of the input w. For A', a linear bounded automaton M (i.e. the type of automaton characteristic of the context-sensitive class) can be constructed which simulates A'. For reasons of space the extensive proof cannot be given here.
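The construction of δ from a normal-form DI-grammar in part I of the proof can be sketched as a small translation function. The tuple encoding of normal-form rules and the helper name build_delta below are assumptions of this illustration; only the rule-driven and stack-reading entries are generated, the input-matching, climbing and stack-deletion entries being omitted for brevity.

```python
# Sketch (hypothetical encoding): map normal-form DI rules to DI-automaton
# transitions.  Rules are tuples ("bin", A, B, C), ("term", A, a),
# ("push", A, B, f), ("pop", A, f, B).

def build_delta(rules, nonterminals, index_symbols):
    delta = {}                                   # (state, seen symbol) -> set of actions

    def add(key, action):
        delta.setdefault(key, set()).add(action)

    for r in rules:
        if r[0] == "bin":                        # A -> BC : expand on top of the stack
            _, A, B, C = r
            add(("q", "$" + A), ("q", 0, "$" + B + C))
        elif r[0] == "term":                     # A -> a  : predict the terminal a
            _, A, a = r
            add(("q", "$" + A), ("q", 0, "$" + a))
        elif r[0] == "push":                     # A -> Bf : push B with index f
            _, A, B, f = r
            add(("q", "$" + A), ("q", 0, "$" + B + f))
        elif r[0] == "pop":                      # Af -> B : consume f via an embedded stack
            _, A, f, B = r
            add(("q" + A, f), ("q", 0, "$" + B + "¢"))
            add(("q" + A, "$" + f), ("q", 0, "$$" + B + "¢"))
    for A in nonterminals:                       # enter and walk the stack-reading mode
        add(("q", "$" + A), ("q" + A, 0, "$"))
        for B in nonterminals | index_symbols:
            add(("q" + A, B), ("q" + A, 0, 1))
    return delta

# e.g. the G1-like rules A -> Bf and Bf -> b
d = build_delta([("push", "A", "B", "f"), ("pop", "B", "f", "b")], {"A", "B"}, {"f"})
print(sorted(d[("q", "$A")]))    # [('q', 0, '$Bf'), ('qA', 0, '$')]
```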
4 Some Remarks on the Complexity of DI-Recognition

The time complexity of the recognition problem for DI-grammars will only be considered for a subclass of DI-grammars. As the restriction on the form of the rules is reminiscent of the Chomsky normal form for context-free grammars (CFGs), the grammars in the subclass will be called DI-Chomsky-normal-form (DI-CNF) grammars. A DI-grammar G = (N,T,F,P,S) is a DI-CNF grammar if and only if each rule in P is of one of the following forms, where A, B, C ∈ N−{S}, f ∈ F, a ∈ T (plus S → ε, if ε ∈ L(G)):

(a) A → BC, (b) A → BfC, (c) A → BCf, (d) Af → BC, (e) Af → a, (f) A → a

The question whether the class of languages generated by DI-CNF grammars is a proper or improper subclass of the DI-languages will be left open.

In considering the recognition of DI-CNF grammars, an extension of the CKY algorithm for CFGs (Kasami, 1965; Younger, 1967) will be used which is essentially inspired by an idea of Vijay-Shanker and Weir in (Vijay-Shanker and Weir, 1991). Let the n(n+1)/2 cells of a CKY table for an input a1...an of length n be indexed by i and j (1 ≤ i ≤ j ≤ n), an entry for a nonterminal A in cell Z_{i,j} indicating that a derivation of ai...aj from A (with a suitable index stack) is valid. Since the nonterminal nodes of DI-derivation trees are labeled by pairs (A,μ) consisting of a nonterminal A and an index stack μ, and since the number of such pairs with (A,μ) =*=> w can grow exponentially with the length of w, intractability can only be avoided if index stacks can be encoded in such a way that substacks shared by several nodes are represented only once. Vijay-Shanker and Weir solved this problem for linear indexed grammars (LIGs) by storing for each node K not its complete label Af1f2...fn, but the nonterminal part A together with only the top f1 of its index stack and an indication of the cell where the label of a descendant of K can be found whose top index f2 continues the stack of its ancestor K. In the following, this idea will be adopted for DI-grammars, which, however, require a supplementation. Thus, if the cell Z_{i,j} of the CKY table contains an entry beginning with A and f1, then we know that Aμ =*=> ai...aj with μ = f1μ1 ∈ F* is valid, and further that the top index symbol f2 of μ1 (i.e. the continuation of f1) is in an entry of cell Z_{p,q} beginning with the nonterminal B. If, descending in such a manner and guided by pointer quadruples of the form (B,f2,p,q), an entry whose pointer component is empty is met, then, in the case of a LIG table, the bottom of the stack has been reached. So entries consisting of a nonterminal, a top index symbol and one pointer quadruple are sufficient for LIGs. But, of course, in the case of DI-derivations the bottom of the stack of a node, because of index distribution, does not coincide with the bottom of the stack of an arbitrary index-inheriting descendant. Rather, the bottom of the stack of a DI-node coincides with the bottom of the stack of its rightmost index-inheriting descendant. Therefore, the pointer mechanism for DI-entries has to be more complicated. In particular, it must be possible to add an "intersemital" pointer to a sister path. However, since the continuation of a unary stack (like that of a node labeled Cft) of a node without index-inheriting descendants is necessarily unknown at the time its entry is created in a bottom-up manner, it must be possible to add an intersemital pointer to an entry later on. That is why a DI-entry for a node K in a CKY cell requires an additional pointer to the entry for a descendant C which contains the end-of-stack symbol of K and which eventually has to be supplemented by an intersemital continuation pointer.
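The shape of a DI-entry with its two pointer quadruples and the later "intersemital" supplement can be fixed as a small record type; a sketch under the description above follows (the field names are hypothetical), instantiating the entry (14) discussed next.

```python
# Sketch (hypothetical field names) of a DI-entry in a CKY cell Z[i][j]: the
# nonterminal, its top index symbol, a quadruple pointing to the cell/entry where
# the index stack is continued, a quadruple pointing to the descendant carrying
# the end-of-stack symbol, and an optional intersemital pointer added later.

from dataclasses import dataclass
from typing import Optional, Tuple

Pointer = Tuple[str, str, int, int]        # (nonterminal, index symbol, i, j)

@dataclass
class DIEntry:
    nonterminal: str                       # e.g. "B1"
    top_index: Optional[str]               # e.g. "f2"; None if the stack is empty
    continuation: Optional[Pointer]        # where the next index symbol below lives
    end_of_stack: Optional[Pointer]        # descendant carrying the last index symbol
    supplement: Optional[Pointer] = None   # intersemital pointer, possibly added later

# the entry (14) below, with illustrative values for p, q, r, s
e = DIEntry("B1", "f2", ("D", "f3", 2, 5), ("C", "ft", 3, 4))
e.supplement = ("A2", "fc", 6, 9)          # added later, once the sister path is known
print(e)
```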
E.g. the entry

(14) <B1, f2, (D, f3, p, q), (C, ft, r, s)>

in Z_{i,j} indicates that the next symbol f3 below f2 on the index stack belonging to B1 can be found in cell Z_{p,q} in the entry for the nonterminal D; the second quadruple (C, ft, r, s) points to the descendant C of B1 carrying the last index ft of B1 and containing a place where a continuation pointer to a neighbouring path can be added or has already been added.

To illustrate the extended CKY algorithm, one of the more complicated cases of specifying an entry for the cell Z_{i,j} is sketched below; it dominates most of the other cases in time complexity:

FOR i := n TO 1 DO
 FOR j := i TO n DO
  FOR k := i TO j−1 DO
   for each rule A → A1fA2:
    if Z_{i,k} contains an entry for A1 with top index f1 and pointer quadruples (C1, f3, p1, q1) and (., ., s1, t1), for some C1 ∈ N, f1, f3 ∈ F, p1, q1, s1, t1,
    and Z_{k+1,j} contains an entry for A2 with some top index fc ∈ F,
    then
     1. Z_{i,j} := Z_{i,j} ∪ {a new entry for A whose top index and continuation pointer are determined from the continuation (C1, f3, p1, q1) of A1's stack below f (with a separate subcase when f is the last index of A1), and whose end-of-stack pointer is (A2, fc, k+1, j)};
     2. the entry in Z_{s1,t1} carrying the end-of-stack symbol of A1 is supplemented by the intersemital pointer (A2, fc, k+1, j).

The pointer (A2, fc, k+1, j) in the new entry of Z_{i,j} points to the cell of the node where the end of the stack of the newly created node with nonterminal A can be found. The same pointer (A2, fc, k+1, j) appears in cell Z_{s1,t1} as a "supplement", in order to indicate where the stack of A is continued behind the end-of-stack of A1. Note that supplemented quadruples of a cell Z_{i,j} are uniquely identifiable by their form, i.e. by the empty fourth component, and by the relation j ≤ r ≤ s. Supplemented quadruples cannot be used as entries for daughters of "active" nodes, i.e. nodes whose entries are currently being constructed.

Let a1...an be the input. The number of entries of the form <B, f1, (C, f2, p, q), (D, f3, r, s)> (f1, f2, f3 ∈ F; B, C, D ∈ N; 1 ≤ p, q, r, s ≤ n) …
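A much-simplified skeleton of the triple loop above, which ignores the pointer quadruples and the supplementation step and keeps only (nonterminal, top index) pairs per cell, might look as follows (all names are hypothetical, and the cell contents are deliberately reduced).

```python
# Simplified skeleton of the extended CKY pass: for a rule A -> A1 f A2 the
# entries of Z[i][k] and Z[k+1][j] are combined into a new entry of Z[i][j].
# Entries are reduced here to (nonterminal, top index) pairs only; the full
# algorithm also records the two pointer quadruples and the later supplement.

def extended_cky_pass(n, table, rules):
    """table[i][j] is a set of (nonterminal, top_index) pairs, 1 <= i <= j <= n."""
    for i in range(n, 0, -1):
        for j in range(i, n + 1):
            for k in range(i, j):
                for (A, A1, f, A2) in rules:          # rules of the form A -> A1 f A2
                    for (B, _) in table[i][k]:
                        for (C, _) in table[k + 1][j]:
                            if B == A1 and C == A2:
                                table[i][j].add((A, f))
    return table

n = 2
table = {i: {j: set() for j in range(i, n + 1)} for i in range(1, n + 1)}
table[1][1].add(("A1", None))
table[2][2].add(("A2", None))
print(extended_cky_pass(n, table, [("A", "A1", "f", "A2")])[1][2])   # {('A', 'f')}
```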