Handbook of algorithms for physical design automation part 58.pdf (Sổ tay thuật toán)

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C026 552 Finals Page 552 29-9-2008 #19 Handbook of Algorithms for Physical Design Automation Rmin v1 v v2 v3 … T (V1) vk FIGURE 26.14 If α1 and α2 satisfy the condition in Definition 1 at v1 , α2 is redundant. (From Shi, W. and Li, Z., IEEE Trans Computer-Aided Design, 24, 879, 2005. With permission.) node being processed. However, all candidates at this node must be propagated further upstream toward the source. This means the load seen at this node must be driven by some minimal amount of upstream wire or gate resistance. By anticipating the upstream resistance ahead of time, one can prune out more potentially inferior candidates earlier rather than later, which reduces the total number of candidates generated. More specifically, assume that each candidate must be driven by an upstream resistance of at least Rmin . The pruning based on anticipated upstream resistance is called predictive pruning. Definition 1 Predictive Pruning. Let α1 and α2 be two nonredundant candidates of T (v) such that C(α1 ) < C(α2 ) and Q(α1 ) < Q(α2 ). If Q(α2 ) − Rmin · C(α2 ) ≤ Q(α1 ) − Rmin · C(α1 ), then α2 is pruned. Predictive pruning preserves optimality. The general situation is shown in Figure 26.14. Let α1 and α2 be candidates of T (v1 ) that satisfy the condition in Definition 1. Using α1 instead of α2 will not increase delay from v to sinks in v2 , . . . , vk . It is easy to see C(v, α1 ) < C(v, α2 ). If Q at v is determined by T (v1 ), we have Q(v, α1 ) − Q(v, α2 ) = Q(v1 , α1 ) − Q(v1 , α2 ) − Rmin · [C(v1 , α1 ) − C(v1 , α2 )] ≥ 0 Therefore, α2 is redundant. Predictive pruning technique prunes more redundant solutions while guarantees optimality. It is one of four key techniques of fast algorithms proposed in Ref. [39]. In Ref. [42], significant speedup is achieved by simply extending predictive pruning technique to buffer cost. Aggressive predictive pruning technique, which uses a resistance larger than Rmin to prune candidates, is proposed in Ref. [43] to achieve further speedup with a little degradation of solution quality. 26.5.3 CONVEX PRUNING The basic data structure of van Ginneken’s algorithms is a sorted list of nondominated candidates. Both the pruning in van Ginneken’s algorithm and the predictive pruning are performed by comparing two neighboring candidates a time. However, more potentially inferior candidates can be pruned out by comparing three neighboring candidate solutions simultaneously. For three solutions in the sorted list, the middle one may be pruned according to convex pruning. Definition 2 Convex Pruning. Let α1 , α2 and, α3 be three nonredundant candidates of T(v) such that C(α1 ) < C(α2 ) < C(α3 ) and Q(α1 ) < Q(α2 ) < Q(α3 ). If Q(α3 ) − Q(α2 ) Q(α2 ) − Q(α1 ) < C(α2 ) − C(α1 ) C(α3 ) − C(α2 ) (26.25) then we call α2 nonconvex, and prune it. Convex pruning can be explained by Figure 26.15. Consider Q as the Y -axis and C as the X-axis. Then candidates are points in the two-dimensional plane. It is easy to see that the set of nonredundant Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C026 Finals Page 553 29-9-2008 #20 553 Buffer Insertion Basics q q4 q q4 q3 Pruned q3 q2 q1 (a) q1 c1 c2 c3 c4 c (b) c1 c3 c4 c FIGURE 26.15 (a) Nonredundant candidates N(v) and (b) nonredundant candidates M(v) after convex pruning. (From Li, Z. and Shi, W., IEEE Trans Computer-Aided Design, 25, 484, 2006. With permission.) candidates N(v) is a monotonically increasing sequence. Candidate α2 = (Q2 , C2 ) in the above definition is shown in Figure 26.15a, and is pruned in Figure 26.15b. The set of nonredundant candidates after convex pruning M(v) is a convex hull. For two-pin nets, convex pruning preserves optimality. Let α1 , α2 , and α3 be candidates of T (v) that satisfy the condition in Definition 2. In Figure 26.15, let the slope between α1 and α2 (α2 and α3 ) be ρ1,2 (ρ2,3 ). If candidate α2 is not on the convex hull of the solution set, then ρ1,2 < ρ2,3 . These candidates must have certain upstream resistance R including wire resistance and buffer/driver resistance. If R < ρ2,3 , α2 must become inferior to α3 when both candidates are propagated to the upstream node. Otherwise, R > ρ2,3 which implies R > ρ1,2 , and therefore α2 must become inferior to α1 . In other words, if a candidate is not on the convex hull, it will be pruned either by the solution ahead of it or the solution behind it. Please note that this conclusion only applies to two-pin nets. For multipin nets, when the upstream could be a merging vertex, nonredundant candidates that are pruned by convex pruning could still be useful. Convex pruning of a list of nonredundant candidates sorted in increasing (Q,C) order can be performed in linear time by Graham’s scan. Furthermore, when a new candidate is inserted to the list, we only need to check its neighbors to decide if any candidate should be pruned under convex pruning. The time is O(1), amortized over all candidates. In Refs. [40,41], the convex pruning is used to form the convex hull of nonredundant candidates, which is the key component of the O(bn2 ) algorithm and O(mn) algorithm. In Ref. [43], convex pruning (called squeeze pruning) is performed on both two-pin and multipin nets to prune more solutions with a little degradation of solution quality. 26.5.4 EFFICIENT WAY TO FIND BEST CANDIDATES Assume v is a buffer position, and we have computed the set of nonredundant candidates N (v) for T (v), where N (v) does not include candidates with buffers inserted at v. Now we want to insert buffers at v and compute N(v). Define Pi (v, α) as the slack at v if we add a buffer of type Bi for any candidate α: Pi (v, α) = Q(v, α) − R(Bi ) · C(v, α) − K(Bi ) (26.26) If we do not insert any buffer, then every candidate in N (v) is a candidate in N(v). If we insert a buffer, then for every buffer type Bi , i = 1, 2, . . . , b, there will be a new candidate βi : {Pi (v, α)} Q(v,βi ) = max α∈N (v) C(v,βi ) = C(Bi ) Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C026 554 Finals Page 554 29-9-2008 #21 Handbook of Algorithms for Physical Design Automation Define the best candidate for Bi as the candidate α ∈ N (v) such that α maximizes Pi (v, α) among all candidates in N (v). If there are multiple α’s that maximize Pi (v, α), choose the one with minimum C. In van Ginneken’s algorithm, it takes O(bn) to find one best candidate at each buffer position. According to convex pruning, it is easy to see that all best candidates are on the convex hull. The following lemma says that if we sort candidates in increasing Q and C order from left to right, then as we add wires to the candidates, we always move to the left to find the best candidates. Lemma 1 For any T(v), let nonredundant candidates after convex pruning be α1 , α2 , . . . , αk , in increasing Q and C order. Now add wire e to each candidate αj and denote it as αj + e. For any buffer type Bi , if αj gives the maximum Pi (αj ) and αk gives the maximum Pi (αk + e), then k ≤ j. The following lemma says the best candidate can be found by local search, if all candidates are convex. Lemma 2 For any T(v), let nonredundant candidates after convex pruning be α1 , α2 , . . . , αk , in increasing Q and C order. If Pi (αj−1 ) ≤ Pi (αj ), Pi (αj ) ≥ Pi (αj+1 ), then αj is the best candidate for buffer type Bi and Pi (α1 ) ≤ · · · ≤ Pi (αj−1 ) ≤ Pi (αj ) Pi (αj ) ≥ Pi (αj+1 ) ≥ · · · ≥ Pi (αk ) With the above two lemmas and convex pruning, one best candidate is found in amortized O(n) time in Ref. [40] and O(b) time in Ref. [41],∗ which are more efficient than van Ginneken’s algorithm. 26.5.5 IMPLICIT REPRESENTATION Van Ginnken’s algorithm uses explicit representation to store slack and capacitance values, and therefore it takes O(bn) time when adding a wire. It is possible to use implicit representation to avoid explicit updating of candidates. In the implicit representation, C(v, α) and Q(v, α) are not explicitly stored for each candidate. Instead, each candidate contains five fields: q, c, qa, ca, and ra.† When qa, ca and, ra are all 0, q and c give Q(v, α) and C(v, α), respectively. When a wire is added, only qa, ca, and ra in the root of the tree [39] or as global variables themselves [41] are updated. Intuitively, qa represents extra wire delay, ca represents extra wire capacitance, and ra represents extra wire resistance. It takes only O(1) time to add a wire with the implicit representation [39,41]. For example, in Ref. [41], when we reach an edge e with resistance R(e) and C(e), qa, ra, and ca are updated to reflect new values of Q and C of all previous candidates in O(1) time, without actually touching any candidate: qa = qa + R(e) · C(e)/2 + R(e) · ca ca = ca + C(e) ra = ra + R(e) ∗ † In Ref. [40], Lemma 1 is presented differently. It says if all buffers are sorted decreasingly according to driving resistance, then the best candidates for each buffer type in such order is from left to right. In Ref. [41], only two fields, q and c, are necessary for each candidate. qa, ca, and ra are global variables for each two-pin segment. Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C026 Finals Page 555 29-9-2008 #22 555 Buffer Insertion Basics The actual value of Q and C of each candidate α are decided as follows: Q(α) = q − qa − ra · c C(α) = c + ca (26.27) Implicit representation is applied on balance tree in Ref. [39], where the operation of adding a wire takes O(b log n) time. It is applied on a sorted linked list in Ref. [41], where the operation of adding a wire takes O(1) time. REFERENCES 1. J. Cong. An interconnect-centric design flow for nanometer technologies. Proceedings of IEEE, 89(4): 505–528, April 2001. 2. J. A. Davis, R. Venkatesan, A. Kaloyeros, M. Beylansky, S. J. Souri, K. Banerjee, K. C. Saraswat, A. Rahman, R. Reif, and J. D. Meindl. Interconnect limits on gigascale integration (GSI) in the 21st century. Proceedings of IEEE, 89(3): 305–324, March 2001. 3. R. Ho, K. W. Mai, and M. A. Horowitz. The future of wires. Proceedings of IEEE, 89(4): 490–504, April 2001. 4. A. B. Kahng and G. Robins. On Optimal Interconnections for VLSI. Kluwer Academic Publishers, Boston, MA, 1995. 5. J. Cong, L. He, C. -K. Koh, and P. H. Madden. Performance optimization of VLSI interconnect layout. Integration: The VLSI Journal, 21: 1–94, 1996. 6. P. Saxena, N. Menezes, P. Cocchini, and D. A. Kirkpatrick. Repeater scaling and its impact on CAD. IEEE Transactions on Computer-Aided Design, 23(4): 451–463, April 2004. 7. J. Cong. Challenges and opportunities for design innovations in nanometer technologies. SRC Design Sciences Concept Paper, 1997. 8. M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley & Sons, NY, 1993. 9. C. J. Alpert and A. Devgan. Wire segmenting for improved buffer insertion. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 588–593, 1997. 10. C. C. N. Chu and D. F. Wong. Closed form solution to simultaneous buffer insertion/sizing and wire sizing. ACM Transactions on Design Automation of Electronic Systems, 6(3): 343–371, July 2001. 11. L. P. P. P. van Ginneken. Buffer placement in distributed RC-tree networks for minimal Elmore delay. In Proceedings of the IEEE International Symposium on Circuits and Systems, New Orleans, LA, pp. 865–868, 1990. 12. J. Lillis, C. K. Cheng, and T. Y. Lin. Optimal wire sizing and buffer insertion for low power and a generalized delay model. IEEE Journal of Solid-State Circuits, 31(3): 437–447, March 1996. 13. N. Menezes and C. -P. Chen. Spec-based repeater insertion and wire sizing for on-chip interconnect. In Proceedings of the International Conference on VLSI Design, Goa, India, pp. 476–483, 1999. 14. L. -D. Huang, M. Lai, D. F. Wong, and Y. Gao. Maze routing with buffer insertion under transition time constraints. IEEE Transactions on Computer-Aided Design, 22(1): 91–95, January 2003. 15. C. J. Alpert, A. B. Kahng, B. Liu, I. I. Mandoiu, and A. Z. Zelikovsky. Minimum buffered routing with bounded capacitive load for slew rate and reliability control. IEEE Transactions on Computer-Aided Design, 22(3): 241–253, March 2003. 16. C. Kashyap, C. J. Alpert, F. Liu, and A. Devgan. Closed form expressions for extending step delay and slew metrics to ramp inputs. In Proceedings of the ACM International Symposium on Physical Design, Monterey, CA, pp. 24–31, 2003. 17. H. B. Bakoglu. Circuits, Interconnections and Packaging for VLSI. Addison-Wesley, Reading, MA, 1990. 18. N. H. E. Weste and K. Eshraghian. Principles of CMOS VLSI Design: A System Perspective. Addison-Wesley Publishing Company, Reading, MA, 1993. 19. S. Hu, C. J. Alpert, J. Hu, S. Karandikar, Z. Li, W. Shi, and C. -N. Sze. Fast algorithms for slew constrained minimum cost buffering. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 308–313, 2006. Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C026 556 Finals Page 556 29-9-2008 #23 Handbook of Algorithms for Physical Design Automation 20. J. Cong and C. K. Koh. Simultaneous driver and wire sizing for performance and power optimization. IEEE Transactions on VLSI Systems, 2(4): 408–425, December 1994. 21. S. S. Sapatnekar. RC interconnect optimization under the Elmore delay model. In Proceedings of the ACM/IEEE Design Automation Conference, San Diego, CA, pp. 392–396, 1994. 22. J. Cong and K. -S. Leung. Optimal wiresizing under the distributed Elmore delay model. IEEE Transactions on Computer-Aided Design, 14(3): 321–336, March 1995. 23. J. P. Fishburn and C. A. Schevon. Shaping a distributed RC line to minimize Elmore delay. IEEE Transactions on Circuits and Systems, 42(12): 1020–1022, December 1995. 24. C. P. Chen, Y. P. Chen, and D. F. Wong. Optimal wire-sizing formula under the Elmore delay model. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 487–490, 1996. 25. C. J. Alpert, A. Devgan, J. P. Fishburn, and S. T. Quay. Interconnect synthesis without wire tapering. IEEE Transactions on Computer-Aided Design, 20(1): 90–104, January 2001. 26. A. Devgan. Efficient coupled noise estimation for on-chip interconnects. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 147–151, 1997. 27. C. J. Alpert, A. Devgan, and S. T. Quay. Buffer insertion for noise and delay optimization. IEEE Transactions on Computer-Aided Design, 18(11): 1633–1645, November 1999. 28. C. C. N. Chu and D. F. Wong. A new approach to simultaneous buffer insertion and wire sizing. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 614–621, 1997. 29. W. C. Elmore. The transient response of damped linear networks with particular regard to wideband amplifiers. Journal of Applied Physics, 19: 55–63, January 1948. 30. F. J. Liu, J. Lillis, and C. K. Cheng. Design and implementation of a global router based on a new layoutdriven timing model with three poles. In Proceedings of the IEEE International Symposium on Circuits and Systems, Hong Kong, China, pp. 1548–1551, 1997. 31. J. Qian, S. Pullela, and L. T. Pillage. Modeling the effective capacitance for the RC interconnect of CMOS gates. IEEE Transactions on Computer-Aided Design, 13(12): 1526–1535, December 1994. 32. S. R. Nassif and Z. Li. A more effective Ceff . In Proceedings of the IEEE International Symposium on Quality Electronic Design, San Jose, CA, pp. 648–653, 2005. 33. B. Tutuianu, F. Dartu, and L. Pileggi. Explicit RC-circuit delay approximation based on the first three moments of the impulse response. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 611–616, 1996. 34. C. J. Alpert, F. Liu, C. V. Kashyap, and A. Devgan. Closed-form delay and slew metrics made easy. IEEE Transactions on Computer-Aided Design, 23(12): 1661–1669, December 2004. 35. C. J. Alpert, A. Devgan, and S. T. Quay. Buffer insertion with accurate gate and interconnect delay computation. In Proceedings of the ACM/IEEE Design Automation Conference, New Orleans, LA, pp. 479–484, 1999. 36. C. -K. Cheng, J. Lillis, S. Lin, and N. Chang. Interconnect Analysis and Synthesis. Wiley Interscience, New York, 2000. 37. S. Hassoun, C. J. Alpert, and M. Thiagarajan. Optimal buffered routing path constructions for single and multiple clock domain systems. In Proceedings of the IEEE/ACM International Conference on ComputerAided Design, San Jose, CA, pp. 247–253, 2002. 38. P. Cocchini. A methodology for optimal repeater insertion in pipelined interconnects. IEEE Transactions on Computer-Aided Design, 22(12): 1613–1624, December 2003. 39. W. Shi and Z. Li. A fast algorithm for optimal buffer insertion. IEEE Transactions on Computer-Aided Design, 24(6): 879–891, June 2005. 40. Z. Li and W. Shi. An O(bn2 ) time algorithm for buffer insertion with b buffer types. IEEE Transactions on Computer-Aided Design, 25(3): 484–489, March 2006. 41. Z. Li and W. Shi. An O(mn) time algorithm for optimal buffer insertion of nets with m sinks. In Proceedings of Asia and South Pacific Design Automation Conference, Yokohama, Japan, pp. 320–325, 2006. 42. W. Shi, Z. Li, and C. J. Alpert. Complexity analysis and speedup techniques for optimal buffer insertion with minimum cost. In Proceedings of Asia and South Pacific Design Automation Conference, Yokohama, Japan, pp. 609–614, 2004. 43. Z. Li, C. N. Sze, C. J. Alpert, J. Hu, and W. Shi. Making fast buffer insertion even faster via approximation techniques. In Proceedings of Asia and South Pacific Design Automation Conference, Shanghai, China, pp. 13–18, 2005. Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C027 Finals Page 557 23-9-2008 #2 27 Generalized Buffer Insertion Miloš Hrkić and John Lillis CONTENTS 27.1 Introduction.. . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.2 Two-Phase Approach and Buffer-Aware Tree Construction . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.2.1 C-Tree Algorithm .. . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.2.2 Buffer Tree Topology Generation .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.3 Simultaneous Tree Construction and Buffer Insertion . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.3.1 P-Tree Algorithm . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.3.2 S-Tree Algorithm . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.3.3 SP-Tree Algorithm .. .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 27.3.4 Complete Tree Topology Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . . . 557 560 560 561 562 562 564 566 566 566 27.1 INTRODUCTION It has been widely recognized that interconnect is a dominating factor in modern very large scale integration (VLSI) circuit designs. Chapter 26 gave an overview of challenges that interconnect faces and introduced a technique called repeater insertion that has proven to be very efficient in addressing emerging interconnect issues. Early work on repeater insertion focused mainly on improving interconnect timing performance. The most influential work is van Ginneken’s dynamic programming algorithm [1]. The algorithm performs buffer insertion on a fixed and embedded tree (e.g., as given by a global router) and produces an optimal timing solution under Elmore delay model [2]. Various generalizations of van Ginneken’s algorithm have appeared in the literature taking into account issues of practical importance such as buffer libraries with inverting and noninverting buffers, simultaneous wire sizing, and slew-based delay models. Additionally, generalizations that address natural constrained optimization variants of the problem (e.g., minimization of area or power consumption subject to timing constraints) have also appeared. Progress has also been made in improving computational complexity as well as practical runtime. Many of these results are presented in Chapter 26. A significant limitation of van Ginneken’s approach is that it requires a fixed and embedded tree that has to be provided in advance. This constraint forces the final buffered solution quality to depend on the input tree. Even though the algorithm provides an optimal timing solution for a given tree, it will produce a poor solution when given a poor tree. A few example scenarios that are very common in practice can be used to illustrate this limitation. As noted earlier, one of the basic interconnect optimization tasks is delay minimization. Given that sinks may have very different required signal arrival time constraints, a routing solution that focuses only on, for example, minimizing wirelength may not be good enough. In Figure 27.1, sinks F and G are timing critical while the others are not. Configuration in Figure 27.1a has better wirelength, but the buffering cost is very high. On the other hand, configuration in Figure 27.1b can achieve better timing results with slightly more wirelength but many fewer buffers. 557 Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C027 558 Finals Page 558 23-9-2008 #3 Handbook of Algorithms for Physical Design Automation A B A C B C D D E F G G (a) E F (b) FIGURE 27.1 Buffering example: Sinks F and G are assumed to be critical; tree (a) has slightly smaller wirelength but requires more buffers (and may prevent timing constraints on F and G from being met) than the tree (b). + (a) − + − + − + − + − + − (b) FIGURE 27.2 Buffering example: To meet signal polarity requirements, the number of buffers that is required varies significantly from one topology to another. In some cases, certain sinks of a net require input signals of inverted polarity. Choices made during route construction can have a large impact on the cost of buffering solutions, as we can see in Figure 27.2. The two solutions Figure 27.2 have very different buffer and wiring costs. Figure 27.3 shows a simple example illustrating the issues raised during buffering and routing in the presence of blockages. In configuration of Figure 27.3a, the route goes over the blockage and cannot be buffered (thus, possibly violating timing, load, or slew constraints). If the route completely avoids the blockage, the resulting solution is expensive in terms of wire and buffer costs (Figure 27.3b). Finally, by being aware of different types of blockages, configuration in Figure 27.3c dominates both in delay and resource usages/costs. Recently, some designs have reserved internal areas of macroobjects for buffering of external nets (e.g., the whitespace in macros as in Figure 27.4). Any buffer insertion algorithm that has to work on a route that is not aware of the layout specifics will have limited chances of success. Referring to Figure 27.4, assuming that sink A is critical and the others are not, the two solutions in Figure 27.4 can have significant quality difference (e.g., cost or timing characteristics). In other practical formulations, routing or buffering feasibility is not considered a zero or one property (blocked or free). Instead, a complex cost function based on the local and global design densities and congestions should drive routing and buffering algorithms; such formulations can prevent overconstraining the design space, but require incremental interaction with placers and routers. Even more, the overall design closure can suffer because irresponsible use of buffering resources on nets (or portions of nets) that are not critical can prevent other critical nets from meeting their constraints.∗ Given the examples above, routing and buffering algorithms should be able to account for the cost/performance trade-off of the solutions that they produce. Generating the fastest buffering solution ∗ Some of the approaches that are specifically designed to target blockages (routing or placement) as well as design density and congestion are presented in more detail in Chapter 28. However, some of the ideas will be reviewed in this chapter because they are among the core components of some tree synthesis and buffering algorithms. Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C027 Finals Page 559 23-9-2008 #4 559 Generalized Buffer Insertion (a) (b) (c) FIGURE 27.3 Buffering example: Depending on the interaction between routes and blockages, buffered solution can be (a) infeasible, (b) expensive, (c) or not bad at all. C 3 1 2 B 2 D A (a) C 3 1 B D A (b) FIGURE 27.4 Buffering example: With increasing complexity of constraints, ability of buffering algorithms to handle such constraints is becoming more important. Assuming that sink A in critical, solutions (a) and (b) can have significant quality difference. may be necessary for some nets, but if applied to all nets, the design would quickly become too expensive (e.g., in area and power usage), or even become impossible to manufacture. In addition, algorithm complexity and runtime is a very important practical factor given that hundreds of thousands of nets may need to be buffered within a given CPU time budget. In the following sections, we give an overview of recent research that addresses one or more of the problems mentioned above. This area of research is still very active and our summary presents only a snapshot of the past and current research. The majority of techniques that address problems mentioned above can be placed in one of the two categories. Several works propose a two-stage sequential method where a buffer-aware tree is constructed first, followed by van Ginneken style buffer insertion as in Refs. [3–6]. These techniques have small execution time with some sacrifice in solution quality and predictability. In Section 27.2, we describe techniques from Refs. [3,6] in more detail. A more robust and predictable approach proposes simultaneous route construction while performing buffer insertion. An example is the buffered P-Tree class of algorithms [7], which integrates buffer insertion into the P-Tree Steiner tree construction algorithm [8]. The P-Tree algorithm introduced a paradigm of finding an optimal solution in a constrained, but very large, space including topological, embedding, and buffering degrees of freedom, as opposed to applying ad hoc heuristics. Section 27.3 presents methods for simultaneous routing tree construction and buffer insertion from Refs. [7–12]. Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C027 560 Finals Page 560 23-9-2008 #5 Handbook of Algorithms for Physical Design Automation 27.2 TWO-PHASE APPROACH AND BUFFER-AWARE TREE CONSTRUCTION 27.2.1 C-TREE ALGORITHM The work in Ref. [3] addresses the problem of buffering under timing and polarity constraints. Given a net with placed pins, timing and polarity requirements at sinks, driver properties, a buffer library, and the technology’s interconnect parasitics, the goal is to find a Steiner tree that, after buffer insertion, meets timing constraints while minimizing solution cost (i.e., wire and buffer usage). A two-phase flow is proposed: a buffer-aware Steiner tree construction called C-Tree is followed by a van Ginneken style buffer insertion. It is argued that an optimal buffer insertion on a fixed and routed tree can produce good/optimal results as long as it is given the right Steiner tree. However, in practice, instead of finding the right tree (which is very difficult because the tree construction algorithm is not optimizing the true objective) one can construct a buffer-aware Steiner tree, which tries to anticipate potential buffer locations. The main idea in C-Tree (clustered tree) is to construct a tree in two stages. First, sinks are clustered based on a distance metrics (timing criticality, polarity requirements, physical distance). Then, lower level trees are constructed on each cluster. After determining tapping points for each cluster, the top-level timing-driven tree is constructed, connecting the driver with cluster tapping points. Merging the top-level tree with cluster trees yields a final tree for the entire net. Sink properties used for clustering are spatial (physical location coordinates), temporal (required arrival times), and polarity. The distance metrics incorporate all three elements. They are defined separately and then combined using scaling factors into a single distance metric. The spatial distance is given by sDist(si , sj ) = |x(si )−x(sj )|+|y(si )−y(sj )|. Polarity distance is defined as pDist(si , sj ) = |pol(si ) − pol(sj )|. As for the temporal distance, Ref. [3] argues that required arrival time is not the only indicator of sink criticality. For example, if two sinks s1 and s2 have the same required arrival time, and s1 is further away from the driver, then s1 is more critical because it is harder to achieve the same required arrival time over the longer distance. Thus, an estimate of the achievable delay is used to adjust required arrival time and obtain achievable slack. It is further argued that the difference in achievable slacks (AS) may not yet be good enough. For example, if AS(s1 ) = −1 ns, AS(s2 ) = 1 ns, and AS(s3 ) = 10 ns, sinks s1 and s2 seem closer although in practice s1 is the only critical sink because s2 and s3 have high-positive AS. Thus, the sink criticality is defined as crit(si ) = eα[mAS−AS(si )]/(aAS−mAS) , where mAS and aAS are the minimum and average AS values over all sinks and α > 0 is a user parameter. The criticality is a value between 0 and 1, where 1 is the most critical (the average sink criticality by this formula is closer to noncritical). The temporal distance tDist(si , sj ) is now defined as the difference in sink criticality. Finally, the distance metric is a linear combination of spatial, temporal, and polarity distances (noting that spatial distance is normalized by spatial diameter sDiam(N) defined as the maximum distance between the sinks): β[s Dist(si , sj )/s Diam(N)] + (1 − β)t Dist(si , sj ) + p Dist(si , sj ). The clustering itself is done using K-center heuristics. It is an iterative approach, which identifies sinks that are furthest away and labels them as cluster seeds. The remaining sinks are then clustered around the closest seed. More details can be found in Ref. [3]. Once the clusters are determined, timing-driven Steiner trees are constructed on each cluster and one on the top level using the Prim–Dijkstra algorithm from Ref. [13]. The experimental results show that this technique often exhibits a good trade-off between runtime and the quality of results (i.e., providing good solutions on the average in terms of both the cost and the delay while keeping low runtime). In addition, this method is not very complicated to implement. One should be aware of the fact that this algorithm is not designed to handle obstacles and design congestion in general, so results may not be very predictable in those scenarios. Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C027 Finals Page 561 23-9-2008 #6 561 Generalized Buffer Insertion 27.2.2 BUFFER TREE TOPOLOGY GENERATION A more recent work [6] also recognizes the problem of buffering fixed trees, together with the growing problem of design size, where millions of nets have to be optimized in a reasonable amount of time. This work presents a new algorithm for generation of tree topologies that are buffer-friendly. The algorithm balances achieving the signal required arrival time constraints and minimizing wirelength. Let us first explain the notion of the tree topology in this work (we will refer to it as a partially embedded tree topology). Figure 27.5a shows a partially embedded tree topology. It is a directed tree structure where each node except the root has only one input edge, each internal node has exactly two output edges, while the root has only one output edge. In addition, each node has an assigned placement location (placement overlap is allowed). However, the embeddings of the edges (i.e., routes) as well as the number of buffers and buffer placements are not specified. An example of a completely embedded and buffered tree topology is given in Figure 27.5b. Once the partially embedded topology tree is constructed, many of the known techniques can be used to perform two-pin routing and buffer insertion between the tree nodes (i.e., Refs. [14–16]). As opposed to the approach in Ref. [3], subtree parities (i.e., signal polarities) are resolved locally because inverters are being used for buffering. The algorithm proceeds in the following manner. First, sinks are ordered based on criticality (the most critical first). In a manner similar to Ref. [3], criticality estimation is based on estimated slack rather than only relying on sink required arrival time. To estimate the delay from the driver to sinks, a linear delay model is used (similar to Ref. [5]) augmented by estimated buffer intrinsic delays and loads. The assumption is that these paths are going to be buffered eventually so the algorithm accounts for the delay that the path is going to have after buffering. In Ref. [6], some additional experiments are performed to justify this assumption and results show good correlation between estimates and final results. When the ordering is complete, sinks are added to the topology one at a time (the initial topology consists of the driver and the most critical sink only). A single sink insertion is performed by examining all edges in the current topology and finding the closest tapping point within the bounding box of the edge terminals (note that the topology is partially embedded and all nodes have fixed placement locations). The edge for which the overall slack has the best value is chosen and sink insertion is performed by breaking that edge and inserting a new internal node to the tree. The parent of the new node is the source of the chosen edge and the children are the newly inserted sink and the destination node of the chosen edge. By keeping the arrival times at each topology node, a single sink insertion can be performed in linear time, giving the overall quadratic algorithm complexity (note that each operation is fairly simple, which leads to a very small execution time). In addition, Ref. [6] proves theoretical lower bounds on slack and wirelength in two extreme cases: sinks close to the driver and sinks having large noncritical required arrival times. Among the D D A (a) C B A (b) C B FIGURE 27.5 (a) Partially embedded routing tree topology and (b) completely embedded and buffered tree.

Handbook of algorithms for physical design automation part 58

Nội dung