Handbook of Algorithms for Physical Design Automation, part 92 (algorithm handbook)

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C042 892 Finals Page 892 23-9-2008 #13 Handbook of Algorithms for Physical Design Automation

FIGURE 42.11 Clock hazards and timing constraints.

$\mathrm{FF}_i$ and $\mathrm{FF}_j$ as shown in Figure 42.11. Let $t_i$ and $t_j$ be the clock delays from the clock source to $\mathrm{FF}_i$ and $\mathrm{FF}_j$, respectively. Let $D_{ij}$ be the set of all combinational path delays from $\mathrm{FF}_i$ to $\mathrm{FF}_j$. Let $t_i^{clk2q}$ be the clock-to-Q delay for $\mathrm{FF}_i$. Let $t_j^{setup}$ and $t_j^{hold}$ be the setup time and hold time for $\mathrm{FF}_j$, respectively. Let $P$ be the clock period. The setup time and hold time constraints can be expressed as

$$t_i + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) + t_j^{setup} \le t_j + P \qquad (42.2)$$

$$t_i + t_i^{clk2q} + \mathrm{MIN}(D_{ij}) \ge t_j + t_j^{hold} \qquad (42.3)$$

A clock schedule is a set of delays from the clock source to all registers in the synchronous system. The clock scheduling problem is to find a clock schedule $\{t_1, \ldots, t_N\}$ for all registers $\mathrm{FF}_1, \ldots, \mathrm{FF}_N$ that minimizes the clock period $P$ while satisfying the constraints in Equations 42.2 and 42.3. This problem can be formulated as a linear program as follows [20]:

LP_SPEED: Minimize $P$
subject to
$$t_j - t_i \ge t_j^{setup} + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) - P \quad \text{for } i, j = 1, \ldots, N$$
$$t_i - t_j \ge t_j^{hold} - t_i^{clk2q} - \mathrm{MIN}(D_{ij}) \quad \text{for } i, j = 1, \ldots, N$$
$$t_i \ge \mathrm{MIN\_DELAY} \quad \text{for } i = 1, \ldots, N$$

Alternatively, we can find a clock schedule to maximize the minimum safety margin $M$ for a given clock period $P$. This problem can be formulated as a linear program as follows:

LP_SAFETY: Maximize $M$
subject to
$$t_j - t_i \ge t_j^{setup} + t_i^{clk2q} + \mathrm{MAX}(D_{ij}) - P + M \quad \text{for } i, j = 1, \ldots, N$$
$$t_i - t_j \ge t_j^{hold} - t_i^{clk2q} - \mathrm{MIN}(D_{ij}) + M \quad \text{for } i, j = 1, \ldots, N$$
$$t_i \ge \mathrm{MIN\_DELAY} \quad \text{for } i = 1, \ldots, N$$

In both formulations, $\mathrm{MAX}(D_{ij}) = -\infty$ and $\mathrm{MIN}(D_{ij}) = \infty$ if there is no combinational path from $\mathrm{FF}_i$ to $\mathrm{FF}_j$.

After the clock schedule $S = \{t_1, \ldots, t_N\}$ is computed, the next step is to construct a clock network to realize the obtained schedule.
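As a rough illustration of LP_SPEED, the sketch below finds the minimum period without a general LP solver. This is a swapped-in technique, not the method of Ref. [20]: for a fixed $P$ both constraint families are difference constraints on the $t_i$, so feasibility can be checked with Bellman-Ford, and $P$ is then found by binary search. The function names (`feasible_schedule`, `min_period`), the data layout, and the two-register example delays are all hypothetical. Register pairs with no combinational path are simply left out of `paths`.

```python
from typing import Dict, List, Optional, Tuple

# (i, j) -> (MAX D_ij, MIN D_ij) for every pair with a combinational path
Paths = Dict[Tuple[int, int], Tuple[float, float]]


def feasible_schedule(n: int, paths: Paths, setup: List[float], hold: List[float],
                      clk2q: List[float], period: float) -> Optional[List[float]]:
    """Return clock delays satisfying Eqs. 42.2 and 42.3 for `period`, or None."""
    # A constraint t_u - t_v <= w is modeled as an edge v -> u with weight w.
    edges = []
    for (i, j), (dmax, dmin) in paths.items():
        # setup: t_i - t_j <= P - t_j^setup - t_i^clk2q - MAX(D_ij)
        edges.append((j, i, period - setup[j] - clk2q[i] - dmax))
        # hold:  t_j - t_i <= t_i^clk2q + MIN(D_ij) - t_j^hold
        edges.append((i, j, clk2q[i] + dmin - hold[j]))
    dist = [0.0] * n  # same effect as a virtual source with zero-weight edges
    for _ in range(n):
        changed = False
        for v, u, w in edges:
            if dist[v] + w < dist[u] - 1e-12:
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            return dist
    return None  # still relaxing after n passes: negative cycle, infeasible


def min_period(n: int, paths: Paths, setup: List[float], hold: List[float],
               clk2q: List[float], min_delay: float = 0.0,
               tol: float = 1e-9) -> Optional[Tuple[float, List[float]]]:
    """Binary search on P; returns (period, schedule) or None if hold fails."""
    def attempt(p: float) -> Optional[List[float]]:
        return feasible_schedule(n, paths, setup, hold, clk2q, p)

    # Upper bound: large enough that every cycle using a setup edge is nonnegative.
    hi = sum(setup[j] + clk2q[i] + dmax for (i, j), (dmax, _) in paths.items())
    hi += sum(max(0.0, hold[j] - clk2q[i] - dmin) for (i, j), (_, dmin) in paths.items())
    if attempt(hi) is None:
        return None  # the hold constraints alone are inconsistent
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if attempt(mid) is None:
            lo = mid
        else:
            hi = mid
    schedule = attempt(hi)
    shift = min_delay - min(schedule)  # a uniform shift keeps all skews unchanged
    return hi, [t + shift for t in schedule]


# Hypothetical two-register loop: FF0 -> FF1 takes 8 units, FF1 -> FF0 takes 2.
paths = {(0, 1): (8.0, 8.0), (1, 0): (2.0, 2.0)}
zeros = [0.0, 0.0]
result = min_period(2, paths, setup=zeros, hold=zeros, clk2q=zeros)
period, schedule = result
print(period, schedule)
```

With zero skew this loop would need a period of 8; delaying the clock of FF1 by about 3 units lets the two paths share the cycle time, so the period approaches the average loop delay of 5.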
The DME algorithm in Section 42.2.4 can be easily extended to handle this problem. We only need to construct the merging segments to achieve the given skews instead of zero skews in the bottom-up phase of the DME algorithm. However, the solutions of the linear programs may not be unique. Each clock delay $t_i$ could be a range rather than a fixed value. In this case, the clock routing problem becomes the bounded-skew routing tree (BST) problem. In Ref. [21], Cong et al. proposed two algorithms, BME (boundary merging and embedding) and IME (interior merging and embedding), to handle this problem. These two algorithms extend the DME algorithm by finding a polygonal region based on the skew bounds, rather than a merging segment, to represent all possible locations for each tapping point.

Apart from the original formulations of clock scheduling, there are some other extensions. Neves and Friedman [22] formulated the process-variation-tolerant optimal clock skew scheduling problem. To better control the effects of process variations, they find the permissible range (i.e., the range of the clock skew without timing violation) for each local path, select a clock skew value that allows a maximum variation of skew within the permissible range, and finally determine the clock delay to each register.

Recently, Ravindran et al. [23] discussed the multidomain clock skew scheduling problem. For a given number of clocking domains $n$ and a maximum permissible within-domain latency $\delta$, the multidomain range constraints require that all clock latencies must fit into $n$ value ranges $(l(d_i), l(d_i) + \delta)$ for $i = 1, \ldots, n$. The objective of multidomain clock skew scheduling is to determine domain phase shifts $l(d_i)$ and register latencies that satisfy the clock domain constraints and minimize the clock period.
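The permissible range of a local path can be read directly off Equations 42.2 and 42.3 by solving both for the skew $t_i - t_j$. The sketch below does this for one register pair and then picks the midpoint of the range. Taking the midpoint is only one simple reading of "a skew value that allows a maximum variation"; it is an assumption here, not necessarily the selection rule of Ref. [22]. The function names and the numeric values are hypothetical.

```python
from typing import Optional, Tuple


def permissible_skew_range(period: float, dmax: float, dmin: float,
                           setup_j: float, hold_j: float,
                           clk2q_i: float) -> Optional[Tuple[float, float]]:
    """Bounds on the skew s = t_i - t_j for one local path FF_i -> FF_j."""
    lower = hold_j - clk2q_i - dmin            # from the hold constraint, Eq. 42.3
    upper = period - setup_j - clk2q_i - dmax  # from the setup constraint, Eq. 42.2
    return (lower, upper) if lower <= upper else None  # None: no valid skew


def centered_skew(skew_range: Tuple[float, float]) -> Tuple[float, float]:
    """Midpoint of the range and the variation tolerated on either side of it."""
    lower, upper = skew_range
    return 0.5 * (lower + upper), 0.5 * (upper - lower)


# Hypothetical path: delays between 3 and 8 time units, clock period of 10.
skew_range = permissible_skew_range(period=10.0, dmax=8.0, dmin=3.0,
                                    setup_j=0.5, hold_j=0.25, clk2q_i=0.5)
skew, margin = centered_skew(skew_range)
print(skew_range, skew, margin)
```

A skew chosen at the center of the range can drift by `margin` in either direction before a setup or hold violation occurs, which is the kind of tolerance the formulation above aims for.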
Finally, we briefly discuss two similar sequential optimization techniques, clock scheduling and retiming. They are, respectively, continuous and discrete optimizations with the same effect on minimizing the clock period [20]. The equivalence of the two techniques was studied in Ref. [24]. It is proved that there exists a retiming $R$ to achieve clock period $P$ if and only if there exists a clock schedule $S$ with the same clock period. However, the practical use of retiming is limited for two reasons. First, retiming has an adverse impact on the verification methodology. Second, using retiming for maximum performance often causes a steep increase in the number of registers. Clock scheduling does not have these two limitations. Another advantage of clock scheduling is that because retiming can only move registers across discrete amounts of logic delay, the resulting system after retiming can still benefit from clock scheduling.

42.5 HANDLING VARIABILITY

In minimizing skew sensitivity to process variations, two guiding principles are that the network should be as symmetrical and as fast as possible. In a clock network designed and laid out symmetrically, chipwide process (or environmental) variations should affect all clock paths identically. An additional advantage is that any systematic skew caused by modeling errors is eliminated by symmetry. In a fast network, because the clock phase delay is small, any fractional variation in delay leads only to a modest amount of skew. In addition, a clock network with optimal delay is the most tolerant to process variations. At the optimal delay point with respect to a certain parameter, the delay sensitivity over that parameter (i.e., the slope of the delay function) should be zero.

However, it is not trivial to apply these two principles in practice. Because of uneven load distribution and routing/buffer obstacles, it is usually impossible to construct a completely symmetrical network.
Moreover, minimizing the network delay may conflict with the optimization of some other metrics (e.g., skew, area, and power). Several important works on reliable clock network design under process variations are discussed below.

The concept of delay sensitivity is very useful in considering process variations. Pullela et al. [25] first made use of delay sensitivity with respect to wire width variations to improve the delay, skew, and skew sensitivity of a given clock tree by wire width optimization. The Elmore delay model and the L-type RC model for each branch are used in the paper, but the concept can be generalized to other models. Let $R_j$ be the resistance, $C_j$ be the capacitance, and $C_{dj}$ be the downstream capacitance of branch $j$. Let $U(i)$ be the set of all branches on the path from sink $i$ to the root. Then the Elmore delay from the root to sink $i$ is $T_{di} = \sum_{j \in U(i)} R_j C_{dj}$. Therefore, the sensitivities of the Elmore delay of sink $i$ with respect to circuit parameters $C_j$ and $R_j$ are

$$\frac{\partial T_{di}}{\partial C_j} = R_{cij} \qquad (42.4)$$

$$\frac{\partial T_{di}}{\partial R_j} = \begin{cases} C_{dj} & \text{if } j \in U(i) \\ 0 & \text{otherwise} \end{cases} \qquad (42.5)$$

where $R_{cij}$ is the total resistance along the common path from sink $i$ to the root and branch $j$ to the root. $R_j$ and $C_j$ can be expressed as functions of the width $w_j$ of branch $j$ as $R_j = R_0 L_j / w_j$ and $C_j = C_a L_j w_j + C_f L_j$, where $R_0$, $C_a$, and $C_f$ are technology parameters independent of $w_j$, and $L_j$ is the length of branch $j$. Therefore, the delay sensitivity of sink $i$ to width $w_j$ is

$$\frac{\partial T_{di}}{\partial w_j} = \frac{\partial T_{di}}{\partial C_j} \frac{\partial C_j}{\partial w_j} + \frac{\partial T_{di}}{\partial R_j} \frac{\partial R_j}{\partial w_j} = \frac{\partial T_{di}}{\partial C_j} C_a L_j - \frac{\partial T_{di}}{\partial R_j} \frac{R_0 L_j}{w_j^2}$$

By incremental computation as described in Ref. [26], Equations 42.4 and 42.5 for all $i$ and $j$ can be computed in $O(n^2)$ time for a tree with $n$ sinks.∗ Hence, the delay sensitivities for all sinks to all branch widths can also be found in $O(n^2)$ time. In Ref.
[25], a greedy heuristic is proposed to iteratively increase the widths to improve delay, skew, and skew sensitivity. The selection of the branch to widen in each step is based on the delay sensitivities, which give the delay change of each sink when widening a branch. In particular, they argued that wire widening is a better method for delay balancing than wire elongation, as widening generally reduces skew sensitivity whereas elongation increases it.

Lu et al. [27] formulated the minimizing skew violation (MinSV) problem to construct a clock tree considering wire width variation due to process variations. Given the range of permissible skew for each pair of clock sinks, they tried to find a clock routing tree such that the maximum skew violation among all pairs of sinks is minimized under wire width variation. The way they construct the tree follows the framework of the DME algorithm. Because of wire width variation, the skew between a sink pair becomes a range rather than a unique value. To maximize the safety margin against process variations, in the bottom-up stage they chose the merging segment for the tapping point such that the center of the skew range of the most critical sink pair coincides with the center of the permissible range for this sink pair. Besides improving process variation tolerance, they also proposed an algorithm to minimize wirelength when there is no skew violation under wire width variation.

Recently, Rajaram et al. [28] proposed to insert cross links in a given clock tree to improve its skew sensitivity. Like the grid and the spine structures, the cross links equalize the delay of different points by connecting them together. Such an approach can tolerate both process and environmental variations. Moreover, because the cross links are selectively inserted based on the trade-off between skew sensitivity reduction and extra wire usage, this approach can achieve significant skew sensitivity reduction with little increase in wirelength.
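To make the delay sensitivities of Equations 42.4 and 42.5 concrete, the following sketch evaluates the Elmore delay and its sensitivity to a branch width on a tiny tree, under the L-type branch model described earlier. It is a naive per-pair evaluation, not the $O(n^2)$ incremental scheme of Ref. [26] or the adjoint analysis of Ref. [25]. The function names, the `parent` array encoding (with -1 marking a branch attached to the root), the fixed sink loads, and all numeric values are hypothetical.

```python
from typing import List, Tuple


def path_to_root(parent: List[int], j: int) -> List[int]:
    """Branches on the path from branch j up to the root (parent == -1)."""
    path = []
    while j != -1:
        path.append(j)
        j = parent[j]
    return path


def branch_rc(L, w, load, R0, Ca, Cf) -> Tuple[List[float], List[float]]:
    """R_j = R0*L_j/w_j and C_j = Ca*L_j*w_j + Cf*L_j, plus a fixed sink load."""
    R = [R0 * l / x for l, x in zip(L, w)]
    C = [Ca * l * x + Cf * l + c for l, x, c in zip(L, w, load)]
    return R, C


def downstream_caps(parent: List[int], C: List[float]) -> List[float]:
    """Cd_j: capacitance of branch j plus that of every branch below it."""
    Cd = [0.0] * len(C)
    for j, c in enumerate(C):
        for k in path_to_root(parent, j):  # C_j loads j and all its ancestors
            Cd[k] += c
    return Cd


def elmore_delay(parent, R, C, i) -> float:
    """T_di = sum over branches j in U(i) of R_j * Cd_j."""
    Cd = downstream_caps(parent, C)
    return sum(R[j] * Cd[j] for j in path_to_root(parent, i))


def width_sensitivity(parent, L, w, load, R0, Ca, Cf, i, j) -> float:
    """dT_di/dw_j assembled from Eqs. 42.4 and 42.5 by the chain rule."""
    R, C = branch_rc(L, w, load, R0, Ca, Cf)
    Cd = downstream_caps(parent, C)
    u_i = set(path_to_root(parent, i))
    dT_dC = sum(R[k] for k in path_to_root(parent, j) if k in u_i)  # Eq. 42.4
    dT_dR = Cd[j] if j in u_i else 0.0                              # Eq. 42.5
    return dT_dC * Ca * L[j] - dT_dR * R0 * L[j] / w[j] ** 2


# Hypothetical tree: trunk branch 0 feeds branches 1 and 2, each ending in a sink.
parent = [-1, 0, 0]
L = [100.0, 50.0, 80.0]
w = [2.0, 1.0, 1.0]
load = [0.0, 5.0, 5.0]
R0, Ca, Cf = 0.1, 0.02, 0.01

trunk = width_sensitivity(parent, L, w, load, R0, Ca, Cf, i=1, j=0)
sibling = width_sensitivity(parent, L, w, load, R0, Ca, Cf, i=1, j=2)
print(trunk, sibling)
```

In this example the sensitivity of sink 1 to the trunk width is negative (widening the shared trunk speeds the sink up), while its sensitivity to the sibling branch width is positive (widening branch 2 only adds capacitance behind the shared resistance). A greedy widening heuristic would rank candidate branches using numbers of this kind.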
The link insertion algorithm is improved in Ref. [29].

∗ In [25], an $O(n^2 \log n)$ algorithm by adjoint analysis is proposed.

REFERENCES
1. J. M. Rabaey, A. Chandrakasan, and B. Nikolić. Digital Integrated Circuits: A Design Perspective, 2nd edn. Prentice Hall, 2003.
2. H. Veendrick. Short-circuit dissipation of static CMOS circuitry and its impact on the design of buffer circuits. IEEE Journal of Solid-State Circuits, SC-19:468–473, August 1984.
3. P. J. Restle, T. G. McNamara, D. A. Webber, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petronick, B. L. Krauter, and B. D. McCredie. A clock distribution network for microprocessors. IEEE Journal of Solid-State Circuits, 36(5):792–799, May 2001.
4. M. Gowan, L. Biro, and D. Jackson. Power considerations in the design of the Alpha 21264 microprocessor. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 433–439, 1998.
5. D. R. Gonzales. Micro-RISC architecture for the wireless market. IEEE Micro, 19(4):30–37, 1999.
6. D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin. A clock power model to evaluate impact of architectural and technology optimizations. IEEE Transactions on Very Large Scale Integration Systems, 10(6):844–855, December 2002.
7. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh. Clock routing for high-performance ICs. In Proceedings of the ACM/IEEE Design Automation Conference, Orlando, FL, pp. 573–579, 1990.
8. J. Cong, A. B. Kahng, and G. Robins. Matching-based methods for high-performance clock routing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(8):1157–1169, August 1993 (DAC 1991).
9. R.-S. Tsay. An exact zero-skew clock routing algorithm.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 12(2):242–249, February 1993 (ICCAD 1991).
10. W. C. Elmore. The transient response of damped linear network with particular regard to wideband amplifiers. Journal of Applied Physics, 19:55–63, 1948.
11. M. Edahiro. Minimum skew and minimum path length routing in VLSI layout design. NEC Research and Development, 32(4):569–575, 1991.
12. T.-H. Chao, Y.-C. Hsu, and J.-M. Ho. Zero skew clock net routing. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 518–523, 1992.
13. K. D. Boese and A. B. Kahng. Zero-skew clock routing trees with minimum wirelength. In Proceedings of the IEEE International ASIC Conference, Rochester, NY, pp. 1.1.1–1.1.5, September 1992.
14. J. G. Xi and W. W.-M. Dai. Buffer insertion and sizing under process variations for low power clock distribution. In Proceedings of the ACM/IEEE Design Automation Conference, San Francisco, CA, pp. 491–496, 1995.
15. A. Vittal and M. Marek-Sadowska. Low-power buffered clock tree design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 965–975, September 1997 (DAC 1995).
16. S. Pullela, N. Menezes, J. Omar, and L. T. Pillage. Skew and delay optimization for reliable buffered clock trees. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 556–562, 1993.
17. B. J. Benschneider, A. J. Black, W. J. Bowhill, S. M. Britton, D. E. Dever, D. R. Donchin, R. J. Dupcak, R. M. Fromm, M. K. Gowan, P. E. Gronowski, M. Kantrowitz, M. E. Lamere, S. Mehta, J. E. Meyer, R. O. Mueller, A. Olesin, R. P. Preston, D. A. Priore, S. Santhanam, M. J. Smith, and G. M. Wolrich. A 300-MHz 64-b quad-issue CMOS RISC microprocessor. IEEE Journal of Solid-State Circuits, 30(11):1203–1214, November 1995 (ISSCC 1995).
18. Shen Lin and C. K. Wong. Process-variation-tolerant clock skew minimization.
In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 284–288, 1994.
19. H. Su and S. S. Sapatnekar. Hybrid structured clock network construction. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 333–336, 2001.
20. J. P. Fishburn. Clock skew optimization. IEEE Transactions on Computers, 39(7):945–951, July 1990.
21. J. Cong, A. B. Kahng, C.-K. Koh, and C.-W. A. Tsao. Bounded-skew clock and Steiner routing. ACM Transactions on Design Automation of Electronic Systems, 3(3):341–388, 1998 (ICCAD 1995).
22. J. L. Neves and E. G. Friedman. Optimal clock skew scheduling tolerant to process variations. In Proceedings of the ACM/IEEE Design Automation Conference, Las Vegas, NV, pp. 623–628, 1996.
23. K. Ravindran, A. Kuehlmann, and E. Sentovich. Multi-domain clock skew scheduling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, pp. 801–808, 2003.
24. L.-F. Chao and E. H.-M. Sha. Retiming and clock skew for synchronous systems. In Proceedings of the IEEE International Symposium on Circuits and Systems, London, England, pp. 283–286, 1994.
25. S. Pullela, N. Menezes, and L. T. Pillage. Reliable non-zero skew clock trees using wire width optimization. In Proceedings of the ACM/IEEE Design Automation Conference, Dallas, TX, pp. 165–170, 1993.
26. C.-P. Chen and D. F. Wong. A fast algorithm for optimal wire-sizing under Elmore delay model. In Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 4, Atlanta, GA, pp. 412–415, 1996.
27. B. Lu, J. Hu, G. Ellis, and H. Su. Process variation aware clock tree routing. In Proceedings of the International Symposium on Physical Design, Monterey, CA, pp. 174–181, 2003.
28. A. Rajaram, J. Hu, and R. Mahapatra.
Reducing clock skew variability via cross links. In Proceedings of the ACM/IEEE Design Automation Conference, Anaheim, CA, pp. 18–23, 2004.
29. A. Rajaram, D. Z. Pan, and J. Hu. Improved algorithms for link-based non-tree clock networks for skew variability reduction. In Proceedings of the International Symposium on Physical Design, San Francisco, CA, pp. 55–62, 2005.

Alpert/Handbook of Algorithms for Physical Design Automation AU7242_C043 Finals Page 897 24-9-2008 #2

43 Practical Issues in Clock Network Design
Chris Chu and Min Pan

CONTENTS
43.1 IBM S/390
43.2 IBM Power4
43.3 Alpha 21264
43.4 Intel Pentium II
43.5 Intel Pentium III
43.6 Intel Pentium 4
43.7 Intel Itanium
43.8 Intel Itanium 2
References

In this chapter, we present the clock network designs of several high-performance microprocessors to illustrate how the basic techniques presented in Chapter 42 are applied in practice. We focus on the clock network design of high-performance microprocessors as the stringent slew requirements make the design most challenging. Some useful discussions on practical issues in clock network design can also be found in Bindal and Friedman [1], Zhu [2], and Rusu [3].

Section | Processor | Year/Main Reference | Process (nm) | Clock Frequency (MHz) | Area (mm^2) | Number of Transistors (M) | Clock Topology | Deskew | Skew (ps)
------- | --------- | ------------------- | ------------ | --------------------- | ----------- | ------------------------- | -------------- | ------ | ---------
43.1 | IBM S/390 | 1997 [4] | 200 (Leff) | 400 | 300 | 7.8 | Tree | No | 30
43.2 | IBM Power4 | 2002 [5] | 180 SOI | 1300 | | 174 | Tree driving single grid | No | 25
43.3 | Alpha 21264 | 1998 [6] | 350 | 600 | | 15.2 | Hierarchical grids | No | 65
43.4 | Pentium II | 1997 [7] | 350 | 300 | 203 | 7.5 | 1 spine | No | 140
43.5 | Pentium III | 1999 [8] | 250 | 650 | 123 | 9.5 | 2 spines | Active | 15
43.6 | Pentium 4 | 2001 [9] | 180 | 2000 | 217 | 42 | 3 spines | Active | 16
 | | 2003 [10] | 90 | Up to 5000 | 109 | | 8 spines | | 10
43.7 | Itanium | 2000 [11] | 180 | 800 | | 25.4 | Tree driving grids | Active | 28
43.8 | Itanium 2 | 2002 [12] | 180 | 1000 | 421 | 25 | Tree driving trees | No | 62
 | | 2003 [13] | 130 | 1500 | 374 | 410 | Tree driving trees | Fuse based | 24
 | | 2005 [14] | 90 | 100–2500 | 596 | 1720 (dual core) | Hierarchical trees | Active | 10

The processors discussed in this chapter are summarized in the table above. (Some entries are left blank because the corresponding information cannot be found.)
43.1 IBM S/390

The design of a 400-MHz microprocessor for the IBM S/390 Enterprise Server Generation-4 system is described in Ref. [4]. The chip is fabricated in a 0.2-µm Leff CMOS technology with five layers of metal and tungsten local interconnect. The power supply is 2.5 V. The chip size is 17.35 mm × 17.30 mm with about 7.8 million transistors.

The clock distribution network uses a balanced tree design, which is suitable for the relatively low clock frequency. A single-phase clock is distributed from a phase-locked loop (PLL)/central clock buffer located near the center of the chip to all the latches inside the macros in three levels of hierarchy. The first two levels of clock distribution are in the form of balanced H-like trees, using primarily the top two metal layers. The first-level tree routes the global clock from the central clock buffer to nine sector buffers, as shown in Figure 43.1. The sector buffers repower the clock to all macros inside the sectors. There are 580 macro clock pins in the whole design.

FIGURE 43.1 First-level tree of the IBM S/390 clock distribution network. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

The clock propagation delay along the tree is balanced against macro input capacitance and RLC characteristics of the tree wires. Horizontal wiring of each tree is in low-resistance Metal 5 (M5) (with 4.8-µm pitch).
At various places along the tree, inductive coupling is reduced and the return path is improved by using power wires for shielding. Decoupling capacitors are incorporated into the central and sector buffers to reduce delta-I noise.

A clock wiring methodology was developed with custom routing and timing computer-aided design (CAD) tools. The detailed routing as well as the widths of all clock wires were optimized to minimize skew, mean delay, power, wiring tracks, and sensitivity to process variations. Three-dimensional (3D) modeling was performed using a full-wave electromagnetic field solver [15], and distributed RLC modeling was used for virtually every wire in all the trees during the design and tuning/optimization process [16]. A number of cases were analyzed, and the results were used to generate a combination of analytic models and lookup tables containing distributed RLC parameters for all clock geometries used. Each wire segment was represented by an equivalent circuit consisting of up to six RLC π-segments. Extensive simulations and wire width tuning [17] were done to guarantee low clock skew at macro pins. Typical simulated RLC delay of the first-level tree is 300 ps with 20 ps skew at the sector buffers. The sector buffer delay is 230 ps. Typical simulated RLC delay within sectors is 210 ps with 30 ps skew at the macros.

The last level of clock distribution is local to each macro. Figure 43.2 shows the clocking scheme within macros. From the macro pin, the clocks are wired to clock blocks. The overall target skew for this wire is under 20 ps. For large-area macros, multiple clock pins were used to reduce wirelength to clock blocks. The clock block generates local clocks that drive latches. The target skew for local clocks is under 50 ps. All macro-level wiring is done by hand for custom macros or with a place and route tool for synthesized macros.
For synthesized macros that had many latches, and therefore multiple clock blocks, a clock optimization tool was used that reassigned latches to clock blocks based on cell placement. This resulted in clock blocks driving latches that were placed closest to them. Macro layouts were extracted for R and C parasitics, and the extracted netlists were used to time the macros. This means that any skew in the last level of clock distribution was captured in that macro's timing abstraction.

FIGURE 43.2 Last/macrolevel clock distribution of IBM S/390. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

Figure 43.3 shows the measured waveforms of the central clock buffer output and clocks at ten points of the 580 macro pin locations (marked on Figure 43.1) driven by the second level clock tree. The measurement was performed using a novel electron-beam prober with a 20-ps time resolution on the top wiring layer. Because the chip was powered using a standard cantilever probe card in the electron-beam prober, the chip clock was run at low frequency to reduce power supply noise. Power supply noise during these measurements was measured to be less than 100 mV. The results indicate a mean delay of 740 ps and less than 30 ps skew from the central clock buffer to the macro pins.

FIGURE 43.3 Electron beam measured clock waveforms at macro pin locations marked on Figure 43.1. (From Webb, C.F. et al., J. Solid-State Circuits, 32, 1665, 1997. With permission.)

43.2 IBM POWER4

The clock distribution of a 1.3-GHz Power4 microprocessor is described in Refs. [5,18].
The chip is fabricated in the IBM 0.18-µm CMOS 8S3 SOI (silicon-on-insulator) technology with seven levels of copper wiring. It has 174 million transistors. The power supply is 1.6 V.

The microprocessor uses a single chip-wide clock domain, with no active or programmable skew-reduction circuitry. Having multiple domains would allow active/programmable deskewing and coarse clock gating, and could result in lower skew within each small domain. Inevitably, however, with multiple domains there is increased skew and uncertainty between domains. In addition, multiple clock domains complicate early- and late-mode timings, and degrade critical paths that cross multiple domain boundaries. Extensive simulations of the Power4 chip and test-chip hardware measurements support the simplifying decision to maintain a single-domain global clock grid for the entire chip, with no programmable or active deskewing.

The global clock distribution strategy is based on a topology using a number of tuned trees driving a single full-chip clock grid [19]. This strategy was developed with the goal of being applicable to a variety of high-performance server microprocessors. It has been previously used in three S/390 chips and three PowerPC chips [19]. The trees-driving-grid topology combines many of the advantages of both trees and grids. Trees have low latency, low power, minimal wiring track usage, and the potential for very low skew. However, without the grid, trees must often be rerouted whenever the locations of clock pins change, or when the load capacitance values change significantly. The grid provides a constant structure so that the trees and the grids they are driving can be designed early to distribute the clock near every location where it may be needed. The regular grid also allows simple regular tree structures. This is important as it facilitates the design of transmission line structures with well-controlled capacitance and inductance.
The grid reduces local skew by connecting nearby points directly. The tree wires are then tuned to minimize skew over longer distances.

The global clock distribution network of the 1.3-GHz Power4 chip is illustrated in Figure 43.4 using a 3D visualization showing all wire and buffer delays. In the network, a PLL near the center of the chip drives buffered H-trees, which are designed as symmetrically as possible. The H-trees drive the final set of 64 carefully placed sector buffers. Each sector buffer drives a tunable sector tree network, designed for minimum delay without length matching. These sector trees are tuned primarily by wire-width tuning. Then they all drive a single full-chip clock grid at 1024 evenly spaced points. From the global clock grid, a hierarchy of short clock routes completes the connection from the grid down to the individual local clock buffer inputs in the macros. There are 15,200 global clock pins. It is reported in Ref. [5] that the maximum skew measured at 19 places with picoprobes is 25 ps, and the maximum skew by picosecond imaging for circuit analysis (PICA) measurements from nine sector buffers is less than 18 ps.

FIGURE 43.4 3D visualization of the Power4 global clock distribution. (From Restle, P.J. et al., Proc. IEEE Intl. Solid-State Circuits Conf., 2002, pp. 144–145. With permission.)

43.3 ALPHA 21264

The clocking design of a 600-MHz Alpha 21264 microprocessor is presented in Ref. [6]. The chip is fabricated in a 0.35-µm CMOS process with six metal layers. Four metal layers (called M1 to M4) are for signals, one (between M2 and M3) is for a VSS reference plane, and one (above M4) is for a VDD reference plane.
It has 15.2 million transistors.

This microprocessor employs a hierarchical clock distribution scheme as illustrated in Figure 43.5. At the top level, there is a global clock grid called GCLK, which covers the entire die. Next, there are six major clock grids over certain execution units. At the bottom level, local clocks are generated as needed from any clock (global clock, major clocks, or other local clocks). Previous Alpha microprocessors used a single grid to distribute the global clock signal [20,21]. The hierarchical scheme is chosen for this microprocessor because of tighter skew constraints, the importance of clock power minimization, and the need for a flexible clocking methodology to solve local timing problems. The drawback is that skew management becomes much more complicated. State elements and clocking points exist from 0 to 8 stages past GCLK. The clock distribution network needs to be carefully designed based on rigorous and thorough timing verification.

The GCLK grid is driven by a global clock distribution network as shown in Figure 43.6. The network connects a PLL located in a corner of the chip to 16 distributed global clock drivers. The arrangement of global clock drivers, which resembles four windowpanes, achieves low skew by dividing the chip into regions, thus reducing the maximum distance from the drivers to the farthest loads. A windowpane arrangement also reduces sensitivity to process variation because each grid pane is redundantly driven from four sides. In general, distributing the drivers widely across the chip
