# An Efficient EPI and Energy Consumption of 32 bit ALU Using Shannon Theorem Based Adder Approach

C. Senthilpari, G. Ramanamurthy, P. Velrajkumar, Faculty of Engineering and Technology, Multimedia University, 75450 Jalan Ayer Keroh Lama, Melaka. Email: c.senthilpari@mmu.edu.my

**Abstract:** This paper proposes two full adder circuits: a mixed Shannon adder and a Shannon theorem based adder. The mixed adder uses MCIT for the sum operation and the Shannon technique for the carry, whereas the second adder is designed entirely using the Shannon theorem. The ALU circuit consists of AND, OR, multiplexer and adder circuits that are designed using the proposed Shannon theorem approach. The 32-bit ALU circuits are analyzed with the BSIM 4 parameter analyzer. The total power dissipation, propagation delay and area are analyzed for the 32-bit ALU. The Binvert Shannon adder based ALU circuit gives better performance in terms of power dissipation, propagation delay, and throughput than the bit-slice ALU circuit.

Key Words: ALU, MCIT, Shannon theorem, CPL, Power Dissipation, Throughput.

### **1. INTRODUCTION**

Reductions in power dissipation and improvements in speed require optimisation at all levels of the design. The proper circuit style and methodology are important factors to be considered for low power design. High-speed computation is now the expected norm for the average user [1]. Power dissipation has become a critical design metric for an increasingly large number of Very Large Scale Integration (VLSI) circuits [2]. Power reduction starts from the core components of the circuit. In this paper, the adder is the main component of the Arithmetic Logic Unit (ALU) circuit, which is designed using two adder techniques. The first adder uses the Shannon theorem for the carry and the Multiplexing Control Input Technique (MCIT) for the sum, and is known as the Mixed Shannon adder [3]. In the second approach, both sum and carry are implemented using the Shannon theorem. Although the number of transistors and the area increase with the Shannon technique, the critical path is reduced compared with the mixed Shannon circuit, owing to the regular arrangement of the transistor tree and the smaller number of critical components in the structure. The Shannon based ALU gives lower power dissipation and lower delay than the mixed Shannon technique [3].

The operations performed by an ALU are controlled by a set of operation-select inputs. In this paper, we have designed a Shannon adder based 32-bit ALU with 4 operation-select inputs, $S_0$, $S_1$, $S_2$ and $S_3$. Logical operations take place on the bits that comprise a value, while arithmetic operations treat inputs and outputs as 2's complement integers. The ALU circuits are simulated and layouts are generated using the DSCH3 and Microwind 3 VLSI CAD tools. The simulated layouts are analyzed for power dissipation, propagation delay and area. The parameter analysis was performed with the BSIM 4 analyzer.

## 2. DESIGN METHODS

The ALU circuit consists of AND, OR, multiplexer and adder/subtractor circuits that are designed using reduction of Boolean identities and the Multiplexing Control Input Technique (MCIT). MCIT refers to a CMOS-type logic family that possesses certain advantages. Complementary Pass-transistor Logic (CPL) uses pass transistors to select between possible inverted output values of the logic. The output node drives an inverter to generate the non-inverted output signal. Since inverted and non-inverted inputs are needed to drive the gates of the pass transistors, the complement of the logic also needs to exist to select between the possible non-inverted output values and then to drive an inverter to generate an inverted version of the output. The two inputs (control input and actual input) of the pass transistor circuit are supplied through the source and gate terminals of the transistor [4]. MCIT can assume any input as a control signal [5]. In this paper, the logic gates are designed using pass transistors, reduction of Boolean identities and MCIT. The inputs are activated according to the Boolean identities.

The design of the 2:1 MUX requires two inputs A and B, as shown in Fig. 2(a). The data inputs are supplied through the source terminals of the pass transistors, whereas the control inputs are applied to their gates. The multiplexer circuit has two possible inputs; the idea is basically to label these two choices with as few bits as possible [6]. When the control input C = 0, A is directed to the output of the 2:1 MUX. When C = 1, B is directed to the output. The 4:1 multiplexer design requires 13 transistors. The worst propagation delay occurs between a selection input and an output signal. The critical path of the new multiplexer is composed of two inverters and one double control tri-state buffer. Due to the increased number of transistors in series, careful transistor sizing is necessary in order to minimise the delay of the double control tri-state buffer in this multiplexer [7]. The operation of the new circuit can be described as follows: when both select inputs are forced low, buffer 1 is on, whereas buffers 2 and 3 are tri-stated, leading to the transfer of input $I_1$ to the output. When $S_0$ and $S_1$ are forced low and high, respectively, buffer 3 transmits $I_1$ to the output. Finally, when $S_0$ is high, the $I_2$ input is transferred to the output independently of the status of the $S_1$ control line. The multiplexer can thus be realized as a 3:1 multiplexer standard cell using tri-state buffers, without significantly changing the critical path with respect to that of a 2-input multiplexer. Our MCIT logic gates and 2:1 MUX circuits contain 4 transistors each and the 4:1 MUX circuit contains 13 transistors, whereas the conventional logic gates and 2:1 MUX circuits contain 6 transistors each [2]. The conventional 4:1 MUX circuit contains 15 transistors. In short, MCIT saves 2 transistors in each circuit relative to its conventional counterpart.
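The selection behaviour of the 2:1 MUX, and the idea that one input of a gate can serve as the control signal, can be illustrated with a small behavioural model. This is only a sketch in Python, not the transistor-level MCIT circuit: the function names `mux2`, `and_gate` and `or_gate` are hypothetical, and the AND/OR mappings are the standard multiplexer identities, assumed here for illustration.

```python
def mux2(c: int, a: int, b: int) -> int:
    """2:1 multiplexer: output is a when control c = 0, b when c = 1."""
    return b if c else a


def and_gate(a: int, b: int) -> int:
    # Input A acts as the control: A = 0 passes constant 0, A = 1 passes B.
    return mux2(a, 0, b)


def or_gate(a: int, b: int) -> int:
    # Input A acts as the control: A = 0 passes B, A = 1 passes constant 1.
    return mux2(a, b, 1)


# Exhaustive check against the Boolean definitions.
for a in (0, 1):
    for b in (0, 1):
        assert and_gate(a, b) == (a & b)
        assert or_gate(a, b) == (a | b)
```

In this model the control input only steers which data value reaches the output, which mirrors the description of data applied at the source and control applied at the gate.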

#### 2.1 Architecture of the Proposed Mixed Shannon Adder Cell

The expansion and minimization of a Shannon function may be achieved using multiplexers. A multiplexer takes $n$ select inputs and $2^n$ data inputs, and gives one output. Once any Boolean function has been expanded about any number of variables, the variables about which it was expanded may be used as the select inputs, and their respective expanded sub-functions (cofactors) as the corresponding data inputs. Because of this, the arrangement of transistors is the same in every branch of the tree, as shown in Fig. 1(a). The proposed full Shannon based adder circuit is designed using the Shannon theorem. Input B and its complement are used as the control signals of the sum circuit. The two-input XOR gate is developed using the multiplexing method. According to the standard full adder equation, the sum circuit needs three inputs. To avoid increasing the number of transistors when the third input is added, the following arrangement is made: the CPL XOR gate output is multiplied (ANDed) with the complement of input C and the XNOR gate output is multiplied with input C, thereby reducing the number of transistors in the sum circuit. Compared with our mixed Shannon adder [3], shown in Fig. 1(b), this arrangement increases the number of transistors but avoids the critical path delay. The C and $\overline{C}$ output node is called the differential node of the circuit. Two complementary inputs (C and B) in the full adder carry circuit are used to balance the circuit and to avoid floating wires. The Shannon theorem based adder circuit is slightly better than CMOS and CPL full adder circuits because of its shorter critical path and the regular arrangement of its transistor tree structure.
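The Shannon expansion used above can be checked at the logic level with a short Python sketch. It is a behavioural illustration only, not the transistor netlist of Fig. 1(a): the helper `shannon_expand` and the function names are hypothetical, and `sum_about_c` encodes the arrangement described in the text (XOR selected by the complement of C, XNOR selected by C).

```python
from itertools import product


def shannon_expand(f, var_index):
    """Rebuild f as x'.f(x=0) + x.f(x=1), expanding about one input."""
    def expanded(*bits):
        x = bits[var_index]
        lo = list(bits)
        lo[var_index] = 0          # cofactor with the variable tied to 0
        hi = list(bits)
        hi[var_index] = 1          # cofactor with the variable tied to 1
        return ((1 - x) & f(*lo)) | (x & f(*hi))
    return expanded


def full_sum(a, b, c):
    return a ^ b ^ c


def full_carry(a, b, c):
    return (a & b) | (b & c) | (a & c)


def sum_about_c(a, b, c):
    """Sum as described in the text: C'.(A xor B) + C.(A xnor B)."""
    xor_ab = a ^ b
    xnor_ab = 1 - xor_ab
    return ((1 - c) & xor_ab) | (c & xnor_ab)


# The expansions reproduce the full adder truth table for all 8 input rows.
for a, b, c in product((0, 1), repeat=3):
    assert shannon_expand(full_sum, 1)(a, b, c) == full_sum(a, b, c)
    assert shannon_expand(full_carry, 1)(a, b, c) == full_carry(a, b, c)
    assert sum_about_c(a, b, c) == full_sum(a, b, c)
```

Expanding about input B (index 1) matches the use of B and its complement as control signals; any other input could be chosen and the check would still pass.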



Fig. 1: (a) Proposed Shannon adder, (b) mixed Shannon adder, (c) 1-bit ALU, bit-slice method, (d) 1-bit ALU, Binvert method, (e) 32-bit ALU, (f) 1-bit ALU Binvert Shannon timing diagram

#### 2.2 ALU Design

In this paper, we have used two design methods for the ALU circuit, the ALU bit slice method and the Binvert ALU method, which are shown in Fig. 1(c) and Fig. 1(d). The inputs of the designed circuit represent binary numbers, and it is often easier to consider a circuit designed for a single pair of bits rather than for the entire binary number [8]. The advantage of this approach is that it is easy to add new operations to the instruction set simply by associating an operation with a multiplexer control code, provided that the MUX has sufficient capacity; otherwise, new data lines must be added to the MUX(es), and the CPU must be modified to accommodate these changes. As a result, the ALU consists of 32 MUXes, arranged in parallel, which send the output bits from each operation to the ALU output [9]. The 32-bit ALU can be constructed from the one-bit ALU by using buffers and chaining the carry bits, such that CarryIn<sub>i+1</sub> = CarryOut<sub>i</sub>, as shown in Fig. 1(e). This yields a composite ALU with two 32-bit input vectors *a* and *b*, whose i<sup>th</sup> bits are denoted by a<sub>i</sub> and b<sub>i</sub>, where i = 0 to 31. The result is also a 32-bit vector, and there are two control buses, one for Binvert and one for selecting the operation. There is one Carry Out bit. The 1-bit ALU bit slice Shannon timing diagram is shown in Fig. 1(f).
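The construction of the 32-bit ALU from 1-bit slices with a chained carry can be sketched behaviourally as follows. This is a minimal Python model under stated assumptions, not the simulated circuit: the operation names `"and"`, `"or"` and `"add"` stand in for the select codes, which the paper does not enumerate, and tying the carry-in of slice 0 to Binvert (so that Binvert with `"add"` performs 2's complement subtraction) is the conventional arrangement, assumed here.

```python
WIDTH = 32


def alu_1bit(a, b, carry_in, binvert, op):
    """One ALU slice: returns (result_bit, carry_out)."""
    b = b ^ binvert                               # Binvert optionally complements b
    s = a ^ b ^ carry_in                          # full adder sum
    carry_out = (a & b) | (carry_in & (a ^ b))    # full adder carry
    result = {"and": a & b, "or": a | b, "add": s}[op]   # output MUX
    return result, carry_out


def alu_32bit(a, b, op, binvert=0):
    """Chain 32 slices so that CarryIn of slice i+1 equals CarryOut of slice i."""
    carry = binvert   # assumption: carry-in of slice 0 is tied to Binvert
    result = 0
    for i in range(WIDTH):
        bit, carry = alu_1bit((a >> i) & 1, (b >> i) & 1, carry, binvert, op)
        result |= bit << i
    return result, carry


print(alu_32bit(5, 3, "add"))              # addition
print(alu_32bit(5, 3, "add", binvert=1))   # subtraction via Binvert
print(alu_32bit(0b1100, 0b1010, "and"))
```

The loop makes the worst-case behaviour visible: a carry generated in slice 0 may have to pass through all 32 slices before the result settles.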

#### **3. RESULTS AND DISCUSSION**

The inputs A[0:31] and B[0:31] are fed through a hexadecimal input pad. The output operation of the ALU is selected according to the MUX selection inputs. In the worst case, the parallel adder circuit requires the signal to travel through all 32 single-bit adders. Table I gives the simulated results of the AND, OR, 2:1 MUX and 4:1 MUX circuits and their conventional counterparts for CMOS feature sizes of 45nm, 65nm, 90nm, 120nm, 180nm, 0.25µm and 0.35µm, with corresponding supply voltages of 0.5V, 0.7V, 1V, 1.2V, 2V, 2.5V and 3.5V respectively. The logic gates and MUX circuits are designed using MCIT and conventional (CMOS) techniques. Our MCIT logic gates and 2:1 MUX circuits contain 4 transistors each and the 4:1 MUX circuit contains 13 transistors, whereas the conventional logic gates and 2:1 MUX circuits contain 6 transistors each. The conventional 4:1 MUX circuit contains 15 transistors. In short, MCIT saves 2 transistors in each circuit relative to its conventional counterpart. The Shannon theorem based adder circuit is slightly better than CMOS and CPL full adder circuits because of the regular arrangement of its transistor tree structure. The power dissipation, propagation delay and area are obtained from the simulated results, which show better performance. The simulations are run for varying feature sizes, with the supply voltage selected according to the feature size as given in Table I. Table II shows the simulated results for the ALU circuits designed by the bit slice method and the Binvert method using the mixed Shannon adder, the Shannon based adder and the CPL adder for different feature sizes. These two circuits are simulated with the supply voltage scaled to the corresponding feature size.

From Table III, it is observed that power dissipation and propagation delay are much lower in the Shannon based ALU circuits than in the mixed Shannon and CPL adder based ALU circuits because of the shorter critical path, but the price paid is a larger circuit area. Since the transistors in the Shannon based adder cell form a tree-like structure, the electrons can pass through the circuit with equal charge distribution. Due to this equal charge distribution, the critical path of the circuit is minimized.

TABLE I: POWER DISSIPATION, PROPAGATION DELAY, AREA, THROUGHPUT, LATENCY, EPI AND NUMBER OF TRANSISTORS OF THE AND, OR, 2:1 MUX AND 4:1 MUX CIRCUITS

| Circuit | Parameter  | 0.5V<br>45nm | 0.7V<br>65nm | 1.0V<br>90nm | 1.2V<br>120nm | 2V<br>180nm | 2.5V<br>0.25µm | 3.5V<br>0.35µm |
|---------|------------|--------------|--------------|--------------|---------------|-------------|----------------|----------------|
| AND     | PD (µW)    | 0.014        | 0.049        | 0.110        | 0.386         | 0.755       | 3.85           | 4.157          |
|         | Delay (ps) | 7            | 9            | 12           | 18            | 20          | 26             | 47             |
|         | Area (µm²) | 28           | 32           | 50           | 66            | 220         | 350            | 880            |
| OR      | PD (µW)    | 0.044        | 0.055        | 0.416        | 0.942         | 0.997       | 1.799          | 6.6            |
|         | Delay (ps) | 8            | 12           | 18           | 20            | 24          | 30             | 54             |
|         | Area (µm²) | 28           | 32           | 50           | 66            | 220         | 350            | 880            |
| 2:1 MUX | PD (µW)    | 0.021        | 0.038        | 0.165        | 0.293         | 0.782       | 1.293          | 1.586          |
|         | Delay (ps) | 5            | 6            | 8            | 14            | 16          | 22             | 26             |
|         | Area (µm²) | 28           | 32           | 50           | 66            | 220         | 350            | 880            |
| 4:1 MUX | PD (µW)    | 0.023        | 2.56         | 3.37         | 6.87          | 8.08        | 16.78          | 25.18          |
|         | Delay (ps) | 6            | 23           | 25           | 42            | 128         | 293            | 332            |
|         | Area (µm²) | 54           | 96           | 126          | 221           | 638         | 1440           | 3000           |

TABLE II: 1-BIT PROPOSED SHANNON BASED, CPL AND MIXED SHANNON ADDER CELLS BASED 1-BIT ALU CIRCUITS: POWER DISSIPATION, PROPAGATION DELAY AND AREA

| ALU type  | Adder type    | Parameter  | 0.5V<br>45nm | 0.7V<br>70nm | 1.0V<br>90nm | 1.2V<br>120nm | 2V<br>180nm | 2.5V<br>0.25µm | 3.5V<br>0.35µm |
|-----------|---------------|------------|--------------|--------------|--------------|---------------|-------------|----------------|----------------|
| Bit slice | Shannon       | Power (µW) | 0.136        | 0.176        | 0.62         | 1.253         | 2.12        | 20.24          | 33.12          |
|           |               | Delay (ps) | 44           | 47           | 70           | 130           | 165         | 265            | 307            |
|           |               | Area (µm²) | 31x8         | 43x11        | 45x13        | 54x16         | 112x31      | 140x39         | 224x62         |
|           | Mixed Shannon | Power (µW) | 0.721        | 0.75         | 0.888        | 1.333         | 5.4         | 23.52          | 45.76          |
|           |               | Delay (ps) | 71           | 23           | 402          | 71            | 301         | 71             | 330            |
|           |               | Area (µm²) | 26x7         | 37x10        | 39x12        | 47x14         | 96x28       | 120x35         | 192x56         |
|           | CPL           | Power (µW) | 0.4          | 0.871        | 1.8          | 1.852         | 6.96        | 31.62          | 45.6           |
|           |               | Delay (ps) | 118          | 90           | 80           | 103           | 420         | 77             | 312            |
|           |               | Area (µm²) | 29x8         | 40x11        | 42x13        | 50x16         | 104x31      | 130x31         | 208x62         |
| Binvert   | Shannon       | Power (µW) | 0.434        | 1.145        | 1.115        | 1.173         | 7.52        | 25.38          | 46.8           |
|           |               | Delay (ps) | 118          | 114          | 236          | 95            | 236         | 165            | 425            |
|           |               | Area (µm²) | 30x8         | 42x11        | 44x13        | 53x16         | 110x31      | 138x39         | 220x62         |
|           | Mixed Shannon | Power (µW) | 0.983        | 2.96         | 3.334        | 0.807         | 5.64        | 35.41          | 49.84          |
|           |               | Delay (ps) | 79           | 119          | 24           | 71            | 236         | 166            | 395            |
|           |               | Area (µm²) | 26x7         | 36x10        | 38x18        | 46x14         | 94x12       | 118x35         | 188x56         |
|           | CPL           | Power (µW) | 0.622        | 3.92         | 0.584        | 0.906         | 8.52        | 48.6           | 77.48          |
|           |               | Delay (ps) | 118          | 47           | 520          | 213           | 118         | 331            | 142            |
|           |               | Area (µm²) | 28x8         | 39x11        | 41x17        | 49x16         | 102x31      | 128x39         | 204x62         |

TABLE III: 180nm 32-BIT ALU

| ALU type         | Adder   | Power Dissipation (mW) | Delay (ps) | Area (µm²) | Throughput (Gbits/s) | Latency (ns) |
|------------------|---------|------------------------|------------|------------|----------------------|--------------|
| Bit slice method | Shannon | 4.4278                 | 434        | 946x106    | 2.109                | 14.137       |
|                  | MS      | 6.6172                 | 459        | 965x74     | 2.173                | 14.806       |
|                  | CPL     | 33.160                 | 506        | 964x105    | 1.824                | 17.499       |
| Binvert method   | Shannon | 2.9899                 | 465        | 967x102    | 1.979                | 16.130       |
|                  | MS      | 7.5758                 | 498        | 964x73     | 1.837                | 17.378       |
|                  | CPL     | 28.474                 | 728        | 966x102    | 1.290                | 24.746       |

TABLE IV: COMPARISON OF DELAY AND EPI OF OUR PROPOSED CIRCUITS

| Author                 | Year | Feature size  | ALU type     | Delay (ps) | % of reduction | EPI        | % of reduction |
|------------------------|------|---------------|--------------|------------|----------------|------------|----------------|
| Our proposed technique |      |               | Shannon      | 434        |                | 221.779 pJ |                |
|                        |      |               | MS           | 459        |                | 221.130 pJ |                |
|                        |      |               | CPL          | 506        |                | 222.427 pJ |                |
|                        |      |               | Shannon      | 465        |                | 223.076 pJ |                |
|                        |      |               | MS           | 498        |                | 221.130 pJ |                |
|                        |      |               | CPL          | 728        |                | 221.779 pJ |                |
| Chatterjee [8]         | 2005 | 180nm         | 32 bit       | 680        | 36.17%         |            |                |
| Mark [13]              | 2002 | 180nm         | 32 bit       | 500        | 13.2%          |            |                |
| Murali [12]            | 2006 | 180nm<br>65nm | i486         |            |                | 10 nJ      | 97.78%         |
|                        |      |               | Pentium      |            |                | 14 nJ      | 98.41%         |
|                        |      |               | Pentium Pro  |            |                | 24 nJ      | 99.07%         |
|                        |      |               | Pentium 4    |            |                | 38 nJ      | 99.41%         |
|                        |      |               | Pentium 4(C) |            |                | 48 nJ      | 99.53%         |
|                        |      |               | Pentium 4(D) |            |                | 15 nJ      | 98.52%         |
|                        |      |               | Core Duo     |            |                | 11 nJ      | 97.98%         |

The layouts of the two 32-bit ALU circuits are simulated using a standard feature size of 180nm. The simulated bit slice Shannon theorem based ALU circuit uses 1632 transistors. The proposed 32-bit Shannon adder based ALU circuits give lower power dissipation than the mixed Shannon adder based and CPL adder based ALU circuits. The Shannon based ALU circuit gives 33.08% lower power consumption than the mixed Shannon and 86.64% lower than the CPL adder based ALU circuits. The propagation delay of our Shannon adder cell based ALU circuit, designed using the bit slice method, is approximately 5.4% lower than that of the mixed Shannon adder based ALU circuit and 14.22% lower than that of the CPL adder based ALU circuit. Similarly, our Shannon adder based ALU circuit dominates in terms of EPI, throughput and latency. The total chip area of our Shannon adder cell based ALU circuit is about 40.42% more than the mixed Shannon and approximately 1% more than the CPL adder circuits. The proposed Shannon adder based ALU circuits give better performance in terms of power, delay, EPI, throughput and latency, as shown in Table IV.
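The power, delay and mixed Shannon area percentages quoted above can be reproduced from the bit slice rows of Table III. The short Python sketch below assumes the usual relative-difference formula, which the paper does not state explicitly; the helper name `pct_reduction` is hypothetical. The computed values agree with the quoted figures to within rounding in the last digit.

```python
def pct_reduction(reference: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference


# Bit slice rows of Table III (180nm, 32-bit ALU).
power_mw = {"Shannon": 4.4278, "MS": 6.6172, "CPL": 33.160}
delay_ps = {"Shannon": 434, "MS": 459, "CPL": 506}
area_um2 = {"Shannon": 946 * 106, "MS": 965 * 74}

for other in ("MS", "CPL"):
    p = pct_reduction(power_mw[other], power_mw["Shannon"])
    d = pct_reduction(delay_ps[other], delay_ps["Shannon"])
    print(f"vs {other}: power -{p:.2f}%, delay -{d:.2f}%")

# A negative reduction is an increase: the Shannon layout is larger than MS.
area_increase = -pct_reduction(area_um2["MS"], area_um2["Shannon"])
print(f"area vs MS: +{area_increase:.2f}%")
```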

The proposed Shannon adder based ALU circuit gives 36.17% less delay than the circuit of Chatterjee [8], owing to its shorter critical path. It also shows a 13.2% reduction in delay relative to the circuit of Mark et al. [13]. Compared with the processors reported by Murali et al. [12], the reduction in EPI ranges from 97.78% to 99.53%.
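As a check on the figures taken from Table IV, the percentage reduction is assumed here to be the usual relative difference (the paper does not state the formula); the symbols $X_{\text{ref}}$ and $X_{\text{prop}}$ are introduced only for this illustration and denote the reference and proposed values of delay or EPI.

$$
\text{reduction} = \frac{X_{\text{ref}} - X_{\text{prop}}}{X_{\text{ref}}} \times 100\%
$$

$$
\frac{680 - 434}{680} \approx 36.2\%, \qquad
\frac{500 - 434}{500} = 13.2\%, \qquad
\frac{10\,\text{nJ} - 221.779\,\text{pJ}}{10\,\text{nJ}} \approx 97.78\%
$$

The first two values use the 434 ps delay of the bit slice Shannon ALU, and the third uses its EPI of 221.779 pJ against the 10 nJ entry for the i486.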

The parasitic capacitance of the 32-bit ALU circuit is analyzed against power dissipation and output current, as shown in Fig. 2(a) and Fig. 2(b). The transconductance of the 6 metal layer feature size (180nm) yields a total capacitance of 100fF/mm. The power dissipation is measured from input (V<sub>in</sub>) poly/poly2 to output (V<sub>out</sub>) poly4/poly6. Due to the equal tree structure, the leakage current is reduced; therefore the power consumption of the Shannon adder based ALU circuit is lower in both the bit slice and Binvert methods than in the other adder based ALU circuits. Supply voltage scaling is very effective in reducing power dissipation because of the quadratic dependence of switching power on supply voltage and the linear dependence of sub-threshold leakage. In the two proposed adders, a critical path replica is used to predict the performance, while body biasing is used to tune the threshold voltage of the actual circuit [11]. This is because of the storage capacitor, called C<sub>store</sub>, appearing at the output. Parasitic capacitance always exists due to the diffusion areas of the p-channel and n-channel MOS devices. However, C<sub>store</sub> includes a supplementary capacitor connected to node V<sub>in</sub>, with a capacitance value sufficiently high to counterbalance the effects of leakage currents. The supply voltage can be scaled down (while maintaining the clock frequency) to utilize the timing slack available between the critical paths and the longest off-critical paths. The off-critical paths are evaluated at the rated frequency, while the infrequent critical paths are evaluated in two clock cycles. Because of the shorter critical path in the proposed technique, the ALU has lower power dissipation and lower leakage current. Figs. 2(c) and 2(d) show the minimum power dissipation and low leakage current in the proposed circuits.
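The dependence of power on supply voltage mentioned above can be written in the standard first-order form. This is the textbook CMOS power model, given here as an assumption for illustration rather than as the model used in the simulations; $\alpha$ (switching activity), $C_L$ (switched load capacitance), $f$ (clock frequency) and $I_{\text{leak}}$ (leakage current) are symbols introduced for this sketch.

$$
P_{\text{total}} \approx \underbrace{\alpha\, C_L\, V_{DD}^{2}\, f}_{\text{switching}} + \underbrace{I_{\text{leak}}\, V_{DD}}_{\text{leakage}}
$$

Under this model, halving $V_{DD}$ at fixed $f$ reduces the switching term to one quarter, while the leakage term is halved for a fixed $I_{\text{leak}}$.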



Fig. 2: (a) 32-bit ALU capacitance versus power dissipation (feature size 180nm), (b) 32-bit ALU capacitance versus output current, (c) supply voltage versus maximum power dissipation, (d) supply voltage versus operating current

#### Conclusion

The 32-bit ALU circuit was designed using the bit slice and Binvert methods, implemented with the proposed Shannon theorem based adder circuit, the mixed Shannon based adder circuit and the CPL based adder circuit. The simulated results were compared with other existing ALU circuits, and it was observed that the Shannon based adder ALU circuit gives lower power dissipation, lower propagation delay and higher throughput than the other adder based ALU circuits because of its shorter critical path. This Shannon based adder ALU circuit may be used in high performance, high speed multimedia application circuits.

#### References

- [1] Glasser, Lance A. and Dobberpuhl, Daniel W. (1985) "The Design and Analysis of VLSI Circuits", Addison-Wesley.
- [2] Najm, F. (1994) "A survey of power estimation techniques in VLSI circuits", IEEE Transactions on VLSI Systems, vol. 2, pp. 446-455.
- [3] C. Senthilpari, Ajay Kumar Singh and K. Diwakar (2008) "Design of a low power, high performance, 8x8 bit multiplier using a Shannon-based adder cell", Microelectronics Journal 39, pp. 812-821.
- [4] Ge Yang, Seong-Ook Jung, Kwang-Hyun Baek, Soo Hwan Kim, Suki Kim, and Sung-Mo Kang (2005) "A 32-Bit Carry Lookahead Adder Using Dual-Path All-N Logic", IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 8, pp. 992-996.
- [5] Markovic, D., B. Nikolic and V.G. Oklobdzija (2000) "A general method in synthesis of pass-transistor circuits", Microelectronics Journal 31, pp. 991-998.
- [6] Bui, H. T., Y. Wang, and Y. Jiang (2002) "Design and analysis of low-power 10-transistor full adders using novel XOR-XNOR gates", IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 49, no. 1, pp. 25-30.
- [7] Shen-Fu Hsiao, Ming-Yu Tsai, Ming-Chih Chen, and Chia-Sheng Wen (2005) "An Efficient Pass-Transistor-Logic Synthesizer Using Multiplexers and Inverters Only", IEEE International Symposium on Circuits and Systems, pp. 2433-2436.
- [8] Bhaskar Chatterjee and Manoj Sachdev (2005) "Design of a 1.7-GHz Low-Power Delay-Fault-Testable 32-b ALU in 180-nm CMOS Technology", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 11, pp. 1296-1304.
- [9] Kiseon Cho and Minkyu Song (2002) "Design Methodology of a 32-bit Arithmetic Logic Unit with an Adaptive Leaf-cell Based Layout Technique", VLSI Design, vol. 14 (3), pp. 249-258.
- [10] Gustavo A. Ruiz (1998) "Evaluation of three 32-bit CMOS adders in DCVS logic for self-timed circuits", IEEE Journal of Solid-State Circuits 33(4), pp. 604-613.
- [11] Swaroop Ghosh and Kaushik Roy (2008) "Exploring high-speed low-power hybrid arithmetic units at scaled supply and adaptive clock-stretching", ASPDAC 2008, pp. 635-640.
- [12] Ed Grochowski and Murali Annavaram (2006) "Energy per Instruction Trends in Intel Microprocessors", Technology Intel Magazine, March 2006, pp. 1-8.
- [13] Mark Anders, Sanu Mathew, Brad Bloechel, Scott Thompson, Ram Krishnamurthy, K. Soumyanath, and Shekhar Borkar (2002) "A 6.5GHz 130nm Single-Ended Dynamic ALU and Instruction-Scheduler Loop", IEEE International Solid-State Circuits Conference, 0-7803-7335-9.