# Transition Faults Detection in Bit Parallel Multipliers over $G F\left(2^{m}\right)$ 

Hafizur Rahaman<br>Bengal Engineering \& Science University, Shibpur<br>Howrah-711103, India<br>rahaman_h@it.becs.ac.in<br>Jimson Mathew<br>Computer Science Department, University of Bristol BS81UB, UK,<br>jimson@cs.bris.ac.uk<br>Ashutosh K. Singh<br>CS Dept., School of Engineering,<br>Curtin University of Technology, Malaysia<br>ashutosh.s@curtin.edu.my<br>Dhiraj K. Pradhan, IEEE/ACM Fellow<br>Computer Science Department, University of Bristol BS81UB, UK, pradhan@cs.bris.ac.uk<br>Biplab K. Sikdar<br>Bengal Engineering \& Science University, Shibpur<br>Howrah-711103, India<br>biplab@cs.becs.ac.in

Abstract: - In this article, a C-testable design for detecting transition faults in the polynomial basis (PB) bit parallel (BP) multiplier circuits over $G F\left(2^{m}\right)$ is discussed. For 100 percent transition fault coverage, the proposed technique requires only 10 vectors, irrespective of multiplier size, at the cost of 6 percent extra hardware. The proposed constant test vectors which are sufficient to detect both the transition and stuck-at faults in the multiplier circuits can be derived directly without any requirement of an ATPG tool. As the $\mathrm{GF}\left(2^{\mathrm{m}}\right)$ multipliers have found critical applications in public key cryptography and need secure internal testing, a Built-in Self-Test (BIST) circuit may be used for generating test patterns internally. This will obviate the need of having three extra pins for the control inputs and also provides public-key security in cryptography. Area and delay of the testable circuit are analyzed using Synopsys ${ }^{\circledR}$ tools with $0.18 \mu$ CMOS technology library from UMC.

Key-Words: - Transition fault, Galois field, multiplier, cryptography, error control code, VLSI testing.

## 1 Introduction

Failures that cause logic circuits to malfunction at the desired clock rate and violate timing specifications are modeled as delay faults. Gate delay $[9-10]$ and path delay fault model [11-12] have generally been used to model delay defects. Path delay fault model is more comprehensive in modeling delay defects, but often difficult to use in
practice due to large number of paths in large circuits. Gate delay fault model is more practical for large circuits. The most commonly used gate delay fault model is transition fault model [9] which is considered as a logical model for a defect that delays a rising or falling transition at inputs and outputs of logic gates. There are two kinds of transition faults: slow-to-rise and slow-to-fall. Slow-to-rise (fall) transition fault temporarily behaves like
a stuck-at-0 (1) fault. Testing of such faults requires two-pattern tests, each consisting of an initialization and a test vector. Several works have been reported on generating tests for transition faults [9-15]. Here, we present a C -testable technique for detecting transition faults in PB multipliers over $G F\left(2^{m}\right)$.
Arithmetic operations in finite or Galois fields of form $G F\left(2^{m}\right)$ have gained wide spread uses in public-key cryptography, error detecting and correcting code [2], VLSI testing [1], digital signal processing [17]. There are two basic arithmetic operations over finite fields: addition and multiplication. While addition over $G F\left(2^{m}\right)$ can be implemented with just $m$ 2-input EXOR gates, multiplication is much more complex. Note that other operations like exponentiation, division, and inversion over $G F\left(2^{m}\right)$ can be performed by repeated multiplications [1-5]. Various techniques exist for optimal design of multipliers over $G F\left(2^{m}\right)$ with respect to complexities, delay, and power. Most techniques focused on VLSI implementation and synthesis of these multipliers because VLSI implementations of these circuits are very complicated due to complex routing, non-modular architecture and low testability. The C-testable systolic array design for Galois-Field inversion was proposed in [16], which requires 32 vectors to detect the faults in the circuit and extra hardware overhead to achieve C-testability. A testable implementation of BP multiplier over $G F\left(2^{m}\right)$ which requires a function independent test set of length $(2 m+4)$ or detecting stuck-at faults in multipliers over $\operatorname{GF}\left(2^{m}\right)$ was reported in [8]. To date, testability issue of $G F$ multiplier has not fully been explored. In view of this fact, we present a C-testable design of PB multipliers over $G F\left(2^{m}\right)$. This design requires only 10 constant vectors to detect transitions faults as well as all the single stuck-at faults. This test length is lower than that generated by Synopsys ATPG tools (Tetramax ${ }^{T M}$ ). Finally, we analyse the area, delay, and power using $0.18 \mu$ CMOS library from UMC.

## 2. Preliminaries

Let $G F(N)$ denote a set of $N$ elements, where $N$ is a power of a prime number, with two special elements 0 and 1 representing the additive and multiplicative identities respectively, and two operators addition ' + ' and multiplication ' $\because G F(N)$ defines a finite field, if it forms a commutative ring with identity over these two operators in which every element has a multiplicative inverse. Finite fields can be
generated with primitive polynomials of the form $p(x)=x^{m}+\sum_{i=0}^{m-1} c_{i} x^{i}$, where $c_{i} \in G F(2)$ [5]. It is conventional to represent the elements of $G F\left(2^{m}\right)$ as a power of the primitive element $\alpha$ where $\alpha$ is the root of $P(x)$, i.e. $p(\alpha)=0$. The set $\left\{1, \alpha, \ldots, \alpha^{m-1}\right\}$ is referred to as the polynomial basis. Each element $A \in G F\left(2^{m}\right)$ can be expressed with respect to the PB as a polynomial of degree ( $m-1$ ) over $G F(2)$, i.e. $A(x)=\sum_{i=0}^{m-1} a_{i} x^{i}$ where $a_{\mathrm{i}} \in G F(2)$. Given $\mathrm{A}, \mathrm{B} \in G F\left(2^{m}\right)$, the PB multiplication over $G F\left(2^{m}\right)$ can be defined as $C(x)=A(x) \cdot B(x) \bmod P(x)$. Details can be found in [2,6].

Mastrovito has proposed an algorithm along with its hardware architecture for PB multiplication [5] popularly known as the Mastrovito algorithm/multiplier. A formulation for Polynomial basis multiplication and generalized bit-parallel hardware architecture for special reduction polynomials, namely: trinomials, equally spaced polynomials (ESPs), and two classes of pentanomials has been presented in [4]. This formulation is described below. Consider a multiplier with A and B inputs where $A=\left[a_{0}, a_{1}, a_{2}\right.$, $\left.\ldots, a_{m-1}\right]$ and $B=\left[b_{0}, b_{1}, b_{2}, \ldots, b_{m-1}\right]$. The $a_{i}$, and $b_{i}$ are the coordinates of $A$ and $B$ respectively where $0 \leq i \leq$ $m-1$. The multiplication outputs are given in the equation (1).
$c=d+Q^{T} e$
$d=L \times b$
$e=U \times b$
where $b=B^{T}=\left[b_{0}, b_{1}, b_{2}, \ldots, b_{m-1}\right]^{T}$, and
$L=\left[\begin{array}{ccccccc}a_{0} & 0 & & 0 & 0 & \ldots & 0 \\ a_{1} & a_{0} & & 0 & 0 & \ldots & 0 \\ a_{2} & a_{1} & & a_{0} & 0 & \ldots & 0 \\ \vdots & & \ddots & \ddots & & \vdots \\ a_{m-2} & a_{m-3} & \ldots & a_{1} & a_{0} & 0 \\ a_{m-1} & a_{m-2} & \ldots & a_{2} & a_{1} & & a_{0}\end{array}\right]$
$U=\left[\begin{array}{cccccc}0 & a_{m-1} & a_{m-2} & \ldots & a_{2} & a_{1} \\ 0 & 0 & a_{m-1} & \ldots & a_{3} & a_{2} \\ \vdots & \ddots & \ddots & & \vdots & \\ 0 & 0 & \ldots & 0 & \cdots & a_{m-1} \\ 0 & 0 & \ldots & 0 & \cdots & 0\end{array} a_{m-2}\right.$.

We have derived Q matrix for the multiplier over $G F\left(2^{4}\right)$ from the first principles in example 1. The architecture for this implementation is shown in Fig. 1. This structure is divided into two parts: the Inner Product (IP)-network and Q-network. Q-network part is also EXOR tree. The IP-network, which has $m$ blocks, generates $d$ and $e$. For $\{0 \leq i \leq m-2\}$, each block constitutes two inner product cells, namely, $I P(i+1)$ and $I P(m-i-1)$. However the last block constitutes only one such cell $I P(m)$. The Q-network takes $d$ and $e$ as inputs and generates $c$. It constitutes $m$ binary trees of EXOR gates ( $B T X_{o}, B T X_{l}, \ldots$, $\left.B T X_{m-1}\right)$. The inputs d and e feed to the BTX trees. For the multiplier structure shown in Fig. 1, the IPnetwork has a total of $m^{2}$ AND gates and $(m-1)^{2}$ EXOR gates. The maximum number of EXOR gates required for the Q-network depends on the Qmatrix. The multiplier structure is the multiple outputs Positive Polarity Reed-Muller (PPRM)-like form.


Fig. 1: Architecture of the BP multiplier over $G F\left(2^{* w}\right)$
Definition 1: A Boolean AND-EXOR function $F\left(x_{1}\right.$, $x_{2}, \ldots, x_{n}$ ) is in the PPRM if only positive polarity is allowed for each input variable, i.e. each variable appears in its uncomplemented form throughout. For example $F_{1}=x_{1} x_{2} \oplus x_{2} x_{3}$ is a PPRM. Several testable techniques for AND-EXOR circuits have appeared in [7].
Example 1: A multiplier structure over $G F(16)$ defined by the primitive polynomial $P(x)=x^{4}+x^{3}$ +1 is shown in Fig. 2.

The two inputs of the multiplier are $A=\left(a_{0}, a_{1}, a_{2}\right.$, $\left.a_{3}\right)$ and $B=\left(b_{0}, b_{1}, b_{2}, b_{3}\right)$. The polynomial representation of $G F\left(2^{4}\right)$ elements is as follows.
$\mathrm{A}(\mathrm{x})=a_{0}+a_{1} \mathrm{x}+a_{2} \mathrm{x}^{2}+a_{3} \mathrm{x}^{3}, \mathrm{~B}(\mathrm{x})=\mathrm{b}_{0}+b_{1} \mathrm{x}+b_{2} \mathrm{x}^{2}+$ $\mathrm{b}_{3} \mathrm{x}^{3}$, where $\mathrm{A}, \mathrm{B} \in \mathrm{GF}\left(2^{4}\right)$.

The product $C(x)=A(x) \times B(x)$.

Now, $C(x)=\left(a_{0}+a_{1} \mathrm{x}+a_{2} \mathrm{x}^{2}+a_{3} \mathrm{x}^{3}\right) \times\left(b_{0}+b_{1} x+b_{2} x^{2}\right.$ $\left.+b_{3} x^{3}\right)=a_{0} b_{0}+\left(a_{0} b_{1}+a_{0} b_{1}\right) x+\left(a_{0} b_{2}+a_{1} b_{1}+a_{2} b_{0}\right) x^{2}$ $+\left(a_{0} b_{3}+a_{1} b_{2}+a_{2} b_{1}+a_{3} b_{0}\right) \mathrm{x}^{3}+\left(a_{1} b_{3}+a_{2} b_{2}+\right.$ $\left.a_{3} b_{1}\right) \mathrm{x}^{4}+\left(a_{2} b_{3}+a_{3} b_{2}\right) \mathrm{x}^{5}+a_{3} b_{3} \mathrm{x}^{6}$.


Fig.2: Architecture of the BP multiplier over $\mathrm{GF}\left(2^{4}\right)$
Let us denote the lower order m coefficients as $d_{0}$, $d_{l}, \ldots, d_{m-1}$ and the higher order $\{m-l\}$ coefficients as $\mathrm{e}_{0}, e_{l}, \ldots, \mathrm{e}_{m-2}$. Then $C(x)$ can be expressed in equation (4).
$C(x)=d_{0}+d_{1} x+d_{2} x^{2}+d_{3} x^{3}+e_{0} x^{4}+e_{1} x^{5}+e_{2} x^{6}$
Here, we define product over the primitive polynomial $P(x)=\mathrm{x}^{4}+\mathrm{x}^{3}+1$ as $A(x) B(x) \bmod P(x)$. Hence, we have,
$x^{4}=x^{3}+1, x^{5}=x\left(x^{3}+1\right)=x^{4}+x=x^{3}+x+1, x^{6}=$ $x\left(x^{5}\right)=x\left(x^{3}+x+1\right)=x^{4}+x^{2}+x=x^{3}+1+x^{2}+x=$ $x^{3}+x^{2}+x+1$.
Substituting the power of $x^{4}, x^{5}, x^{6}$ and simplifying we get,
$A(x) B(x) \bmod P(x)=C$

$$
=\left(e_{0}+e_{1}+e_{2}+d_{0}\right)+\left(e_{1}+e_{2}+d_{1}\right)
$$

$x+\left(e_{2}+d_{2}\right) x^{2}+\left(e_{0}+e_{1}+e_{2}+d_{3}\right) x^{3}$
The above modulo reductions can be represented in the matrix form as given below.

$$
\begin{gathered}
{\left[\begin{array}{l}
c_{0} \\
c_{1} \\
c_{2} \\
c_{3}
\end{array}\right]=\left[\begin{array}{lll}
1 & 1 & 1 \\
0 & 1 & 1 \\
0 & 0 & 1 \\
1 & 1 & 1
\end{array}\right]\left[\begin{array}{l}
e_{0} \\
e_{1} \\
e_{2}
\end{array}\right]+\left[\begin{array}{l}
d_{0} \\
d_{1} \\
d_{2} \\
d_{3}
\end{array}\right]} \\
c=Q^{T} e+d, \text { where } e=\left[\begin{array}{l}
e_{0} \\
e_{1} \\
e_{2}
\end{array}\right] d=\left[\begin{array}{l}
d_{0} \\
d_{1} \\
d_{2} \\
d_{3}
\end{array}\right]
\end{gathered}
$$

We can also derive d , e from the equations (1), (2).

## 3 Proposed Technique

A test for a transition fault is a pair of input patterns, one known as initialisation vector to set up the initial state for the transition and another known as propagation or test vector to cause the appropriate transition and observe its effect at a primary output. The test vector is identical to a pattern that detects the corresponding stuck-at fault. The transition fault coverage is a measure of effectiveness of the delay test in detecting large delay variations. A test pair $<v_{1}, v_{2}>$ is required to detect the transition fault $f$ on a signal line. The initial vector $v_{l}$ must set the target node to an initial value $0[1]$ for slow-to-rise [slow-to-fall] fault. The test vector $\mathrm{v}_{2}$ has to launch the corresponding transition at the target node and also propagate the fault effect to the primary output. Thus, $\mathrm{v}_{2}$ is a test for $\mathrm{s}-\mathrm{a}-0$ [ $\left.\mathrm{s}-\mathrm{a}-1\right]$ fault if the transition fault is the slow-to-rise [slow-to-fall] fault. To achieve C-testability for detection of transition faults in Bit parallel GF multipliers, the multipliers architecture as shown in Fig. 1 has been augmented. The AND part of $I P$ network are modified with 3 control lines $k_{0}, k_{l}$ and $k_{2}$. All two inputs AND gates have been replaced by three inputs AND gates. The proposed design is shown in Fig.3.


Fig. 3. C-Testable GF Multiplier

Definition 2: A circuit is C-testable if it can be tested with a constant number of vectors independent of the circuit's complexity.

For the detection of the transition faults at any input node of EXOR gate, two transitions $0 \rightarrow 1$ and $1 \rightarrow 0$ are essential. An EXOR-tree of single output can be tested for all single transition faults by five ( $2 m+3$ )bit function-independent tests applied to the inputs of a single-input AND-EXOR circuit. Three control inputs $k_{0}, k_{1}, k_{2}$ are used to achieve this. This scheme will allow us to apply each of the two transitions
$(0 \rightarrow 1,1 \rightarrow 0)$ to the inputs of each 2 -input EXOR gate in the tree. This is based on the following observation. In Fig. 5a, the two sequence q: 01100 and $\mathrm{r}: 01010$ arriving at the two inputs to the last EXOR gate generate the output sequence $s$ : 00110 . Similarly, $q: 01100$ and $s: 00110$ arriving at the two inputs of an EXOR gate will generate the output sequence $r$ : 01010. Again, input sequences $r$ : 01010 and $s: 00110$ will generate $\mathrm{q}: 01100$ as output. There exist the following relations among the vectors $(q, r$, $s): \quad q \oplus r=s, q \oplus s=r, r \oplus s=q$. Applying sequence q: 01100 , two transitions i.e. $0 \rightarrow 1$ and $1 \rightarrow 0$ will be achieved at any input node of EXOR gate. For the sequence $\mathrm{r}: 01010$, the transitions $0 \rightarrow 1$ and $1 \rightarrow 0$ are achieved at any input node of EXOR gate. Again for sequence s: 00110 , two transitions $0 \rightarrow 1$ and $1 \rightarrow 0$ at the input node are generated. Applying these sequences $\mathrm{q}, \mathrm{r}, \mathrm{s}$, two transitions: $0 \rightarrow 1$ and $1 \rightarrow 0$ are achieved at every input node of the EXOR gate in the EXOR tree.


Fig. 4. Test vectors and responses in an EXOR-tree

Example 2: In the EXOR tree shown in Fig. 4, we assign sequence vectors $q, r, s, q, r, s, q, r, \ldots$ (by repeating the pattern $(q, r, s)$ to the inputs of the EXOR tree from left-to right until all of them are assigned. The outputs of the first level are propagated down to the root i.e. final output of the tree. Thus, each EXOR gate in the tree receives the desired input combination from the above five combinations. The five constant test vectors that are to be applied to the inputs of the tree of Fig. 4 are shown as a matrix $T_{\text {tree }}$. This matrix has five
(constant) rows and $y$ columns, where $y$ is the number of leaf nodes of the tree, and is equal to the number of AND outputs $\left(\mathrm{m}^{2}\right)$ in the multiplier circuit. The columns of the matrix, if seen from left-to-right, will correspond to the sequence vectors: $q$, $r, s, q, r, s, q, r$, and so on. The number of distinct columns in the matrix is only three (constant), regardless of the size of the tree.

Since the EXOR-tree is embedded in the overall design of the single output AND-EXOR circuits, the inputs of the tree are not directly accessible. In the IP network as shown in Fig.3, each AND output feeds an EXOR input. Hence, by applying the following five vectors $\mathrm{v}_{1}, \mathrm{v}_{2}, \mathrm{v}_{3}, \mathrm{v}_{4}, v_{5}$ to the primary inputs of Fig. 3, all the three sequences $q, r, s$ can be produced at the outputs of the AND-part.


Fig. 5. EXOR-tree with a control level

A control level with three-control inputs $k_{0}, k_{1}$ and $k_{2}$ as shown in Fig. 5. By setting these control inputs to 1 , the original function can be obtained. In this design, the AND outputs are partitioned into 3 groups based on sequence vectors $q, r, s$. The output lines of the AND gates connected with $k_{0}, k_{1}$ and $k_{2}$ control lines receive the sequence vector $q: 01100$, $r: 01010$, and $s: 00110$ respectively. No additional
hardware is essential for this testable design. Only all the two inputs AND gates in IP-network have been replaced by three inputs AND gates.

The technique we have discussed above is applicable to single output AND-EXOR circuits. In this section we extend this idea to multiple output AND-EXOR circuits. To achieve $100 \%$ testability in multiplier circuits, the inputs of the EXOR gates of the IP- and networks will be properly mapped. We assume that the IP-network would generate the following sequence from left-to-right: $q, r, s, q, \ldots$, $q, r, s, q \ldots$ and so on at the outputs $e_{j}$, where $0 \leq j \leq$ $m-2$. To propagate these $e_{j}$ outputs of the IPnetwork at the outputs of the Q-network, the $d_{i}$ outputs, where $0 \leq i \leq m$, will be properly mapped with the sequences $q, r$, and $s$. The following algorithms outline this process.

Step-1: Assignment of sequences $q, r$, and $s$ to $e_{j}$, where $0 \leq j \leq m-2$.

```
Algorithm_seq_assignment _e
    for \((j=2 ; j<=m ; j++)\)
        \{
        \(e(m-j)=q ;\)
        \(e(m-(j+1))=r ;\)
        \(e(m-(j+2))=s\);
        \}
```

Example 3: For the multiplier circuit over $\mathrm{GF}\left(2^{4}\right)$ of Fig.2, the $\mathrm{e}_{\mathrm{j}}(0 \leq j \leq 2)$ are assigned as $e_{2}=q, e_{1}=r, e_{0}=s$.

Step-2: Assignment of the sequences $q, r$, and $s$ to $d_{i},(0 \leq i \leq m-1)$.

Condition 1: After assigning the sequences at the $e_{i}$ nodes in step- 1 , the sequences at $d_{i}$ nodes $(0 \leq i \leq m$ 1) are to be assigned in such a way that no two input nodes of each EXOR gate receive same sequence vector in the Q-networks.

Example 4: Consider tree representation of $\mathrm{BTX}_{3}$ block of Q-network as shown in Fig. 6a. In step-1, the nodes $e_{2}, e_{1}, e_{0}$ are already assigned with $q, r$, and $s$ respectively. In Fig 6a, $e_{2}$ and $e_{1}$ will generate $s$ sequence at $y_{2}$ node. To propagate the signal value considering condition 1 , at $c_{3}$ node, $q$ (or $r$ ) is to be assigned at $y_{l}$ node. As $e_{0}$ is already assigned with $s$, then r (or q ) is to be assigned with $d_{3}$ to generate $q$ (or r ) at $y_{l}$ node. In this way, the sequences are to be assigned to the $d_{\mathrm{i}}$ nodes, where $0 \leq i \leq m-1$.


Fig.6a, Tree representation of $B T X_{3}$ block


Fig. 7. Testable Design of GF(16) Multiplier


Fig.6b, Tree representation of $I P d_{3}$

Step-3: Assignment of the sequences $q, r$, and $s$ to the internal nodes of the IP-network

Condition 2: After assigning the sequences at $d_{i}$ 's and $e_{j}$ 's, assign the input nodes of EXOR gates in IP-network with proper sequence vectors so that no two inputs of an EXOR gate receive same sequence vector.

In example 4, r (or q ) is assigned to $\mathrm{d}_{3}$ node to generate q (or r ) sequence at $y_{1}$ as $\mathrm{e}_{0}$. Similarly, considering all the other BTX blocks, $s$ (or $r$ ), $r$ (or $q)$, and $q\left(\right.$ or $r$ ) are to be assigned to $d_{2}, d_{1}, d_{0}$ respectively. After assigning $e_{j}(0 \leq j \leq 2)$ and $d_{i}(0 \leq$ $i \leq 3$ ) nodes, every EXOR gate in IP-network is mapped. If the test sequence $v_{1}, v_{2}, v_{3}, v_{4}, v_{5}$ is applied, then the output lines of the AND gates connected with control lines $k_{0}, k_{1}$ and $k_{2}$ control lines receive the sequence vector $q: 01100, r: 01010$, and $s: 00110$ respectively.

Example 5: Consider $I P d_{3}$ block of IP-network of Fig. 2, as shown in the Fig.6b. The $d_{3}$ node is already assigned with either $q$ or $r$. If the sequence q is assigned to $d_{3}$, then r (or s ) and s (or r ) will be to be assigned to $g_{5}$ and $g_{6}$ nodes respectively. If $r$ is assigned to $d_{3}$, then q (or s ) and s (or q ) will be to be assigned to $g_{5}$ and $g_{6}$ nodes respectively. Suppose, we consider $\mathrm{d}_{3}=\mathrm{q}$, and $\mathrm{g}_{5}=\mathrm{s}, \mathrm{g}_{6}=\mathrm{r}$. Again, to generate s at $\mathrm{g}_{5}, \mathrm{q}$ and r are to be assigned to $\mathrm{g}_{1}, \mathrm{~g}_{2}$ nodes respectively. Similarly, to generate $r$ sequence at $\mathrm{g}_{6}$ node, s and q are to be assigned to $\mathrm{g}_{3}, \mathrm{~g}_{4}$ nodes respectively. Now we have $g_{1}=\mathrm{q}, \mathrm{g}_{2}=\mathrm{r}, \mathrm{g}_{3}=\mathrm{s}, \mathrm{g}_{4}=$ q . To generate q sequence at $\mathrm{g}_{1}$ node, one input of the $\mathrm{x}_{1}$ AND gate is to be connected to $\mathrm{k}_{0}$. Similarly, to generate $\mathrm{r}, \mathrm{s}, \mathrm{q}$ sequences at $\mathrm{g}_{2}, \mathrm{~g}_{3}, \mathrm{~g}_{4}$ nodes respectively, one input of the $x_{2}, x_{3}, x_{4}$ AND gates is to be connected to $\mathrm{k}_{1}, \mathrm{k}_{2}$, $\mathrm{k}_{0}$ control inputs respectively. In this way, the input nodes of the IPd blocks are assigned with the proper sequence vectors by selecting the proper connection of control inputs $k_{0}, k_{1}$, and $k_{2}$.

Example 6: The internal mapping of the interconnections in the IP- and Q-networks of Example 2 is shown in Fig.7, which is also the testable design of the multiplier designed from the primitive polynomial $P(x)=x^{4}+x^{3}+1$.

Transition Fault in EXOR part: The values of q, r, s at a particular instant are shown in the table below.

Table 1: $\mathrm{q}, \mathrm{r}, \mathrm{s}$ sequence diagram

| Instant | $\mathrm{t}_{1}$ | $\mathrm{t}_{2}$ | $\mathrm{t}_{3}$ | $\mathrm{t}_{4}$ | $\mathrm{t}_{5}$ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| q | 0 | 1 | 1 | 0 | 0 |
| r | 0 | 1 | 0 | 1 | 0 |
| s | 0 | 0 | 1 | 1 | 0 |

Case-1: When q and r sequence are applied at the
inputs of an EXOR gate (Table 1).
Slow-to-rise transition i.e. transition from $\mathrm{t}_{1}$ - to- $_{2}$ at $q$ input node will produce 1 instead of 0 at the output of EXOR gate because the input of $r$ input node changes its state, but due to slow-to-rise transition at q node, the input value retains its previous state at q node and shows 1 . Slow-to-rise transition i.e. transition from $t_{1}-t o-t_{2}$ at r input node will produce 1 instead of 0 at the output of EXOR gate because the input of q input node changes its state, but due to slow-to-rise transition at r node, the input value retains its previous state at $r$ node and shows 1. Slow-to-fall transition i.e. transition from $\mathrm{t}_{3}$-to- $\mathrm{t}_{4}$ at q input node will produce 0 instead of 1 at the output of EXOR gate. Slow-to-fall transition i.e. transition from $t_{2}$-to- $t_{3}$ at $r$ node will produce 0 instead of 1 at the output of EXOR gate i.e., the output retains its previous value 0 .

Case-2: When r and s sequence are applied at the inputs of an EXOR gate (Table 1).

Slow-to-rise transition i.e. transition from $t_{1}-$ to- $t_{2}$ at $r$ input node will produce 0 instead of 1 at the output of EXOR gate i.e. the output retains its previous value 0 . Slow-to-rise transition i.e. transition from $\mathrm{t}_{2}$-to- $\mathrm{t}_{3}$ at $s$ input node will produce 0 instead of 1 at the output of gate because the input value at $r$ node changes its state, but due to slow-torise transition at $s$, the input at $s$ node retains value of $t_{2}$ and output shows 0 . Slow-to-fall transition i.e. transition from $\mathrm{t}_{2}$-to- $\mathrm{t}_{3}$ at $r$ node will produce 0 instead of 1 at the output of gate because input of $s$ node changes its state, but due to slow-to-fall transition at $r$ node, the input value retains its previous state at $r$ node and shows 0 . Slow-to-fall transition i.e. transition from $\mathrm{t}_{4}$-to- $\mathrm{t}_{5}$ at $s$ node will produce 1 instead of 0 at the output of EXOR gate because the input of $r$ node changes its state, but due to slow-to-fall transition at s , the input of s node does not change and output shows 1 .

Case-3: When q and s sequence are applied at the inputs of an EXOR gate (Table 1).

Slow-to-rise transition i.e., transition from $t_{1}$-to- $t_{2}$ at $q$ input node will produce 0 instead of 1 at the output of gate i.e. the output retains previous value 0 . Again, slow-to-rise transition i.e. transition from $\mathrm{t}_{2}$-to- $\mathrm{t}_{3}$ at $s$ input node will show 1 instead of 0 at the output of the gate i.e. output retains previous output 1. Slow-to-fall transition i.e. transition from $\mathrm{t}_{3}$-to- $\mathrm{t}_{4}$ at $q$ node will show 0 instead of 1 at the output i.e. output retains its previous value 0 .

Similarly, slow-to-fall transition i.e., transition from $\mathrm{t}_{4}$-to- $\mathrm{t}_{5}$ at s node will show 1 instead of 0 at the output i.e. the output retains it previous value 1 .

Lemma 1: The vector sequence $\left(v_{1}, v_{2}, v_{3}, v_{4}, v_{5}\right)$ will detect the transition faults in the EXOR part of the multiplier circuits.

Proof: Follows from the above discussions. $\square$
Testability in the AND gate: The following tests will detect transition faults in a three- input AND gate with inputs $\mathrm{a}, \mathrm{b}, \mathrm{k}$.

Table 2: input state diagram for AND gate

| Instant | $\mathrm{t}_{1}$ | $\mathrm{t}_{2}$ | $\mathrm{t}_{3}$ | $\mathrm{t}_{4}$ | $\mathrm{t}_{5}$ | $\mathrm{t}_{6}$ | $\mathrm{t}_{7}$ |
| :---: | :---: | ---: | ---: | ---: | ---: | ---: | :--- |
| a | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| b | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| k | 1 | 1 | 1 | 1 | 1 | 0 | 1 |

If the vector pair $(\mathrm{a}, \mathrm{b}, \mathrm{k})=(111,011)$ is applied, first vector 111 initialises AND output at logic value 1. When a slow-to-fall transition i.e. transition from $t_{1}$-to- $t_{2}$ occurs at 'a' input node, test vector 011 will show 1 at the output instead of 0 , i.e. output retains previous value. Again, if the vector pair $(011,111)$ is applied, first vector 011 initialises the AND output at 0 . When a slow-to-rise transition i.e. transition from $t_{2}$-to- $t_{3}$ occurs at 'a' input node, test vector 111 will show 0 instead of 1 at output i.e. the output retains 0 . Similarly, vector sequences (111, $101,111)$ and $(111,110,111)$ will detect transition fault at ' $b$, and ' $k$ ' input nodes respectively.

Lemma 2: The vector sequences $\mathrm{v}_{2}, \mathrm{v}_{3}, \mathrm{v}_{4}, \mathrm{v}_{5}, \mathrm{v}_{8}$, $\mathrm{v}_{6}, \mathrm{v}_{8}, \mathrm{v}_{7}, \mathrm{v}_{8}$ will detect the transition faults in the AND part of the multiplier circuits where,

$$
\begin{aligned}
& \left\{\begin{array}{lllllllllll}
k_{0} & k_{1} & k_{2} & a_{1} & a_{2} & \ldots & a_{m-1} & b_{1} & b_{2} \ldots & b_{m-1}
\end{array}\right. \\
& v_{6}=\left\{\begin{array}{llllllllll}
1 & 1 & 1 & 0 \ldots 0 & \ldots & 1 & 1 & 1
\end{array}\right\} \\
& v_{7}=\left\{\begin{array}{llllllll}
1 & 1 & 1 & 1 \ldots & 1 & 1 & 0 & \ldots
\end{array}\right\} \\
& v_{8}=\left\{\begin{array}{llllllll}
1 & 1 & 1 & 1 \ldots & 1 \ldots & 1 & 1
\end{array}\right\}
\end{aligned}
$$

Proof: The vector pair $\left(\mathrm{v}_{2}, \mathrm{v}_{3}\right)$ generates $(011,111)$ at the three inputs of AND gates connected with control input $\mathrm{k}_{2}$, and $(111,011)$ at three inputs of AND gates connected with $\mathrm{k}_{1}$ respectively. Again, the vector pair $\left(\mathrm{v}_{3}, \mathrm{v}_{4}\right)$ generates $(011,111)$ at the three inputs of the AND gates connected with control input $\mathrm{k}_{1}$, and $(111,011)$ at three inputs of AND gates connected with $\mathrm{k}_{0}$ respectively. The vector pair $\left(\mathrm{v}_{4}, \mathrm{v}_{5}\right)$ generates $(111,011)$ at the three inputs of the AND gates connected with control input $\mathrm{k}_{2}$. The vector pair ( $\mathrm{v}_{5}, \mathrm{v}_{8}$ ) generates ( 011 , 111) at the three inputs of the AND gates connected with control input $\mathrm{k}_{0}$. The sequence $\left(\mathrm{v}_{8}, \mathrm{v}_{6}, \mathrm{v}_{8}\right)$ will
produce $(111,101,111)$ sequence at three inputs of all the AND gates. The sequence ( $\mathrm{v}_{8}, \mathrm{v}_{7}, \mathrm{v}_{8}$ ) will produce the sequence $(111,110,111)$ at three inputs of all the AND gates in the multiplier circuits. Hence, proof.

Theorem 2: Any transition fault in the proposed Multiplier network is testable by the constant test set $T$ of length 10 , where $T=\left(v_{1}, v_{2}, v_{3}, v_{4}, v_{5}, v_{8}, \mathrm{v}_{6}\right.$, $\mathrm{v}_{8}, \mathrm{v}_{7}, \mathrm{v}_{8}$ ).
Proof: Follows from Lemma 1, and 2.

### 3.1 Gate Complexities

The gate complexities different types of polynomials are given in the Table 3. In this table, the value of $s$ is 1 for All One Polynomial (AOP) and $\mathrm{m} / 2$ for trinomial defined by $x^{m}+x^{k}+1$ where $\mathrm{k}=\mathrm{m} / 2$. For $\mathrm{k} \neq \mathrm{m} / 2, \mathrm{~s}=1$ for trinomial.

## 4. Experimental Results

We performed area, delay, power and test set size analysis on various $G F\left(2^{m}\right)$ multipliers based on different polynomial basis. The area, power and delay analysis are based on $0.18 \mu$ CMOS technology library from UMC. The table 4 shows that the area of some $G F$ multiplier circuits has been increased by approximately only 6 percent to ensure 100 percent testability. Since the overall multiplier complexity depends on primitive polynomial, there is a slight variation in percentage overhead [Table 4]. The comparative analysis of area, delay and power is shown in Fig. 8. On an average there is 6 percent increase in area and power. The delay overhead is negligible, when the overall delay of the multiplier is considered. The designs were synthesized using the Synopsys tools. Synopsys's Power Compiler® was used to estimate the power consumption.

Our test set is constant of length only 10 , which eliminates the need for test generation programs. Table 4 compares our test scheme with ATPG-based test generation. Synopsys ${ }^{\circledR}$ tools are used to generate ATPG based test pattern. In ATPG-based scheme, all the $G F$ multiplier circuits require more test patterns than that required in the proposed easily generated test generation scheme for achieving $100 \%$ fault coverage. For example, in ATPG based scheme, $G F\left(2^{16}\right)$ multiplier circuit requires 32 -bit more than 97 test patterns, whereas in our scheme it
requires only 32 -bit 10 test patterns. As our scheme requires only 10 constant test patterns, it ensures reductions in test application time and the associated power consumptions.

BIST scheme may be incorporated to generate the required 10 vectors internally. The BIST will provide two benefits: firstly it will eliminate the need for the three control inputs necessary for fully testing the multipliers, and secondly it will provide an added level of security. It can be designed with minimum hardware to generate 10 test vectors. Note that for all fields, the logic will remain same because the pattern remains the same. Only word length varies depending upon $m$.

## 5. Conclusion

This paper presents a C-testable design of PB bitparallel multipliers over $\operatorname{GF}\left(2^{m}\right)$ for achieving $100 \%$ faults coverage. For an $m$-bit multiplier circuit, a constant test set of length 10 for detecting both the transition and stuck-at faults is derived. The testable design requires $6 \%$ (avg.) extra hardware and 3 control inputs. BIST circuit can be used to generate test pattern internally. This eliminates the need for the three control inputs and also provides an added level of public-key security. The test set being very short in length, reduces test application time and test power.

## References

1. Y. Wu and M.I. Adham, "Scan-Based BIST Fault Diagnosis," IEEE TCAD, vol. 18, no. 2, pp. 203-211, 1999.
2. I.S. Reed and X. Chen, Error-Control Coding for Data Networks. Kluwer Academic, 1999.
3. J.H. Guo and C.L. Wang, "Systolic Array Implementation of Euclid's Algorithm for Inversion and Division in $G F\left(2^{m}\right)$," IEEE Trans. Computers, vol. 47, no. 10, pp. 1161-1167, 1998.
4. A. Reyhani-Masoleh, and M. A. Hasan, "Low Complexity Bit Parallel Architectures for Polynomial Basis Multiplication over $\operatorname{GF}\left(2^{\mathrm{m}}\right)^{\prime \prime}$, IEEE Trans. Computers, vol.53, no.8, pp.945959, 2004.
5. E.D. Mastrovito, "VLSI Architectures for Computation in Galois Fields," PhD thesis, Linkoping Univ., Linkoping, Sweden, 1991.
6. D. K. Pradhan, "A Theory of Galois Switching Functions", IEEE Trans. Computers, vol. 27, no. 3, pp.239-248, Mar. 1978.
7. H. Rahaman, D. K. Das, and B. B. Bhattacharya, "Testable design of GRM network with EXOR-tree for detecting stuck-at and bridging faults," $A S P D A C$ 2004, pp. 224229.
8. H. Rahaman, J. Mathew, A. M. Jabir and D. K. Pradhan, "Easily Testable Implementation for Bit Parallel Multipliers in GF ( $2^{\mathrm{m}}$ )", HLDVT 2006.
9. J. A. Waicukauski, E. Lindbloom, B. K. Rosen and V. S. Iyengar, "Transition Fault Simulation", IEEE Design \& Test of Computers, Vol. 4, No. 2, April 1987.
10. A. K. Pramanick and S. M. Reddy, "On the Detection of Delay Faults", ITC88, pp. 845-856.
11. G. L. Smith, "Model for Delay Faults Based Upon Paths", ITC85, pp. 342-349.
12. C. J. Lin and S. M. Reddy, "On Delay Fault Testing in Logic Circuits", IEEE TCAD, pp. 694-703, Sept. 1985.
13. I. Pomeranz and S. M. Reddy, "Static Compaction for Two-Pattern Test Sets", Proc. ATS, pp. 222-228, 1995.
14. X. Liu, M. S. Hsiao, S. Chakravarty and P. J. Thadikaran, "Novel ATPG Algorithms for Transition Faults", ETW, pp. 47-52, May 2002.
15. K. T. Cheng. Transition Fault Simulation for Sequential Circuits. In Proc. International Test Conference, pp.723-731, October 1992.
16. C. Haung and C. WU, "High-Speed C-Testable Systolic Array Design for Galois-Field Inversion," ED\&TC 97, pp.342-346.
17. BLAHUT, R. E. 1985. Fast Algorithms for Digital Signal Processing. Addison- Wesley.

Table 3: Gate complexities for different Polynomial Basis

| Polynomials <br> type | Original Implementation |  | Testable Implementation |  |
| :---: | :---: | :---: | :---: | :---: |
|  | \# of 2 inputs AND gate | \# of 2 inputs EXOR gate | \# of 3 inputs AND gate | \# of 2 inputs EXOR gate |
| ESP | $m^{2}$ | $m^{2}-s$ | $m^{2}$ | $m^{2}-s$ |
| Trinomial | $m^{2}$ | $m^{2}-1$ | $m^{2}$ | $m^{2}-1$ |
| Pentanomial | $m^{2}$ | $m^{2}+m$ | $m^{2}$ | $m^{2}+m$ |



Fig. 8a: Area: Original vs C-Testable Version


Fig. 8b: power : Original vs C-Testable Version


Fig. 8c: Delay Analysis: Original vs C-Testable Version

Table 4: Details of area and number of tests required to achieve $100 \%$ coverage

| GF multiplier | Irreducible <br> Polynomial | Area in $\mu \mathrm{m}^{2}$ |  | \% extra area for testability | \# of tests for $100 \%$ faults coverage |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | Original circuit | Testable circuit |  | Original ckt using ATPG | Testable circuit |
| $\mathrm{GF}\left(2^{2}\right)$ | $x^{2}+x+1$ | 400.02 | 425.8 | 6.4 | 9 | 10 |
| $\mathrm{GF}\left(2^{3}\right)$ | $x^{3}+x+1$ | 728.9 | 758.06 | 5.3 | 15 | 10 |
| $\mathrm{GF}\left(2^{4}\right)$ | $x^{4}+x+1$ | 1200.1 | 1238.1 | 5.18 | 22 | 10 |
| GF( $2^{5}$ ) | $x^{5}+x^{3}+x^{2}+x+1$ | 1748.3 | 1864.4 | 6.64 | 24 | 10 |
| $\mathrm{GF}\left(2^{6}\right)$ | $x^{6}+x+1$ | 2180.5 | 2296.6 | 5.32 | 38 | 10 |
| $\mathrm{GF}\left(2^{7}\right)$ | $x^{7}+x+1$ | 2819.2 | 2932.1 | 5.31 | 39 | 10 |
| $\mathrm{GF}\left(2^{8}\right)$ | $x^{8}+x^{4}+x^{3}+x^{2}+1$ | 3896.6 | 4103.0 | 6.7 | 50 | 10 |
| $\mathrm{GF}\left(2^{9}\right)$ | $x^{9}+x^{4}+1$ | 4431.9 | 4693.3 | 5.9 | 72 | 10 |
| $\mathrm{GF}\left(2^{10}\right)$ | $x^{10}+x^{3}+1$ | 5251.3 | 5573.8 | 6.52 | 97 | 10 |

