# A Mixed Signal Multiplier Principle for Massively Parallel Analog VLSI Systems

S. GETZLAFF, R. SCHÜFFNY Department of Electrical Engineering University of Technology Dresden Mommsenstraße 13, 01062 Dresden GERMANY

*Abstract:* - In this paper we introduce a mixed signal multiplier principle for the implementation of analog massively parallel computation arrays in a standard CMOS process. This principle is a powerful method to perform simple preprocessing algorithms for highly parallel data streams, e.g., image data, and is dedicated for compact low power implementations of both the acquisition and preprocessing unit. The processing of parallel data in their original parallel form avoids the high-power consuming serial processing commonly used in digital computation systems. Using simple electrical relationships, like Kirchhoff's current law, or the behavior of a single transistor we are able to perform multiplication and summation operations in a highly parallel manner, frequently required in image processing algorithms. To overcome the essential drawback of the lower accuracy of analog signal processing we present a mixed signal multiplier principle. As an application example an architecture for performing a 2D Discrete Cosine Transform is reported. System simulations of this architecture indicated a PSNR better than 40dB in consideration of given process variations.

Key Words: - parallel analog VLSI-systems, parallel computation, Discrete Cosine Transform, DCT, analog multiplier, CMOS sensor

IMACS/IEEE CSCC'99 Proceedings, Pages: 6331–6337

## **1** Introduction

The increasing need for portable devices forces the development of signal processing systems with very low power consumption. It is also driven by concerns about the growing relative costs of power supplies and heat removal systems. Besides the further improvement of existing solutions it is necessary to investigate new approaches with respect to low power consumption.

The novel principle, presented in section 2, is appropriate for processing parallel data streams, e.g., from an image sensor array, without data serialization. It multiplies digital values by analog values and sums the products in a highly parallel way with low power consumption, and it can be implemented in a conventional digital CMOS process. Frequently used image processing algorithms or transformations, e.g., filtering or DCT, are based on these two mathematical operations.

Section 3 presents an appropriate architecture for a common implementation of an image sensor and DCT unit performing the data compression.

For the implementation, described in section 4, we extensively used analog switched circuitry in the current domain. The class of Switched-Current (SI) circuitry features a high accuracy and a compact design compared to continuous processing circuits. The achieved accuracy is sufficient for a wide range of image processing applications.

System simulations of the reported architecture exhibit the applicability of analog circuitry for image processing, as presented in section 5.

## **2** The mixed signal multiplier principle

Low level processing algorithms for parallel data usually require huge numbers of multiplications and summations. In most cases the product of an analog value, e.g., the pixel value of an image sensor, and a digital value, e.g., a coefficient, is calculated, and the results of several multiplications are summed. The conventional approach to this problem is depicted in Fig. 1.



Fig. 1: The conventional approach

The analog values are converted and exclusively processed in the digital domain, where the computation is performed sequentially. This results in a high accuracy, but also in a high power dissipation due to the huge number of transfer operations and high clock rates. To avoid this drawback, we present a new mixed signal approach, depicted in Fig. 2.



Fig. 2: The mixed signal parallel approach

The analog values are processed in their original parallel form. This avoids the transfer operations and results in a compact design with low power consumption. The essential drawback of limited accuracy due to the use of analog circuits is overcome by selecting an appropriate circuit technique, which is reported in later sections.

The operation to be performed can be expressed as Eq. (1).

$$\underbrace{I}_{\text{analog output}} = \sum_{i} \underbrace{X_{i}}_{\text{analog input}} \cdot \underbrace{\left(\sum_{k} C_{i}^{k} \cdot 2^{k}\right)}_{\text{digital coefficient}} \quad (1)$$

The basic idea of the new approach is the sequential multiplication of the analog value 'X' with only one bit ' $C^{k}$ ' of the digital value at a time. For this purpose Eq. (1) is rearranged:

$$I = \underbrace{\sum_{k}}_{\text{accu}} \underbrace{2^{k}}_{\text{shift}} \cdot \left(\underbrace{\sum_{i}}_{\text{Kirchhoff}} \underbrace{X_{i} \cdot C_{i}^{k}}_{\text{one-bit multiplier}}\right); \quad C_{i}^{k} \in \{0, 1\} \quad (2)$$

Hence, the multiplication can simply be performed using a switch controlled by the value of $C^k \in \{0,1\}$, resulting again in an analog value. In the following this architecture is called one-bit multiplier. For summation we extensively use Kirchhoff's current law:

$$\sum_{n} I_{n} = 0, \quad (3)$$

supposing a current representation of the analog values, which is advantageous for the use of current mode circuitry. The multiplication is completed by shift and accumulate operations ($\cdot 2^k$, accu), which can be done easily and efficiently in the digital domain after A/D conversion.
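The rearrangement can be checked numerically. The following Python sketch is illustrative only: the function name `bit_serial_mac`, the input values, and the unsigned 3-bit coefficients are assumptions made for the example, and the A/D conversion is treated as ideal.

```python
def bit_serial_mac(x, c, n_bits):
    """Compute sum_i x[i] * c[i] by processing one coefficient bit at a time."""
    acc = 0.0
    for k in range(n_bits):
        # One-bit multipliers: a switch passes x[i] only if bit k of c[i] is 1.
        # Kirchhoff summation: all passed currents add up on a shared wire.
        plane_sum = sum(xi for xi, ci in zip(x, c) if (ci >> k) & 1)
        # Shift and accumulate, done digitally after an (idealised) conversion.
        acc += plane_sum * (1 << k)
    return acc

x = [0.25, 0.5, 0.75]   # hypothetical analog inputs, arbitrary units
c = [5, 3, 6]           # hypothetical unsigned 3-bit coefficients
direct = sum(xi * ci for xi, ci in zip(x, c))
serial = bit_serial_mac(x, c, n_bits=3)
print(direct, serial)
```

Both values agree because the order of the two summations can be exchanged; only the inner sum has to be formed in the analog domain.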

Now the principle is applied to the 2D DCT. The transformation can be written as:

$$\underbrace{I}_{\text{analog output}} = \sum_{i=0}^{7} \sum_{j=0}^{7} \underbrace{X_{i,j}}_{\text{analog input}} \cdot \underbrace{\left(\sum_{k} C_{i,j}^{k} \cdot 2^{k}\right)}_{\text{digital coefficients}}, \quad (4)$$

after rearranging we obtain:

$$I = \underbrace{\sum_{k} 2^{k}}_{\text{digital domain}} \cdot \underbrace{\left(\sum_{i=0}^{7} \sum_{j=0}^{7} X_{i,j} \cdot C_{i,j}^{k}\right)}_{\text{analog domain}}. \quad (5)$$



Fig. 3: The mixed signal multiplier principle

One-bit multipliers are placed in an array, and all outputs which have to be summed up are simply connected by wires. At this point the A/D conversion can be performed by a current mode A/D converter [2]. Finally, shift and accumulate operations complete the one-bit multiplication to a full scale operation, as already mentioned. For visualization see Fig. 3.

Equation (5) shows that many simple image processing algorithms, e.g., filtering and transformations, are well suited to the mixed signal multiplier principle.

The desired parallel processing is limited by the number of one-bit multipliers that can be implemented. However, an implementation of a higher number of multiplier cells is achievable due to the small required area of about $500\mu m^2$ and the low power consumption of less than 400nW for one cell, that will be discussed in section 4.



## **3** The Architecture

Fig. 4: Architecture of sensor and 2D DCT unit

The mixed signal multiplier principle is applicable to image processing algorithms. In this section we describe an architecture (see Fig. 4) dedicated to the 2D DCT of image sensor data, where the sensor matrix and the DCT unit will be implemented on one die.

In contrast to section 2 the functionality of the one-bit multiplier has been extended. Due to the sequential processing of the coefficient-bits ' $C^k$ ' the brightness values ' $X_{i,j}$ ' have to be kept constant over all digital accumulations. Hence, we use a Sample & Hold stage to store the brightness value. Combined with a switch we obtain the extended one-bit multiplier, see inset of Fig. 4.

Now, the analog brightness value can sequentially be switched to a common summation line depending on the binary value of the coefficient-bit ' $C^k$ '.

The architecture, depicted in Fig. 4, works as follows: The first eight rows of the sensor pixel matrix are read out and written to an array of extended one-bit multipliers. In contrast to Fig. 3 the one-bit multipliers of our architecture are placed in columns of 64 elements. Therefore, all 64 DCT-coefficients can be applied to all one-bit multipliers at the same moment. For all coefficient-bits the output currents of the one-bit multipliers are summed up per column, and the A/D conversion is performed followed by appropriate shift and accumulate operations. This is done for all 64 DCT-coefficients, and, accordingly, the 2D-DCT for eight rows is performed. All steps described above are executed for every block of eight rows until all image data are processed.
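The processing of one column can be sketched in Python as follows. This is a behavioral illustration under several assumptions that are not taken from the text: an orthonormal 2D DCT-II is used, the coefficients are scaled by $2^{12}$ before rounding, signs are handled by a plain signed sum, the A/D conversion is ideal, and all function names are hypothetical.

```python
import math

N = 8          # 8 x 8 block, i.e. 64 one-bit multipliers per column
N_BITS = 12    # coefficient word length; the scaling by 2**N_BITS is an assumption

def dct_weight(u, v, i, j):
    """Weight of pixel (i, j) in coefficient (u, v) of an orthonormal 2D DCT-II."""
    au = math.sqrt((1 if u == 0 else 2) / N)
    av = math.sqrt((1 if v == 0 else 2) / N)
    return (au * av
            * math.cos((2 * i + 1) * u * math.pi / (2 * N))
            * math.cos((2 * j + 1) * v * math.pi / (2 * N)))

def quantize(w):
    """Return (sign, integer magnitude) of a weight scaled by 2**N_BITS."""
    return (1 if w >= 0 else -1), round(abs(w) * (1 << N_BITS))

def column_dct(block):
    """Compute all 64 DCT coefficients of one 8 x 8 block, one bit plane at a time."""
    # Sample & Hold: the 64 brightness values stay constant during processing.
    held = [block[i][j] for i in range(N) for j in range(N)]
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            coeff = [quantize(dct_weight(u, v, i, j))
                     for i in range(N) for j in range(N)]
            acc = 0.0
            for k in range(N_BITS):
                # Column summation of the cells whose coefficient bit k is set.
                # Sign handling is simplified to a signed sum in this sketch.
                plane = sum(s * x for x, (s, m) in zip(held, coeff) if (m >> k) & 1)
                acc += plane * (1 << k)      # shift and accumulate
            out[u][v] = acc / (1 << N_BITS)  # undo the coefficient scaling
    return out

block = [[16 * i + 2 * j for j in range(N)] for i in range(N)]  # hypothetical pixel data
out = column_dct(block)
print(round(out[0][0], 3))
```

In this sketch each column needs 64 × 12 summation and conversion steps per block, one for every coefficient and bit plane; a real implementation would additionally have to account for converter resolution and cell errors.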

## **4** Implementation

The main drawback of using analog circuitry for signal processing is the limited accuracy. To overcome this limitation we extensively used the Switched-Current technique (SI-technique) for the implementation of the algorithm described above. The SI-technique is well known and widely used for high accuracy analog signal processing applications, e.g., high precision A/D-converters.



Fig. 5: Switched-Current memory cell

The S&H-circuit necessary for our application is realized by a fundamental SI-building block, the Switched-Current memory cell, depicted in Fig. 5. Neglecting all non-idealities the following relationship holds:

$$I_{\text{out}}\big|_{\mathrm{T2}} = -I_{\text{in}}\big|_{\mathrm{T1}}. \quad (6)$$

The SI-memory cell has been widely investigated, and many solutions for minimizing non-ideal influences, e.g., clock-feedthrough or limited output resistance, are reported in several publications [6]-[8]. For our SI-memory cell design we used a clock-feedthrough compensation technique proposed by D. M. W. Leenaerts et al. [5] and a simple cascode stage enhancing the output resistance, see Fig. 6. To handle positive and negative coefficients two separate summation lines per column have been implemented; one of these is also used for writing input data. The application of gate capacitors as storage devices is more area efficient than poly-poly capacitors and makes our circuit compatible with modern standard digital technologies.
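The use of two summation lines can be illustrated with a short Python sketch. It assumes sign-magnitude coefficients and assumes that the two line currents are subtracted after conversion; neither detail is stated in the text, and the names and values are hypothetical. The reuse of one line for writing input data is not modeled.

```python
def two_line_plane_sum(x, coeffs, k):
    """Sum bit plane k separately for positive and negative coefficients."""
    i_pos = sum(xi for xi, c in zip(x, coeffs) if c > 0 and (abs(c) >> k) & 1)
    i_neg = sum(xi for xi, c in zip(x, coeffs) if c < 0 and (abs(c) >> k) & 1)
    return i_pos, i_neg

def signed_mac(x, coeffs, n_bits):
    """Bit-serial multiply-accumulate with signed coefficients."""
    acc = 0.0
    for k in range(n_bits):
        i_pos, i_neg = two_line_plane_sum(x, coeffs, k)
        # Assumed: the difference of both lines is formed before accumulation.
        acc += (i_pos - i_neg) * (1 << k)
    return acc

x = [10.0, 20.0, 30.0, 40.0]   # hypothetical cell currents in nA
coeffs = [3, -2, 5, -7]        # hypothetical signed coefficients, 3-bit magnitude
result = signed_mac(x, coeffs, n_bits=3)
print(result, sum(xi * c for xi, c in zip(x, coeffs)))
```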



Fig. 6: Implemented SI-memory cell

Due to the goal of a low power application, subthreshold transistor operation is essential. Therefore, we have to deal with currents of a few tens of nanoamperes. This makes the design more sensitive to deviations due to technological variations, e.g., threshold voltage mismatch. The SI-technique is relatively insensitive to these kinds of deviations, which is a further reason for choosing it.

The residual relative error of the output current is below 1%, including threshold voltage mismatch and systematic errors, e.g., clock-feedthrough.

In preparation for an implementation of the whole system, comprising the sensor with preprocessing unit, a test structure containing three columns of 64 SI-memory cells has been implemented. We expect test results by mid-July. For visualization the layout view of one cell is depicted in Fig. 7. In a standard $0.65\mu m$ CMOS process the cell is $(24 \times 19)\mu m^2$ in size. The cell is designed to operate with a bias current $I_0 = 50$nA and a signal current of 0...100nA. Under these conditions the settling time is $t_{wr} \le 5\mu s$.
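As a rough budget, the cell figures can be scaled to the test structure. The Python sketch below only multiplies the numbers given in the text; it ignores wiring, A/D converters, and control logic, so the results are cell-only estimates rather than measured values.

```python
CELL_WIDTH_UM = 24          # cell width from the text
CELL_HEIGHT_UM = 19         # cell height from the text
CELL_POWER_MAX_NW = 400     # upper bound per cell, as stated in section 2
CELLS_PER_COLUMN = 64
TEST_COLUMNS = 3

cell_area_um2 = CELL_WIDTH_UM * CELL_HEIGHT_UM
cells = CELLS_PER_COLUMN * TEST_COLUMNS
array_area_mm2 = cells * cell_area_um2 / 1e6          # 1 mm^2 = 1e6 um^2
array_power_max_uw = cells * CELL_POWER_MAX_NW / 1e3  # 1 uW = 1e3 nW

print(cell_area_um2, cells, array_area_mm2, array_power_max_uw)
```

The cell area of 456 µm² obtained this way is consistent with the figure of about 500 µm² quoted in section 2.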



Fig. 7: Layout view of the implemented SI-memory cell, size: $(24 \times 19)\mu m^2$

## **5** System simulations



Fig. 8: Steps of the system simulation

A fundamental aspect of interest is how the non-ideal effects of the multiplier cells influence the behavior of the system. For this reason a system simulation has been performed, including the steps depicted in Fig. 8. To derive the necessary resolution 'x' of the A/D-conversion, this value has been parameterized. For all runs 12bit quantized DCT-coefficients have been used. To assess the image quality the PSNR (Peak Signal to Noise Ratio) is applied:

$$\text{PSNR} = 10\log\left(\frac{255^2}{\frac{1}{i j}\sum_{ij} (I_1(i, j) - I_2(i, j))^2}\right), \quad (7)$$

where  $I_1$  and  $I_2$  are the original and the distorted image, respectively.
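Eq. (7) can be written as a small Python function. This is a minimal sketch that assumes images stored as nested lists of equal size and a base-10 logarithm, as is usual for values in dB; the function name and the test data are hypothetical.

```python
import math

def psnr(original, distorted, peak=255):
    """Peak signal to noise ratio in dB between two equally sized images."""
    rows, cols = len(original), len(original[0])
    # Mean squared error over all pixels.
    mse = sum((original[i][j] - distorted[i][j]) ** 2
              for i in range(rows) for j in range(cols)) / (rows * cols)
    if mse == 0:
        return math.inf   # identical images
    return 10 * math.log10(peak ** 2 / mse)

a = [[0, 255], [128, 64]]    # hypothetical 2 x 2 test image
b = [[1, 254], [129, 63]]    # same image with every pixel off by one
print(psnr(a, b))
```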



Fig. 9: Results of system simulation

For the simulation an image with $128 \times 128$ pixels and values between 0 and 255 has been used. The random error in the one-bit multiplier array is assumed to be uncorrelated among the cells. A Gaussian distributed error is added to the summation result of each column, where the distribution has a standard deviation $\sigma$ relative to the summation result. Simulations of our SI-memory cell indicated a standard deviation of $\sigma = 1\%$. To assess the system's behavior in case of higher cell errors, system simulations with $\sigma = 2\%$ and $\sigma = 3\%$ have been carried out. The results for different resolution levels 'x' are reported in Fig. 9.
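The error model described above can be sketched in Python. The relative Gaussian error follows the description in the text; the uniform, unipolar A/D converter model, the full-scale value of 64 cells at 100 nA, and all names are assumptions made for illustration and do not reproduce the authors' simulation.

```python
import random

def noisy_column_sum(ideal_sum, sigma_rel, rng):
    """Add a Gaussian error with standard deviation sigma_rel * |ideal_sum|."""
    return ideal_sum + rng.gauss(0.0, sigma_rel * abs(ideal_sum))

def adc(value, full_scale, bits):
    """Ideal uniform quantiser for the range [0, full_scale] (assumed model)."""
    levels = (1 << bits) - 1
    clipped = min(max(value, 0.0), full_scale)
    code = round(clipped / full_scale * levels)
    return code * full_scale / levels

rng = random.Random(0)        # fixed seed for repeatability
full_scale = 64 * 100e-9      # 64 cells at the maximum signal current of 100 nA
ideal = 0.5 * full_scale      # hypothetical column sum
samples = [adc(noisy_column_sum(ideal, 0.01, rng), full_scale, 10)
           for _ in range(1000)]
mean = sum(samples) / len(samples)
print(mean / ideal)
```

Varying `sigma_rel` and `bits` in such a model corresponds to the parameter sweep over $\sigma$ and the resolution 'x'.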

Selected images of the simulation results are shown in Fig. 11 to Fig. 13 to visualize image quality.

With respect to the error behavior of our SI-memory cell the system simulation shows good results with the typical error and even acceptable results with an increased error. The achievable image quality mainly depends on quantization, which can be adapted by choosing an appropriate A/D-converter. For lower quality an 8bit A/D-converter would be sufficient, whereas 10bit resolution is necessary for higher quality.



Fig. 10: Original image,  $128 \times 128$  pixel



Fig. 11: Simulation result with the typical standard deviation of $\sigma = 1\%$ and an A/D-converter resolution of 10bit





Fig. 12: Simulation result with the increased standard deviation of $\sigma = 2\%$ and an A/D-converter resolution of 10bit

PSNR=35dB



Fig. 13: Simulation result with the increased standard deviation of $\sigma = 3\%$ and an A/D-converter resolution of 10bit

## **6** Conclusion

In this paper the mixed signal multiplier principle was introduced. This principle is appropriate for multiplying analog and digital data in a highly parallel way. An application of the principle was presented by reporting on an architecture for performing a 2D DCT, where the signal acquisition system (an image sensor) and the DCT-processing unit are implemented on one die. This results in a very compact information processing system with low power consumption – an appropriate solution for portable applications. A test implementation of this architecture is under production.

The system behavior was investigated with respect to process variations. System simulations of the architecture showed good results, whereby a PSNR of 41dB was achieved for the typical error of our one-bit multiplier. Hence, we can conclude that the mixed signal multiplier principle as well as the chosen implementation are suitable to perform a 2D DCT in parallel. Further research is currently being carried out for implementing the whole system on a single chip and adapting the principle to related applications, e.g., image filtering and other preprocessing algorithms.

## Acknowledgement

This work was partly supported by 'Deutsche Forschungsgemeinschaft (DFG), Sonderforschungsbereich SFB-358, Teilprojekt A7'.

## References

- [1] V. Bhaskaran, K. Konstantinides, *Image and Video Compression Standards*, Kluwer Academic Publishers, 1997
- [2] R. Srowik, R. Schüffny, A Low-Power Analog-to-Digital Converter suitable for Systems-on-Chip Integration, accepted at the 6th International Conference on Mixed Design of Integrated Circuits and Systems (MIXDES'99), June 1999

- [3] C. Toumazou, J. B. Hughes, N. C. Battersby, SWITCHED CURRENTS an analogue technique for digital technology, Peter Peregrinus, London, 1993
- [4] C. Toumazou, F. I. Lidgey and D. G. Haigh, Analogue IC Design: The Current Mode Approach, Peter Peregrinus, London, 1990, Reprint 1993
- [5] D.M.W. Leenaerts, G.R.M. Hamm, M.J. Rutten, G.G. Persoon, High performance switched-current memory cell, *Proc. ECCTD*'97, Budapest, 1997, pp. 234-239
- [6] G. Wegmann, E. A. Vittoz, F. Rahali, Charge Injection in Analog MOS Switches, *IEEE Journal of Solid-State Circuits*, Vol. 22, No. 6, Dec. 1987, pp. 1091-1097
- [7] G. Wegmann, E. A. Vittoz, Analysis and Improvements of Accurate Dynamic Current Mirrors, *IEEE Journal of Solid-State Circuits*, Vol. 25, No. 3, June 1990, pp. 699-706
- [8] J. Shieh, M. Patil, B. Sheu, Measurement and Analysis of Charge Injection in MOS Analog Switches, *IEEE Journal of Solid-State Circuits*, Vol. 22, No. 2, April 1987, pp. 277-281