# Hardware implementation of a digital processing of nuclear medical imaging acquisition and processing system

Bouraoui MAHMOUD, Habib ESSABBAH, Med Hedi BEDOUI
Group of Medical Imaging Technology, Faculty of Medicine at Monastir
Laboratory of Biophysics, Faculty of Medicine at Monastir - 5019 Monastir - Tunisia
E-mail: bouraoui.mahmoud@fsm.rnu.tn

*Abstract:* - This paper deals with the implementation of the digital processing module of a nuclear medical imaging acquisition and processing system. The digital processing module is composed of the conversion interface, the spectrometry, the position calculation, the linearity correction and the PC communication. We present the specification and description of the digital processing, our own optimizations of the algorithms of this module, and hardware implementation results on three types of FPGA circuit (Spartan and Virtex from Xilinx, Cyclone from Altera). We compare the performance (latency, throughput and implementation space) of the three circuits, and of the hardware implementation against the software implementation.

Key-Words: - Gamma camera, FPGA, DSP, Implementation, Spectrometry, Linearity, scintigraphic image construction

## **1** Introduction

In recent years, medical imaging has evolved considerably. Its applications have become more and more sophisticated; they need higher performance and a wider variety of services from the acquisition and processing system that supports them. Image acquisition and processing technology has advanced and now produces good image quality and count rates measured in millions of events per second.

The digital processing algorithms of nuclear medical imaging systems can be implemented in several ways: in software, in hardware, or in mixed hardware/software. The choice between these implementations depends on the application and on constraints such as speed, count rate and performance. Software implementation of the processing algorithms remains the simplest solution to put in place, but it penalizes real-time operation when the platform is poorly chosen. The hardware solution remains the most promising implementation technique for real-time applications. Moreover, advances in integration techniques make the hardware solution more efficient than software and less costly to deploy.

We built a system for acquiring and processing the signals coming from the detection head of a gamma camera (SOPHY DS7); the results were presented in [1]. It consists of an analogue block (signal shaping and analogue processing) and a digital block designed around a DSP (Digital Signal Processor) that handles image construction, data correction and data transfer. A dual-port memory (DPRAM) provides the PC communication. The limitations of this approach are the many connections between these components and the complexity of the realization. The digital module is not extensible and the test phase is long. In addition, the algorithms implemented in the DSP run serially, which lengthens the system dead time.

As a solution to these problems, we opted for a hardware implementation on an FPGA (Field Programmable Gate Array) circuit in order to benefit from the interesting features of this circuit (parallelism, more implementation space, flexibility, speed, configurable internal memory, reconfigurability, ease of development and test...).

This paper is dedicated to the design study of the digital processing module of the nuclear medical imaging acquisition and processing system, implemented in an embedded circuit. We start with the system specification and description and finish with the implementation on three types of FPGA circuit. A comparison of performance (latency, throughput and implementation space) between the three FPGA circuits (Spartan and Virtex from Xilinx, Cyclone from Altera) is also given.

## 2 Specification and description of application

#### 2.1 Nuclear medical image acquisition and processing system

Nuclear imaging is also called scintigraphic imaging. Scintigraphic images are based on the detection of the radiation emitted by a radioactive isotope injected into the patient. These images give functional information.

The radiation detection principle is that of the Anger gamma camera [2]. The detector is composed of a scintillator crystal (NaI(Tl)), a matrix of photomultipliers (PM), a prelocalization circuit (which calculates the position of the radiation impact) and a preamplification circuit.

In this application, we worked on a so-called hybrid or semi-analogue gamma camera. The detection head output is composed of five analogue signals: four position signals  $(X^+, X^-, Y^+, Y^-)$  and an energy signal E. These signals give, respectively, the spatial position of the radioactive impact on the detection field and the energy of the radioisotope used (fig. 1).



Figure 1: (a) detection head principle of a semi-analogue gamma camera; (b) shape of the output signals  $(X^+, X^-, Y^+, Y^- \text{ and } E)$ .

These signals are fed into an analogue module (signal shaping and analogue processing), then routed to a conversion module (ADC) that performs the analogue-to-digital conversion of the useful information of each signal with 8-bit resolution. At the output of this module we have five digital signals ( $X^+$ ,  $X^-$ ,  $Y^+$ ,  $Y^-$  and E) and a status word indicating the presence of an event to be counted and processed.

At the output of the converters, a digital processing module performs the spectrometry, the position calculation (X, Y), the linearity correction, the control of and communication with the ADC, and the data communication with the PC (figure 2).

We implement all these functions on FPGA circuits of the Spartan and Virtex families from Xilinx and the Cyclone family from Altera, whose features are presented in the following section.

#### 2.2 The features of FPGA circuits and operators

We implement the corresponding design on FPGA circuits of the Spartan and Virtex families from Xilinx and the Cyclone family from Altera. FPGAs are programmable circuits whose latest generations allow the design of large systems using complex calculation blocks.

| FPGA circuit        | Gates   | CLB or<br>LE | Number<br>of I/O | Block<br>RAM | Select<br>RAM<br>bits |
|---------------------|---------|--------------|------------------|--------------|-----------------------|
| Spartan<br>(XCS40)  | 40000   | 28 * 28      | 224              | 0            | 25 088                |
| Virtex<br>XCV1000   | 1569170 | 64 * 96      | 660              | 96           | 393 216               |
| Cyclone<br>EP1C20   | -       | 20 060       | 301              | 64           | 294 919               |

Table 1: Features of the FPGA circuits



Figure 2: nuclear medical imaging acquisition and processing system design

Table 1 presents the features of the three FPGA circuits used. The basic operations used in our application algorithms are addition, subtraction, multiplication, division and comparison.

To compare on-line designs with others, we have to define the criteria for efficiency of a design. The efficiency of an arithmetic implementation can be measured by three numbers: latency, throughput, and area [4]:

- Latency: The latency is the number of time units between applying the inputs and getting the output.

- Throughput: The throughput of a design is defined as the number of input sets the design can process per time unit. In on-line designs, the throughput is limited by the component that is busy for the most clock cycles to process the inputs and produce the output.

- Area: The area gives the number of area units in the chip that are occupied by the design. It is a measure of the cost and power consumption of a design. For FPGA designs, area is mostly expressed as the number of CLBs, slices or LEs.

| Operators      | Size<br>(bits) | FPGA Spartan XCS40<br>(25 MHz) |               | FPGA Virtex<br>XCV1000 (50 MHz) |               | FPGA Cyclone<br>EP1C20 (50 MHz) |              |
|----------------|----------------|--------------------------------|---------------|---------------------------------|---------------|---------------------------------|--------------|
|                |                | Speed (ns)                     | Size<br>(CLB) | Speed<br>(ns)                   | Size<br>(CLB) | Speed (ns)                      | Size<br>(LE) |
| addition       | 8              | 10.2                           | 6             | 6.514                           | 4             | 6.57                            | 8            |
|                | 16             | 11                             | 10            | 6.551                           | 8             | 7.07                            | 16           |
| subtraction    | 8              | 10.28                          | 6             | 6.548                           | 4             | 6.65                            | 8            |
|                | 16             | 11                             | 10            | 6.551                           | 8             | 7.15                            | 16           |
| multiplication | 8              | 10.2                           | 55            | 6.556                           | 36            | 7.83                            | 133          |
|                | 16             | 10.4                           | 210           | 6.556                           | 140           | 8.59                            | 529          |
| division       | 8              | 25.8                           | 133           | 13.04                           | 129           | 38                              | 102          |
|                | 16             | 66.3                           | 316           | 32.28                           | 343           | 48.8<br>F=30MHz                 | 334          |
| comparator     | 8              | 10.3                           | 4             | 6.544                           | 13            | 6.15                            | 24           |
|                | 16             | 10.3                           | 9             | 6.544                           | 23            | 6.27                            | 46           |

Table 2: Size and speed of the basic operators on the three FPGA circuits

Table 2 gives the size and speed of the basic operators for different operand sizes. These data come from the synthesis and implementation results of the on-line operators used in our application on a Xilinx Spartan FPGA, a Xilinx Virtex FPGA and an Altera Cyclone FPGA.

#### 2.3 The conversion interface module

This interface is the bus master for the transfer of the five converted values. It monitors the state of the two adjacent modules (analogue module and conversion module); when an event (a radioactive impact on the detection field) is present, the interface block acquires the five digital signals and transfers them to the digital processing module (figure 3). It organizes the data transfer on a 16-bit wide bus.

Figure 3: conversion interface module design

#### 2.4 The spectrometry

Spectrometry is based on the analysis of the energy value (E), which the algorithm tests against a set of energy windows. There are several correction methods, each using a certain number of spectrometric windows (figure 4) [5, 6, 7].

$$\begin{cases} \text{if } E \in W_1 \cup W_2 \cup \dots \cup W_n \implies \text{accepted event } (S=1) \\ \text{if } E \notin W_1 \cup W_2 \cup \dots \cup W_n \implies \text{rejected event } (S=0) \end{cases} \qquad (1)$$



Figure 4: on-line Spectrometry circuit design

The algorithm contains only comparisons: every window has two thresholds (Tb1, Th2), so there are two comparisons per window. The tests of whether the energy value (E) belongs to each window are implemented as parallel processes, and every test is independent of the others.

The spectrometry circuit requires three stages (figure 4): a memorization stage (register), a comparison stage and a decision stage. Every window requires two comparators (8 bits), two registers and a decision block. The two comparators work in parallel.
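The window test of equation (1) can be illustrated in software. The following is only a behavioural sketch, not the circuit itself: the function name `accept_event`, the window values and the choice of inclusive thresholds are assumptions made for illustration.

```python
def accept_event(energy, windows):
    """Return S = 1 if the energy lies in at least one window, else S = 0.

    Each window is a (low, high) pair of thresholds. Inclusive bounds are
    an assumption; the text does not say whether the limits are included.
    """
    return int(any(low <= energy <= high for low, high in windows))


# Hypothetical 8-bit window thresholds, chosen only for illustration.
WINDOWS = [(120, 160), (90, 110)]

print(accept_event(140, WINDOWS))  # inside the first window -> 1
print(accept_event(115, WINDOWS))  # between the two windows -> 0
```

In the hardware version every window test runs at the same time; the `any` call here evaluates the same tests one after the other and yields the same decision.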

#### 2.5 Coordinate calculation (X, Y)

The spatial coordinates (X, Y) of the event are defined by the following formulas [3]:

$$X = k \frac{X^{+} - X^{-}}{X^{+} + X^{-}} \qquad \text{and} \qquad Y = k \frac{Y^{+} - Y^{-}}{Y^{+} + Y^{-}} \qquad (2)$$

where k is a weighting factor.

To save execution time and reduce the number of mathematical operations, we can optimize the algorithm by replacing the division with a simple memory access. This is easy to do because the values  $(X^+ + X^-)$  and  $(Y^+ + Y^-)$  are integers between 0 and 255 and k is constant. The value  $(X^+ + X^-)$ , respectively  $(Y^+ + Y^-)$ , serves as the address for the corresponding memory. The values of the expressions  $(k/(X^+ + X^-))$  and  $(k/(Y^+ + Y^-))$  are stored in the memory blocks, which are allocated in the memory space of the FPGA circuit.

The two calculation circuits for X and Y are implemented and operate in parallel. The algorithm comprises two additions and two subtractions executed in parallel, two memory accesses executed in parallel and two multiplications executed in parallel (figure 5).



Figure 5: On-line calculation circuit of X and Y design
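The replacement of the division by a memory read can be sketched in software as follows. This is a minimal model under stated assumptions, not the implemented circuit: the value of `K`, the fixed-point format (`FRAC_BITS`), the convention for a zero sum and the function name `coordinate` are all hypothetical. The 256-entry table follows the statement in the text that the sums lie between 0 and 255.

```python
K = 128        # hypothetical weighting factor k
FRAC_BITS = 8  # hypothetical number of fractional bits in the stored values

# Table of k / s in fixed point, addressed by the sum s = plus + minus.
# Entry 0 is set to 0 as an arbitrary convention to avoid dividing by zero.
RECIPROCAL = [0] + [round((K << FRAC_BITS) / s) for s in range(1, 256)]


def coordinate(plus, minus):
    """Approximate k * (plus - minus) / (plus + minus) with one table read."""
    total = plus + minus
    if not 0 <= total <= 255:
        raise ValueError("the sum must fit the 256-entry table")
    difference = plus - minus
    # One multiplication, then drop the fractional bits.
    return (difference * RECIPROCAL[total]) >> FRAC_BITS


x = coordinate(100, 50)  # X+ = 100, X- = 50; exact value is about 42.67
y = coordinate(60, 60)   # equal signals give 0
print(x, y)              # prints: 42 0
```

With these assumed parameters the result stays within two units of the exact quotient over the whole table; a wider fixed-point format would reduce the error at the cost of memory width.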

#### 2.6 Linearity correction

To reduce the linearity defects produced by the detection head and the analogue processing electronics, we apply a correction to the X and Y values. Two memories of linearity correction coefficients ( $\Delta x$  and  $\Delta y$ ) (figure 6) are defined beforehand and implemented in the FPGA circuit. The corrected coordinates (X' and Y') of the event impact are given by the following relation:

$$X' = X \pm \Delta x \qquad \text{and} \qquad Y' = Y \pm \Delta y \qquad (3)$$

where  $\Delta x$  is the error along the X axis and  $\Delta y$  is the error along the Y axis.



Figure 6: on-line linearity correction algorithm design

The memories are addressed by the X and Y values. The algorithm comprises two parallel memory reads and two parallel additions. The X and Y data are stored in two registers to allow the time needed to address the coefficient memories.
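As an illustration of relation (3), the sketch below models the correction as one table read and one addition per axis. It assumes that each coefficient table is addressed by its own coordinate and holds a signed offset (which covers the ± sign); the function name `correct_linearity` and the table contents are hypothetical.

```python
def correct_linearity(x, y, delta_x, delta_y):
    """Return (X', Y') = (X + delta_x[X], Y + delta_y[Y]).

    Assumption: each table is addressed by its own coordinate and stores
    a signed offset, so the addition covers both signs of relation (3).
    """
    return x + delta_x[x], y + delta_y[y]


# Hypothetical coefficient tables: no correction except at two addresses.
delta_x = [0] * 256
delta_y = [0] * 256
delta_x[100] = 2
delta_y[50] = -1

print(correct_linearity(100, 50, delta_x, delta_y))  # prints: (102, 49)
```

The two table reads are independent of each other, which is what allows the X and Y corrections to run in parallel in hardware.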

#### 2.7 PC-FPGA communication

In this new design of the digital processing module, the communication between the module and the PC is provided by FIFO memories rather than by the dual-port memory used in [1]. The FIFOs are implemented in the FPGA circuit (figure 7). We need two FIFOs: one for the data transfer and communication protocols from the module to the PC, and the other for transfers in the opposite direction.



Figure 7: PC-FPGA communication design

### **3** Synthesis and implementation results

We implemented our application on three different FPGA-based platforms. The first was built in our laboratory around a Xilinx Spartan FPGA (XCS40). The second is the RC1000-PP PCI board sold by Celoxica, designed around a Xilinx Virtex XCV1000 FPGA. The third is a Nios development kit, Cyclone edition, from Altera, designed around a Cyclone EP1C20 FPGA.

The synthesis and implementation results (execution time, implementation space and memory space) of the different algorithms of our application for the three types of FPGA circuit are presented in table 3. The Spartan FPGA runs at a clock frequency of 25 MHz [8] and can process up to 820 Kcps. The Virtex FPGA runs at a clock frequency of 50 MHz and can process up to 1.2 Mcps. The Cyclone FPGA runs at a clock frequency of 50 MHz and can process up to 1.3 Mcps.

|                                | FPGA Spartan<br>XCS40 (25 MHz) |     | FPGA Virtex XCV1000<br>(50 MHz) |        |                   | FPGA Cyclone EP1C20<br>(50 MHz) |     |                  |
|--------------------------------|--------------------------------|-----|---------------------------------|--------|-------------------|---------------------------------|-----|------------------|
| Function                       | Delay<br>(ns)                  | CLB | Delay<br>(ns)                   | Slices | Block<br>memory   | Delay<br>(ns)                   | LEs | Memory<br>(bits) |
| Conversion<br>interface module | 240                            | 44  | 140                             | 33     | 0                 | 140                             | 61  | 0                |
| Spectrometry                   | 30                             | 4   | 9.43                            | 4      | 0                 | 10                              | 3   | 0                |
| X calculation                  | 420                            | 148 | 94.7                            | 32     | 1<br>(4096 bits)  | 49.1                            | 154 | 2048             |
| Linearity<br>correction of X   | 17                             | 90  | 37.2                            | 5      | 1<br>(4096 bits)  | 28.8                            | 9   | 2048             |
| Y calculation                  | 420                            | 148 | 94.7                            | 32     | 1<br>(4096 bits)  | 49.1                            | 154 | 2048             |
| Linearity<br>correction of Y   | 17                             | 90  | 37.2                            | 5      | 1<br>(4096 bits)  | 28.8                            | 9   | 2048             |
| PC-FPGA<br>interface           | 540                            | 159 | 564.8                           | 92     | 2<br>(8192 bits)  | 510                             | 194 | 65536            |
| Total                          | 1217                           | 683 | 836.7                           | 203    | 6<br>(24576 bits) | 727.9                           | 584 | 73728            |

Table 3: Processing time of the different functions implemented on the three FPGA circuits
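The total delays of table 3 and the quoted count rates can be cross-checked with a short script. The sketch below rests on two assumptions that are not stated explicitly in the text but reproduce the tabulated totals: the spectrometry runs alongside the X and Y calculations, so only the slower of the two contributes to the total, and events are processed one at a time, so the maximum count rate is the inverse of the total delay. The dictionary keys and function names are illustrative.

```python
# Per-function delays in ns, copied from table 3.
# "xy" is the delay of the X and Y calculations, which run in parallel.
DELAYS_NS = {
    "Spartan XCS40": {"interface": 240, "spectrometry": 30, "xy": 420,
                      "linearity": 17, "pc": 540},
    "Virtex XCV1000": {"interface": 140, "spectrometry": 9.43, "xy": 94.7,
                       "linearity": 37.2, "pc": 564.8},
    "Cyclone EP1C20": {"interface": 140, "spectrometry": 10, "xy": 49.1,
                       "linearity": 28.8, "pc": 510},
}


def total_delay_ns(d):
    """Total delay, assuming spectrometry overlaps the X/Y calculation."""
    return (d["interface"] + max(d["spectrometry"], d["xy"])
            + d["linearity"] + d["pc"])


def max_count_rate_cps(total_ns):
    """Events per second, assuming one event is processed at a time."""
    return 1e9 / total_ns


for name, delays in DELAYS_NS.items():
    total = total_delay_ns(delays)
    rate = max_count_rate_cps(total) / 1e6
    print(f"{name}: {total:.1f} ns, about {rate:.2f} Mcps")
```

Under these assumptions the script gives 1217.0 ns, 836.7 ns and 727.9 ns, that is about 0.82, 1.20 and 1.37 Mcps, in line with the totals of table 3 and the count rates quoted above.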

| Process            | Time (µs)         |           |             |                           |                               |                               |
|--------------------|-------------------|-----------|-------------|---------------------------|-------------------------------|-------------------------------|
|                    | DSP processor [1] |           |             | FPGA circuit              |                               |                               |
|                    | TMS320C50         | TMS320C31 | TMS320C6711 | Spartan XCS40<br>(25 MHz) | Virtex<br>XCV1000<br>(50 MHz) | Cyclone<br>EP1C20<br>(50 MHz) |
| Digital processing | 7                 | 2.8       | 1.4         | 1.217                     | 0.836                         | 0.727                         |

Table 4: Processing time for the different software and hardware implementation targets

### **4** Discussion

The successful integration of a medical imaging application in a programmable circuit is evaluated in terms of performance, cost, power consumption and flexibility.

The software implementation of the module gives good flexibility but limited performance [1]. Integrating the module in an embedded circuit (FPGA) improves processing time, count rate, board size and cost (table 4).

This study explored alternatives for implementing the digital module in an FPGA circuit, with the aim of improving the data processing delay and reducing the board size and the cost.

The hardware implementation of the digital processing module gives good results in terms of latency and area. Exploiting the features of the FPGA circuit (parallelism and execution speed) contributes to decreasing the module dead time and the system size. Most of the algorithms and operations of our application lend themselves to parallel operation. Table 4 shows that the hardware implementation of our system is faster than the software one.

The processing time is reduced, first, by the new parallel organization of the calculation blocks. Note that such an organization is not possible for algorithms programmed on a single processor.

Exploiting details of the application algorithms to optimize the implementation is another principle of our design refinement.

Exploiting details of the implementation target for optimization is a further principle worth keeping. This is why FIFO channels were preferred to the dual-port RAM solution in the current design.

## 5 Conclusion

We implemented the digital processing module in real time on three FPGA circuits (Spartan and Virtex from Xilinx, Cyclone from Altera).

The digital processing module is composed of five parts: conversion interface module, spectrometry, position calculation, linearity correction and PC-FPGA communication. In order to facilitate the implementation and to reduce the dead time of the system, we applied our own optimizations to the five parts.

The synthesis and implementation results on FPGAs of the current generations (Virtex and Cyclone) showed the extended possibilities in terms of speed and size of the resulting system. The Virtex FPGA uses less memory space for the application than the Cyclone FPGA, but the latter is faster than the Virtex.

We are now implementing extended real-time processing such as image reconstruction, pixel intensity calculation and uniformity correction. These three types of processing include arithmetic operations on matrices (the original image matrix and the correction matrices). In this case we can implement these processes using hardware/software co-design. The advantage of a hardware/software implementation is the ability to configure and reload all the parameters of the system (linearity correction coefficients, uniformity correction coefficients, spectrometry window values, circuit control parameters, circuit configuration parameters...) and to calibrate the system automatically in real time.

#### References:

- [1] B. Mahmoud, M.H. Bedoui, R. Raychev, H. Essabbah, "Conception d'un système PC Compatible d'acquisition et de traitement des images en médecine nucléaire", *Journal of Biomedical Engineering: Innovation and Technology in Biology and Medicine (ITBM-RBM), Elsevier Press*, vol. 24, no. 5/6, pp. 264-272, December 2003.
- [2] H.O. Anger, "Scintillation Camera", *Rev. Sci. Instrum.*, vol. 29, pp. 27-33, 1958.
- [3] J.P. Esquerré, B. Danet, P. Gantet, "Evolution des gamma caméras", *Revue de l'ACOMEN*, vol. 2(2), pp. 161-174, 1996.
- [4] J.-L. Beuchat, "Étude et conception d'opérateurs arithmétiques optimisés pour circuits programmables", *PhD thesis, École Polytechnique Fédérale de Lausanne*, 2001. Thesis no. 2426.
- [5] R.J. Jaszczak, C.E. Floyd, R.E. Coleman, "Scatter compensation techniques for SPECT", *IEEE Trans. Nucl. Sci.*, vol. 32, pp. 786-793, 1985.
- [6] K. Ogawa, Y. Harata, T. Ichihara, A. Kubo, S. Hashimoto, "A practical method for position-dependent Compton-scatter correction in single photon emission CT", *IEEE Trans. Med. Imaging*, vol. 10, pp. 408-412, 1991.
- [7] T. Ichihara, K. Ogawa, N. Motomura, A. Kubo, S. Hashimoto, "Compton scatter compensation using the triple-energy window method for single- and dual-isotope SPECT", *J. Nucl. Med.*, vol. 34, pp. 2216-2221, 1993.
- [8] B. Mahmoud, M.H. Bedoui, R. Raychev, H. Essabbah, "Nuclear medical image treatment system based on FPGA in real time", *International Journal of Signal Processing (IJSP)*, vol. 1, pp. 61-64, 2004.