# **Efficient VLSI Parallel Implementation for LDPC Decoder**

ANGUS WU and W. L. LEE
Department of Electronic Engineering
City University of Hong Kong
Tat Chee Avenue
HONG KONG

Abstract: - Iterative decoding of Low Density Parity Check (LDPC) codes using the Parity Likelihood Ratio (PLR) algorithm has been shown to be more efficient than the conventional Sum Product Algorithm (SPA). However, the PLR algorithm passes numerous pieces of data through the decoder and performs computation-intensive operations, which is a major challenge for building a practical real-time LDPC decoder. In this paper, we apply extrinsic information clipping and calculation step merging, techniques used in the modified sequential architecture, to a parallel implementation of the LDPC decoder. The proposed parallel architecture decreases the decoding latency without increasing the memory storage compared to the existing modified sequential design. Simulation results show that the proposed architecture achieves time savings of up to 96.12% and 37.94% over the conventional direct sequential implementation and the modified sequential design, respectively.

Key-Words: - LDPC codes, parallel architecture, VLSI implementation, PLR algorithm

#### 1 Introduction

Like turbo codes [1], LDPC codes [2] belong to the general class of powerful concatenated codes that employ pseudo-random encoders and iterative decoders [3]. They are the two best known codes capable of achieving low bit error rates (BERs) at low signal to noise ratios (SNRs) [4], [5]. They are recent breakthroughs in coding theory that promise to push the areal density of the magnetic recording channel to its limits [6]. LDPC codes were proposed by Gallager in 1962 [2], and their performance is very close to the Shannon limit [7]. However, LDPC codes were not pursued at the time because of implementation complexity [8]. Nevertheless, interest in iterative decoding algorithms has led to the rediscovery of LDPC codes [8] by MacKay and Neal [9], [10].

LDPC codes are similar to turbo codes in many respects and are widely considered serious competitors to turbo codes [11] in terms of performance and complexity. The two also share a similar philosophical basis: constrained random code ensembles and iterative decoding algorithms [12]. Recent advances in error correcting codes (ECCs) have shown that irregular LDPC codes can achieve reliable transmission at SNRs extremely close to the Shannon limit on the additive white Gaussian noise (AWGN) channel, outperforming turbo codes of the same block size and code rate [13].

There are several variations of the LDPC decoding algorithm, including the Sum Product Algorithm (SPA) and the Parity Likelihood Ratio (PLR) algorithm. Although all current implementations of the decoder employ SPA for decoding, a direct implementation of SPA can be very sensitive to the quantization effect. It has been shown that the PLR technique can greatly reduce the quantization level requirement [14], which leads to a significant reduction in decoding costs. As the horizontal step of the PLR algorithm involves only a look-up table for the PLR function, its realization is straightforward and requires only table searching.

LDPC codes are increasingly applied in communication systems such as cellular mobile phones and video conferencing, which require real-time encoding or decoding of transmitted data. However, the randomness of the LDPC code and the large amount of intermediate processing data result in stringent memory requirements that amount to an order of magnitude increase in complexity [5]. Therefore, reducing the size of the decoder is an increasing concern for feasible VLSI implementation.

In [15], a modified sequential decoder architecture takes into account the clipping of extrinsic information [14] and combines the horizontal backward step with the extrinsic information calculation to simplify the decoding process and boost decoding performance. Extrinsic information clipping decreases the number of quantization levels and iterations required, which reduces the size of the finite state machine in the control unit and the size of the look-up table. Decoding step merging eliminates the need to store the intermediate variables of the horizontal backward step and so minimizes memory storage and the number of read-write cycles. These measures simplify the architecture and result in a sub-optimal decoder design. Therefore, further improvement in the design of the LDPC decoder can take advantage of the techniques used in the modified sequential architecture for practical VLSI implementation.

In this paper, we propose a parallel architecture that employs the clipping and merging techniques to obtain a high-performance yet comparatively low-cost parallel decoder. Simulation results show that the proposed architecture is more efficient than the modified sequential decoder in [15] while having the same storage requirement.

The rest of the paper is organized as follows. Section 2 presents the two sequential LDPC decoders that use the PLR algorithm for decoding. Section 3 proposes a new optimized parallel architecture for reducing decoding time. Section 4 presents simulation results, followed by concluding remarks in Section 5.

# **2 Current Design Solutions**

Sequential implementation is one solution for a digital LDPC decoder. It avoids complex operations, applies an in-place algorithm, synchronises timing signals, incorporates an address counter and look-up tables to simplify combinational arithmetic, solves the latency problem due to the interleaving process, utilises memory modules and further optimizes chip area. Besides, it takes the most significant bit in the output section to eliminate exponent calculation and employs simple decision logic in the output section to terminate iteration. These measures simplify the architecture and result in a sub-optimal decoder. However, a major problem with this approach is that, even though many simplifications have been made, the chip is still large because of the memory requirement and the big finite state machine. In addition, the 6-bit architecture has been shown to only marginally satisfy the specification for real-time transactions. Therefore, both decoding time and chip size are the main drawbacks of the direct sequential architecture for practical VLSI implementation.

The modified sequential architecture, which incorporates clipping of extrinsic information and combines the horizontal backward step with the extrinsic information calculation, further reduces memory storage by half and doubles the decoding speed. However, the big finite state machine inside the control unit limits further development of the sequential decoder.

# 3 Implementation of Parallel Architecture

In general, parallel architectures for a given algorithm are attractive from an implementation perspective, giving low power, high throughput, and simple control logic [16]. Parallel architectures are even more favorable for iterative algorithms whose data converge and for codes that can be iteratively decoded in a block-parallel fashion. One such family of codes is LDPC codes [16]. LDPC codes are linear block codes with a sparse parity check matrix [16]. Above a code-rate-dependent minimum block size, powerful LDPC codes and the PLR decoding algorithm map quite well to a parallel decoder architecture in which the algorithm is directly instantiated in hardware. As illustrated in Fig. 1, higher throughput can be achieved with a parallel decoder by simply implementing a code with the same block size and maintaining the same clock frequency as in a sequential architecture [16]. The main challenge in implementing a parallel decoder architecture for LDPC codes is the cycle arrangement of the control units. However, by careful management of the read-write process, it is possible to avoid address conflicts and timing allocation problems.

The decoding sequence of an LDPC code under the PLR algorithm is carried out iteratively, except for the initialization and output steps. The recursive steps operating on a dimension can be classified into four categories. The updating step and the horizontal forward step can operate concurrently and form the "Horizontal Forward" category. The vertical backward step and the vertical forward step, which depend on the results of the Horizontal Forward category, form two further categories. Finally, the horizontal backward step, merged with the extrinsic information calculation, forms the "Horizontal Backward" category. Its computation depends on the two vertical steps, and its result is used in the Horizontal Forward category of the next dimension.



Fig. 1 Block diagram of parallel decoder architecture for LDPC code.

| category (row #) | port A                    | port B                    |
|------------------|---------------------------|---------------------------|
|                  | Horizontal forward (0)    | Horizontal forward (128)  |
|                  | ↓                         | ↓                         |
| 255              | Horizontal forward (127)  | Horizontal forward (255)  |
|                  | Vertical backward (255)   | Vertical forward (0)      |
|                  | ↓                         | ↓                         |
| 255              | Vertical backward (0)     | Vertical forward (255)    |
|                  | Horizontal backward (0)   | Horizontal backward (128) |
|                  | ↓                         | ↓                         |
| 511              | Horizontal backward (127) | Horizontal backward (255) |
| Cycle (t)        |                           |                           |

Fig. 2 Timing of decoding steps in a dimension.

Therefore, by carefully arranging timing of instructions in a dimension, the four categories of operations can actually work in parallel. Since nowadays dual-port memory modules are available, the categories can be re-arranged in parallel using two data ports. By applying the decoding steps arrangement shown in Fig. 2 to the modified sequential decoder, decoding speed can be doubled. Since dual-port memory units are used, the total number of memory bits can remain unchanged.
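As an illustration only, the row ordering per port in Fig. 2 can be sketched in Python. The function name `dimension_schedule` and the notion of a "slot" (one scheduling step, not necessarily one clock cycle) are assumptions introduced here; the row ranges are read off Fig. 2.

```python
def dimension_schedule(rows=256):
    """Yield (category, port_a_row, port_b_row) slots for one dimension.

    Hypothetical helper: a slot is one scheduling step, not a clock cycle.
    """
    half = rows // 2
    # Horizontal forward: port A takes rows 0..127, port B takes rows 128..255.
    for i in range(half):
        yield ("horizontal forward", i, half + i)
    # Vertical backward on port A (255 down to 0) runs alongside
    # vertical forward on port B (0 up to 255).
    for i in range(rows):
        yield ("vertical", rows - 1 - i, i)
    # Horizontal backward (merged with the extrinsic calculation)
    # is split across the two ports like horizontal forward.
    for i in range(half):
        yield ("horizontal backward", i, half + i)


slots = list(dimension_schedule())
single_port_slots = 4 * 256  # assumed: four categories, one row per slot, one port
print(len(slots), single_port_slots)
```

Under these assumptions the two-port arrangement needs half as many slots per dimension as a single-port pass over the four categories, which matches the doubling of decoding speed described above.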

### 3.1 Input and Output Section

In each iteration, there are four equal time slots for performing the large number of calculations in the four dimensions. However, the first iteration is used not for calculation but for data input and output. The output step proceeds when the first-iteration signal is asserted: the result of decoding the last dimension in the last iteration is output before the next decoder input. The resulting values can be reduced by taking only the most significant bit. This design removes the exponent computation otherwise needed to convert the indices back from the logarithmic domain. After that, the information and parity data are input to the decoder in turn. Once data have been input from the ADC to the decoder, the updating step and the horizontal forward step can be carried out simultaneously.

#### 3.2 Interleavers and Deinterleaver

The randomness of the interleaver output sequence makes it difficult to realize in low complexity combinational circuit. A direct interleaver implementation uses two banks of buffers alternating between read and write for consecutive sectors of data. The latency through an interleaver is therefore equal to the block size [16].



Fig. 3 Interleavers and deinterleaver implemented using dual-port ROM.

The basic block interleaver design uses a minimal amount of control logic. Using ROM for high-speed implementation, the interleaver inputs are data position in current dimension, while outputs are data position in previous dimension. In updating step, when reading in shuffled extrinsic information from last dimension, the RAM address is read from ROM data and the ROM address is read from control unit. This decreases the latency problem due to data interleaving since a long duration stage for data shuffle process is eliminated. The read-read operations are then repeated alternating between ROM and RAM as in sequential design. More sophisticated interleaver designs yield improved error rate performance, but result in increased implementation complexity [16]. Therefore, the implementation of the described basic interleaver provides a lower limit on complexity [16].
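The ROM-then-RAM read described above amounts to one level of address indirection. The following Python sketch shows the idea under stated assumptions: the permutation stored in `interleaver_rom` is a made-up pseudo-random one (the paper does not give the actual interleaver pattern), and the RAM contents are placeholder values.

```python
import random

BLOCK = 256  # rows per dimension, as in the simulation setup of Section 4

# Hypothetical ROM contents: entry p holds the previous-dimension position
# of the data that belongs at position p in the current dimension.
rng = random.Random(0)
interleaver_rom = list(range(BLOCK))
rng.shuffle(interleaver_rom)


def read_shuffled(ram, rom, position):
    """Fetch the previous-dimension value needed at `position`."""
    ram_address = rom[position]  # ROM address from control unit, ROM data = RAM address
    return ram[ram_address]


prev_dim_ram = [10 * k for k in range(BLOCK)]  # placeholder extrinsic values
current = [read_shuffled(prev_dim_ram, interleaver_rom, p) for p in range(BLOCK)]
```

Because the shuffle is resolved at read time, no separate pass that physically reorders the data is needed.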

#### 3.3 Control Unit

In iterative programs such as LDPC code decoding, execution proceeds as a sequence of iterations. In each iteration, all parallel processes corresponding to logical functions and variables can execute independently, but each logical function must then communicate the values computed during that iteration to the variables it is connected to before it can commence its next iteration. As shown in Fig. 4 and Fig. 5, the control flow is arranged so that, in every iteration, a logical function sends data to its logical neighbours and then waits until it has received messages back from all of these neighbours before starting the next iteration.

In the control unit, the timing controller, the iteration controller and the dimension controllers are responsible for executing the decoding steps recursively. The iteration controller activates one dimension controller at a time and passes it the iteration number. The whole control unit incorporates simple decision logic that uses a sign-controlled signal from the timing controller to indicate the first and final iteration of a data block. A simple state machine like the one shown in Fig. 4 maintains the state of each iteration in the decoder. A cycle can be in one of four states. In the output state, the results of the last dimension, which are stored in the memory module, are output. In the input state, received data move into the memory and overwrite the previous data block. In the initialization state, the extrinsic information variables are reinitialized. This ends the first iteration.



Fig. 4 State machine for iteration controller of the control unit.



Fig. 5 State machine for dimension controller of the control unit.

From the second iteration onward, the decoder is in the data processing state, which follows the state machine shown in Fig. 5. For sequential implementation, the four states are carried out consecutively. When the maximum number of dimensions is reached, the decoder carries on to the next iteration and starts the horizontal forward state without re-initializing any variables. When the maximum number of iterations is reached, it returns to the output state and repeats the same cycle. Moreover, the dimension controller, which incorporates a big finite state machine for memory read-write control, is employed so that all control signals are well matched and synchronized.
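The control flow of the iteration and dimension controllers can be traced with a short Python sketch. This is a behavioural illustration, not the hardware state machine: the function name `control_trace` is hypothetical, the dimension-controller state names are assumed from the decoding steps named in Section 3 (Fig. 5 itself is not reproduced here), and the default counts follow the simulation setup in Section 4.

```python
def control_trace(iterations=16, dimensions=4):
    """Yield (iteration, dimension, state) in the order the controllers visit them."""
    # Assumed dimension-controller states, named after the decoding steps.
    steps = ("horizontal forward", "vertical backward",
             "vertical forward", "horizontal backward")
    # First iteration: output previous results, input new data, reinitialize.
    for state in ("output", "input", "initialization"):
        yield (0, None, state)
    # Remaining iterations: data processing, dimension by dimension,
    # with no re-initialization between iterations.
    for iteration in range(1, iterations):
        for dim in range(dimensions):
            for step in steps:
                yield (iteration, dim, step)


trace = list(control_trace())
print(trace[:4])
print(len(trace))
```

After the last entry of the trace the controller would return to the output state for the next data block.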

With the proposed methodology, the timing controller, the iteration controller and the four dimension controllers can be combined into a single control unit. It consists of only one finite state machine but performs the same operations and has the same decoding effect as the direct sequential architecture.

#### 3.4 Memory

Although there are a large number of intermediate variables, those that serve as local variables and will not be referenced again in the next dimension or iteration can share a temporary register inside the control unit. In this way, the memory size is reduced and the overall implementation cost is minimized.



Fig. 6 RAM configuration for one dimension.

Both data and intermediate results are stored in the RAM. The allocation of variables, referred to as the memory map of the RAM, is shown in Fig. 6. The memory map allows a single RAM to switch between received data, intermediate variables and extrinsic information during decoding. The switch happens simply by telling the control unit to operate at a given location in the RAM. Treating intermediate variables in the same way as decoding data greatly simplifies the RAM address calculation in the control unit. Fig. 7 shows the RAM address format, which connects the fields of a decoding step to the algorithm.

| dim    | var    | col    | row    |
|--------|--------|--------|--------|
| 2 bits | 2 bits | 2 bits | 8 bits |

Fig. 7 RAM address format.

Every iteration shares the same memory space in the RAM. This sharing is made possible by not assigning an iteration field in the RAM address. The "dim" field occupies bits 13~12. The data and intermediate variables to be read or written are specified by the "var" field at bits 11~10. The column number is in bits 9~8, and the 8-bit row number is in bits 7~0. The exception is var=10<sub>2</sub>, for which the "col" field is also used to indicate which intermediate variable is accessed. The two kinds of datapath can then use one address format.
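The 14-bit address layout of Fig. 7 can be made concrete with a small Python sketch. The function names `pack_ram_address` and `unpack_ram_address` are hypothetical; the bit positions are those given in the text.

```python
def pack_ram_address(dim, var, col, row):
    """Build a 14-bit RAM address: dim[13:12] | var[11:10] | col[9:8] | row[7:0]."""
    if not (0 <= dim < 4 and 0 <= var < 4 and 0 <= col < 4 and 0 <= row < 256):
        raise ValueError("field out of range")
    return (dim << 12) | (var << 10) | (col << 8) | row


def unpack_ram_address(address):
    """Split a 14-bit RAM address back into (dim, var, col, row)."""
    return ((address >> 12) & 0x3, (address >> 10) & 0x3,
            (address >> 8) & 0x3, address & 0xFF)


address = pack_ram_address(dim=3, var=2, col=1, row=255)
print(hex(address), unpack_ram_address(address))
```

Since no field encodes the iteration number, the same address refers to the same location in every iteration, which is what lets iterations share memory.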



Fig. 8 ROM configuration.

The look-up tables reside in ROM, as shown in Fig. 8. The four operations in the logarithm domain that the ROM implements are the f-function, addition, clipped addition and subtraction. From the top down, the f-function starts at $000_{16}$, addition starts at $400_{16}$, clipped addition occupies $800_{16}$ to $BFF_{16}$, and subtraction starts at $C00_{16}$.

| opcode | operand 1 | operand 2 |
|--------|-----------|-----------|
| 2 bits | 5 bits    | 5 bits    |

Fig. 9 ROM address format.

The ROM address format is chosen to make table look-up easy. It is the concatenation of a 2-bit opcode and two 5-bit operands, as shown in Fig. 9. The 5-bit operand fields use sign-and-magnitude notation, and the look-up result is also a 5-bit sign-and-magnitude index in the logarithm domain. The f-function operation, which has an opcode of $00_2$, can be implemented by a combinational circuit. The opcode then becomes a select signal input to a multiplexer that chooses between the combinational result and the ROM output.
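A minimal Python sketch of the 12-bit ROM address formation follows. Several details are assumptions made for illustration: the opcodes for addition, clipped addition and subtraction are inferred from the table base addresses given above ($400_{16}$, $800_{16}$, $C00_{16}$), only the f-function opcode $00_2$ is stated explicitly, and the sign bit is assumed to be the most significant bit of each 5-bit operand.

```python
# Opcode values: "f" is stated in the text; the others are inferred
# from the ROM base addresses (assumption).
OPCODES = {"f": 0b00, "add": 0b01, "clipped_add": 0b10, "sub": 0b11}


def sign_magnitude5(value):
    """Encode an integer in -15..15 as 5-bit sign-and-magnitude (sign in bit 4, assumed)."""
    if not -15 <= value <= 15:
        raise ValueError("operand out of 5-bit sign-and-magnitude range")
    return (0x10 if value < 0 else 0x00) | abs(value)


def rom_address(op, operand1, operand2):
    """Concatenate opcode[11:10] | operand1[9:5] | operand2[4:0]."""
    return (OPCODES[op] << 10) | (sign_magnitude5(operand1) << 5) | sign_magnitude5(operand2)


print(hex(rom_address("add", 0, 0)))
print(hex(rom_address("clipped_add", -15, -15)))
print(hex(rom_address("sub", 0, 0)))
```

With this layout each operation owns a 1024-entry region, so the opcode alone selects the table and the two operands index within it.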

# 4 Simulation Results

The direct sequential, the modified sequential and the proposed parallel decoders were synthesized with the Synopsys computer-aided design tool using 0.38-micron technology under a 3V supply, based on a 1024-bit, rate-1/2 LDPC code. This corresponds to one of the block sizes and code rates proposed for 3G wireless turbo codes. In our simulation, we adopt 4 dimensions, each of 256 rows and 4 columns, with 16 iterations including one iteration for the input-output section. The effect of using clipping and merging under Virtex implementation is shown in Fig. 10 and Fig. 11.



Fig. 10 Performance of direct and modified sequential LDPC decoder.



Fig. 11 Memory requirement comparison between direct and modified sequential architecture.

It can be observed that the performance of the parallel decoder with the dual-port scheme is superior to that of the sequential designs. The total decoding time of the parallel decoder is 650752 clock cycles, which is 96.12% and 37.94% less than that of the direct sequential decoder and the modified sequential design, respectively. In addition, the memory requirement of the parallel decoder is 151552 bits, the same as that of the modified sequential one. Although dual-port memories may imply that more silicon is required to implement such a decoder, the chip area increase is nearly negligible because it is only a small portion of the logic in the whole design. As far as both speed and memory requirement are concerned, the proposed architecture shortens the decoding latency without increasing memory storage. Hence, it is more feasible for real-time mobile communication applications.
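As a quick arithmetic check, the cycle counts of the two sequential designs implied by the reported savings can be back-calculated. The sketch below is only that: the baseline counts it prints are derived from the percentages and the 650752-cycle figure, not values reported in this paper, and the function name `implied_baseline` is hypothetical.

```python
PARALLEL_CYCLES = 650752  # reported total decoding time of the parallel decoder


def implied_baseline(cycles, saving_percent):
    """Baseline cycle count T such that `cycles` is `saving_percent` less than T."""
    return cycles / (1 - saving_percent / 100)


direct_sequential = implied_baseline(PARALLEL_CYCLES, 96.12)
modified_sequential = implied_baseline(PARALLEL_CYCLES, 37.94)
print(round(direct_sequential), round(modified_sequential))
print(round(PARALLEL_CYCLES / modified_sequential, 4))  # fraction of modified-design time
```

The last line shows that the parallel decoder takes roughly 62% of the modified sequential decoding time under the reported saving.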

# 5 Conclusions

This paper presented a parallel architecture for an LDPC decoder. The decoder was simulated and synthesized with the Synopsys computer-aided design tool for Virtex implementation. We propose applying a dual-port parallel implementation to the modified sequential architecture to achieve a high-performance decoder without increasing the memory requirement. The simulation results show that, with the parallel design, the decoding time is reduced by more than one-third (37.94%) while the required memory storage remains the same as that of the modified sequential decoder under a 3V supply.

# Acknowledgement:

This work is supported by CityU PAG grant 7100049.

#### References:

- [1] C. Berrou, A. Glavieux and P. Thitimajshima, "Near Shannon Limit Error-correcting Coding and Decoding: Turbo Codes," *IEEE ICC*, 1993, pp. 1064-1070.
- [2] R. G. Gallager, "Low Density Parity Check Codes," *IRE Transactions on Information Theory*, vol. 8, January 1962, pp.21-28.
- [3] Ping Li and Keying Y. Wu, "Concatenated Tree Codes: A Low-complexity, High-performance Approach," *IEEE Transactions on Information Theory*, vol. 47, no. 2, February 2001, pp. 791-799.

- [4] Igal Sason, Shlomo Shamai, "Improved Upper Bounds on the Ensemble Performance of ML Decoded Low Density Parity Check Codes," *IEEE Communications Letters*, vol. 4, no. 3, March 2000, pp. 89-91.
- [5] Mohammad M. Mansour and Naresh R. Shanbhag, "Low-power VLSI Decoder Architectures for LDPC Codes," Proceedings of the International Symposium on Low Power Electronics and Design, 2002, pp. 284-289.
- [6] Thomas Mittelholzer, Ajay Dholakia and Evangelos Eleftheriou, "Reduced-complexity Decoding of Low Density Parity Check Codes for Generalized Partial Response Channels," *IEEE Transactions on Magnetics*, vol. 37, no. 2, March 2001, pp. 721-728.
- [7] Hisashi Futaki and Tomoaki Ohtsuki, "Low-density Parity-check (LDPC) Coded OFDM Systems," *IEEE Vehicular Technology Conference*, vol. 1, 2001, pp. 82-86.
- [8] Chris Howland and Andrew Blanksby, "A 200mW 1Gb/s 1024-bit Rate-1/2 Low Density Parity Check Code Decoder," *IEEE Conference on Custom Integrated Circuits*, 2001, pp. 293-296.
- [9] D. J. C. MacKay and R. M. Neal, "Near Shannon Limit Performance of Low Density Parity Check Codes," *Electronics Letters*, vol. 32, August 1996, pp. 1645-1646.
- [10] D. J. C. MacKay and R. M. Neal, "Near Shannon Limit Performance of Low Density Parity Check Codes," *Electronics Letters*, vol. 33, March 1997, pp. 457-458.

- [11] Tong Zhang, Zhongfeng Wang and Keshab K. Parhi, "On Finite Precision Implementation of Low Density Parity Check Codes Decoder," *IEEE International Symposium on Circuits and Systems*, vol. 4, 2001, pp. 202-205.
- [12] Thomas J. Richardson and Rudiger L. Urbanke, "Efficient Encoding of Low-density Parity-check Codes," *IEEE Transactions on Information Theory*, vol. 47, no. 2, February 2001, pp. 638-656.
- [13] Jilei Hou, Paul H. Siegel, Laurence B. Milstein, "Performance Analysis and Code Optimization of Low Density Parity-check Codes on Rayleigh Fading Channels," *IEEE Journal on Selected Areas in Communications*, vol. 19, no. 5, May 2001, pp. 924-934.
- [14] Ping Li and W. K. Leung, "Decoding Low Density Parity Check Codes With Finite Quantization Bits," *IEEE Communications Letters*, vol. 4, no. 2, February 2000, pp. 62-64.
- [15] W. L. Lee and Angus Wu, "Modified VLSI Implementation for Sequential LDPC Decoder," *Proceedings of WSEAS International Conference on Electronics, Control and Signal Processing*, December 2002.
- [16] Engling Yeo, Payam Pakzad, Borivoje Nikolic and Venkat Anantharam, "VLSI Architectures for Iterative Decoders in Magnetic Recording Channels," *IEEE Transactions on Magnetics*, vol. 37, no. 2, March 2001, pp. 748-755.