### A Reduced Routing Network Architecture for Partial Parallel LDPC Decoders

Houshmand Shirani-mehr<sup>1,2</sup>, Tinoosh Mohsenin<sup>3</sup>, Bevan Baas<sup>1</sup>

<sup>1</sup> VCL Computation Lab, ECE Department, UC Davis
<sup>2</sup> Intel Corporation, Folsom, CA
<sup>3</sup> University of Maryland, Baltimore County

### LDPC Codes and Their Applications

- Superior error correction performance
- Recently adopted for:
  - IEEE 802.15.3c
  - IEEE 802.11ad

$$H = \left[ \begin{array}{rrrr} 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 \end{array} \right]$$
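As a small illustration of how a parity-check matrix is used, the sketch below computes the syndrome $H c^T$ over GF(2) for the example matrix above. The function names `syndrome` and `is_codeword` are hypothetical helpers introduced only for this example.

```python
# Example parity-check matrix H from the slide (3 checks, 4 bits).
H = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
]


def syndrome(H, word):
    """Return H * word^T over GF(2); one entry per parity check."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]


def is_codeword(H, word):
    """A word is a codeword when every parity check is satisfied."""
    return not any(syndrome(H, word))


print(syndrome(H, [1, 0, 1, 0]))  # [0, 0, 0]
print(syndrome(H, [1, 1, 1, 0]))  # [1, 0, 0]
```

For this toy matrix the only codewords are 0000 and 1010; any other word violates at least one check.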





### **Partial-Parallel Decoders**

- A subset of the check and variable nodes is implemented in hardware
  - The whole matrix is processed by changing the interconnection between the implemented nodes
  - A network of muxes provides this reconfigurable interconnection
- This interconnection network results in:
  - Hardware overhead
    - 1344 4:1 muxes for a (672,588) LDPC code
  - High power dissipation
    - All muxes toggle every cycle
  - Reduced throughput
    - The muxes lie on the critical path of the signals



# Contribution

- A new decoding scheme is proposed
  - Based on matrix structure of codes in IEEE 802.15.3c and 802.11ad
  - Results in almost complete elimination of logic gates in the routing network of the decoder
  - Improvement in area, power and throughput
  - No degradation in BER performance
- The class of matrices to which the method applies is defined
- Results for the (672,588) LDPC code adopted in IEEE 802.15.3c are presented

### Outline

- Partial Parallel Decoders
- Layered Belief Propagation Decoding
- Permutational Decoding
- Decoder Implementation and Results
- Conclusion and Future Directions

### Layered Normalized Min-Sum

- This work uses layered scheduling, with normalized min-sum as the update procedure in the check nodes.
- Roughly 2x faster convergence than the standard two-phase (flooding) schedule

for $k = 0 : (Y - 1)$ do

for $i \in$ check nodes of $L_k$ do

$$Q_{ij} = Q_j - R_{ij(old)} \tag{1}$$

$$R_{ij} = Sfactor_{MS} \times \prod_{j' \in V(i) \setminus j} \operatorname{sign}(Q_{ij'}) \times \min_{j' \in V(i) \setminus j} |Q_{ij'}| \tag{2}$$

$$Q_j = Q_{ij} + R_{ij} \tag{3}$$

end for

end for
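The layered update can be sketched in plain Python as follows. This is a minimal illustration, not the decoder's hardware implementation: the function name `layered_nms_decode`, the scaling factor 0.75 (standing in for $Sfactor_{MS}$, whose value is not given here), the LLR sign convention (positive means bit 0) and the early-termination check are all assumptions made for the example.

```python
def layered_nms_decode(H, layers, llr, scale=0.75, max_iter=5):
    """Layered normalized min-sum on parity-check matrix H (list of 0/1 rows).

    layers: list of lists of row indices, processed one layer at a time.
    llr:    channel LLRs, positive meaning bit 0 (assumed convention).
    scale:  normalization factor (assumed value; plays the role of Sfactor_MS).
    """
    n = len(H[0])
    Q = list(llr)  # posterior value Q_j of each variable node
    V = [[j for j in range(n) if row[j]] for row in H]  # V(i): columns of check i
    R = {(i, j): 0.0 for i, cols in enumerate(V) for j in cols}  # messages R_ij

    hard = [1 if v < 0 else 0 for v in Q]
    for _ in range(max_iter):
        for layer in layers:
            for i in layer:
                # Step (1): remove the old check-to-variable contribution.
                q = {j: Q[j] - R[(i, j)] for j in V[i]}
                for j in V[i]:
                    others = [q[jp] for jp in V[i] if jp != j]
                    if not others:
                        continue  # degree-1 check carries no extrinsic information
                    sign = -1.0 if sum(v < 0 for v in others) % 2 else 1.0
                    # Step (2): scaled product of signs times minimum magnitude.
                    R[(i, j)] = scale * sign * min(abs(v) for v in others)
                    # Step (3): updated posterior value.
                    Q[j] = q[j] + R[(i, j)]
        hard = [1 if v < 0 else 0 for v in Q]
        checks = (sum(H[i][j] * hard[j] for j in V[i]) % 2 for i in range(len(H)))
        if not any(checks):
            break  # all parity checks satisfied
    return hard, Q


H = [[0, 1, 0, 1], [1, 0, 1, 1], [1, 0, 1, 0]]
# Codeword 1010 sent; the first bit is received with the wrong sign.
decoded, _ = layered_nms_decode(H, [[0], [1], [2]], [0.5, 3.0, -2.0, 3.0])
print(decoded)  # [1, 0, 1, 0]
```

In this toy run each row is its own layer, and the weakly wrong first bit is corrected within the first iteration.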

# Valid Mapping

- Assume the columns are partitioned into groups and the rows into layers
- A mapping from the column groups of layer L1 to the column groups of layer L2 is called valid if:
  - 1) It is one-to-one,
  - 2) It maps every non-zero submatrix in layer L1 to an equal or an all-zero submatrix in layer L2.
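The two conditions can be checked mechanically. The sketch below assumes a symbolic representation chosen only for illustration: each layer is a list with one submatrix label per column group (`None` for an all-zero submatrix), and `mapping[g]` is the column group of L2 that group `g` of L1 maps to. The name `is_valid_mapping` is hypothetical.

```python
def is_valid_mapping(layer1, layer2, mapping):
    """Check the two validity conditions for a column-group mapping L1 -> L2."""
    n = len(layer1)
    # Condition 1: one-to-one, i.e. a permutation of the column-group indices.
    if len(layer2) != n or sorted(mapping) != list(range(n)):
        return False
    # Condition 2: every non-zero submatrix of L1 lands on an equal
    # or an all-zero submatrix of L2.
    for g, sub in enumerate(layer1):
        if sub is None:
            continue  # all-zero submatrices of L1 impose no constraint
        target = layer2[mapping[g]]
        if target is not None and target != sub:
            return False
    return True


print(is_valid_mapping(list("ABCD"), list("DABC"), [1, 2, 3, 0]))  # True
print(is_valid_mapping(list("ABCD"), list("DABC"), [0, 1, 2, 3]))  # False
print(is_valid_mapping(["A", None, "C", "D"], ["D", "A", None, None], [1, 2, 3, 0]))  # True
```

The third call shows the all-zero case: submatrix C maps onto an all-zero position in L2, which the definition allows.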



MP valid from Layer 1 to Layer 2

### **Permutational Matrix**

- We call a parity-check matrix permutational if there exists a mapping and a sequence of all its layers such that the mapping is valid:
  - Between consecutive layers in the sequence.
  - From last layer to the first layer of the sequence.

$$H = \begin{bmatrix} A & B & C & D \\ D & A & B & C \\ C & D & A & B \\ B & C & D & A \end{bmatrix} \begin{array}{l} \text{Layer 1} \\ \text{Layer 2} \\ \text{Layer 3} \\ \text{Layer 4} \end{array}$$

Sequence = 1,2,3,4
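For a small symbolic matrix, the definition can be tested by brute force: try every one-to-one mapping and every cyclic ordering of the layers. The sketch below is only practical for toy sizes. It uses 0-based layer indices, represents each layer as a list of submatrix labels (`None` for all-zero), and the names `valid` and `find_permutational` are hypothetical.

```python
from itertools import permutations


def valid(l1, l2, m):
    """True if mapping m sends each non-zero submatrix of l1 to an equal or zero one in l2."""
    return all(s is None or l2[m[g]] in (None, s) for g, s in enumerate(l1))


def find_permutational(layers):
    """Return (mapping, sequence) proving the matrix is permutational, or None."""
    n_layers, n_groups = len(layers), len(layers[0])
    for m in permutations(range(n_groups)):          # candidate one-to-one mappings
        for rest in permutations(range(1, n_layers)):
            seq = (0,) + rest                        # sequence is cyclic, so fix its start
            if all(
                valid(layers[seq[k]], layers[seq[(k + 1) % n_layers]], m)
                for k in range(n_layers)             # includes last layer -> first layer
            ):
                return m, seq
    return None


H_layers = [list("ABCD"), list("DABC"), list("CDAB"), list("BCDA")]
print(find_permutational(H_layers))  # ((1, 2, 3, 0), (0, 1, 2, 3))
```

The result says that mapping each column group to the next one (cyclically) is valid along the layer sequence 1, 2, 3, 4 and from Layer 4 back to Layer 1.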





#### Cycle 1:

Effective connection matrix: [ABCD]

(First $N_c$ columns connected to check nodes through connection matrix A)

Layer processed: Layer 1



#### Cycle 2:

Effective connection matrix: [DABC]

(First $N_c$ columns connected to check nodes through connection matrix D)

Layer processed: Layer 2



#### Cycle 3:

Effective connection matrix: [CDAB]

(First $N_c$ columns connected to check nodes through connection matrix C)

Layer processed: Layer 3



#### Cycle 4:

Effective connection matrix: [BCDA]

(First $N_c$ columns connected to check nodes through connection matrix B)

Layer processed: Layer 4

All layers are processed; output bits from the VNs are in the proper order.
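The four cycles can be reproduced with a short simulation. It assumes, for illustration only, that the check nodes are hard-wired to VN positions through the fixed sequence [A, B, C, D] and that the VN contents shift by one column group after each cycle; the shift direction and the name `effective_connections` are assumptions of this sketch.

```python
WIRING = ["A", "B", "C", "D"]  # fixed connection matrix at each VN position


def effective_connections(wiring, cycles):
    """Return the effective connection matrix per cycle and the final VN ordering."""
    n = len(wiring)
    held = list(range(n))  # held[p] = column group currently stored at VN position p
    per_cycle = []
    for _ in range(cycles):
        eff = [None] * n
        for p, g in enumerate(held):
            eff[g] = wiring[p]  # column group g sees the submatrix wired at position p
        per_cycle.append("".join(eff))
        held = held[1:] + held[:1]  # shift VN contents by one group for the next layer
    return per_cycle, held


matrices, final_order = effective_connections(WIRING, 4)
print(matrices)     # ['ABCD', 'DABC', 'CDAB', 'BCDA']
print(final_order)  # [0, 1, 2, 3]
```

Only the data moves; the wiring never changes, which is why no muxes are needed. After four shifts the column groups are back in their original positions, matching the statement that the output bits end up in the proper order.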

### **General Architecture**

- Permutational matrix with $Y \times M_i$ rows and $U \times N_c$ columns ($Y$ layers of $M_i$ rows each, $U$ column groups of $N_c$ columns each)
- The routing network is based on $L_{max}$, the layer with the highest row degree.



### Characteristics of the architecture

- Number of implemented CNs: number of rows in a layer $(M_i)$
- Number of implemented VNs: number of columns $(U \times N_c)$
- Almost no gates are needed in the routing network; only a fixed wiring network is used.
- No need to shift outputs or check-node messages
  - Outputs can be registered at the end of the last cycle of each iteration
  - *R<sub>ij</sub>* values are registered internally
- The complexity of the overall routing network does not change dramatically.
  - The shifting network and the connection network based on L<sub>max</sub> are in series and can be treated as one overall routing network, comparable to the v-to-c routing network of a regular partial-parallel decoder but with no gates.
- No effect on BER

### Implementation for IEEE 802.15.3c

- LDPC codes that satisfy the permutational matrix definition:
  - All code rates in IEEE 802.15.3c
  - All code rates in IEEE 802.11ad
- Here the architecture is implemented for the (672,588) code in IEEE 802.15.3c.



(672,588) LDPC code

# **CMOS Implementation Results**

|                                         | ASSCC'10 [1]       | ISCAS'11 [2]         | CICC'07 [3] | Regular Partial-Parallel Architecture | Proposed Architecture |
|-----------------------------------------|--------------------|----------------------|-------------|---------------------------------------|-----------------------|
| CMOS fabrication process                | 65 nm              | 65 nm                | 0.13 µm     | 65 nm                                 | 65 nm                 |
| Code length (bits)                      | 672                | 672                  | 660         | 672                                   | 672                   |
| Supported code rates                    | 1/2, 5/8, 3/4, 7/8 | 1/2, 5/8, 3/4, 13/16 | 0.73        | 7/8                                   | 7/8                   |
| Input quantization (bits)               | 6                  | 5                    | 4           | 6                                     | 6                     |
| Gate count (k)                          | 647                | -                    | 690         | 138                                   | 125                   |
| Core area (mm<sup>2</sup>)              | 1.562              | 1.3                  | 7.3         | 0.891                                 | 0.718                 |
| Max. clock frequency (MHz)              | 197                | 150                  | 300         | 180.2                                 | 235                   |
| Max. iteration count (I<sub>max</sub>)  | 5                  | 15                   | 15          | 5                                     | 5                     |
| Throughput @ I<sub>max</sub> (Gbps)     | 5.79               | 3.08                 | 2.44        | 6.05                                  | 7.9                   |

- LDPC processors laid out in 65 nm CMOS
- Standard-cell design with complete place and route
  - A critical step for properly evaluating wire routing congestion
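The figures for the two 65 nm architectures in the table can be cross-checked with simple arithmetic. The sketch assumes that one layer is processed per clock cycle, so that the (672,588) code takes 4 cycles per iteration, and that throughput counts all 672 code bits; both are assumptions of this example, chosen because they reproduce the tabulated throughput. The function name `throughput_gbps` is hypothetical.

```python
def throughput_gbps(code_length, clock_mhz, iterations, cycles_per_iteration):
    """Decoded bits per second, in Gbps, when one block takes iterations * cycles clocks."""
    return code_length * clock_mhz * 1e6 / (iterations * cycles_per_iteration) / 1e9


regular = throughput_gbps(672, 180.2, 5, 4)   # regular partial-parallel architecture
proposed = throughput_gbps(672, 235, 5, 4)    # proposed architecture
print(round(regular, 2), round(proposed, 2))  # 6.05 7.9

speedup = proposed / regular                  # equals the clock ratio 235 / 180.2
area_ratio = 0.891 / 0.718                    # regular core area over proposed core area
area_saving = 1 - 0.718 / 0.891               # fraction of the regular area removed
print(round(speedup, 3), round(area_ratio, 3), round(area_saving, 3))
```

Under these assumptions the throughput gain comes entirely from the higher clock frequency, which is consistent with removing the muxes from the critical path.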

# Conclusion

- A new LDPC decoding technique is presented, and the class of codes to which it applies is defined.
- The technique is implemented for the (672,588) code adopted in IEEE 802.15.3c.
- The new architecture reduces the gates in the routing network of the decoder from 1344 4:1 muxes to 126 2:1 muxes.
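- The decoding technique yields a 30% improvement in throughput and, going by the reported core areas (0.891 mm<sup>2</sup> versus 0.718 mm<sup>2</sup>), a 19% decrease in area (equivalently, the regular design is 24% larger), with no effect on BER performance.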
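To put the mux reduction on a common scale, the sketch below converts the 4:1 muxes to 2:1-mux equivalents. It assumes a 4:1 mux is built as a tree of three 2:1 muxes and that all muxes have the same data width; real standard-cell implementations may differ, so the result is only indicative.

```python
MUX2_PER_MUX4 = 3  # assumed: a 4:1 mux as a two-level tree of three 2:1 muxes

regular_mux2_equiv = 1344 * MUX2_PER_MUX4  # regular routing network
proposed_mux2 = 126                        # proposed routing network
reduction = 1 - proposed_mux2 / regular_mux2_equiv

print(regular_mux2_equiv)            # 4032
print(f"{reduction:.1%} fewer 2:1-mux equivalents")
```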

### Acknowledgements

### Support

- ST Microelectronics
- NSF Grant 430090 and CAREER award 546907
- NSF Grant 903549 and 1018972
- Intel Corporation
- SRC GRC Grant 1598, CSR Grant 1659 and GRC Grant 1971
- Intellasys
- UC Micro
- C2S2 Focus Center, one of six research centers funded under the Focus Center Research Program (FCRP), a Semiconductor Research Corporation entity