A Pin- and Power-Efficient 20.83Gb/s/wire 0.94pJ/bit Forwarded Clock CNRZ-5 Coded SerDes up to 12mm for MCM Packages in 28nm CMOS

A. Shokrollahi<sup>1</sup>, D. Carnelli<sup>1</sup>, J. Fox<sup>2</sup>, K. Hofstra<sup>1</sup>, B. Holden<sup>1</sup>,
A. Hormati<sup>1</sup>, P. Hunt<sup>2</sup>, M. Johnston<sup>1</sup>, S. Pesenti<sup>1</sup>, R. Simpson<sup>2</sup>,
D. Stauffer<sup>1</sup>, A. Stewart<sup>2</sup>, G. Surace<sup>2</sup>, A. Tajalli<sup>1</sup>, O. Talebi Amiri<sup>1</sup>,
A. Tschank<sup>2</sup>, R. Ulrich<sup>1</sup>, Ch. Walter<sup>1</sup>, F. Licciardello<sup>1</sup>,
Y. Mogentale<sup>1</sup>, A. Singh<sup>1</sup>

<sup>1</sup>Kandou Bus, Lausanne, Switzerland; <sup>2</sup>Kandou Bus, Northampton, United Kingdom

# Outline

- Introduction and motivation
- Signaling
- Macro architecture
  - Common block
  - Tx
  - Rx
- System Implementation
- Results
- Conclusion

## Motivation



- 2.5D integration is a means to connect multiple dies in a low-cost package
  - Increases yield, lowers manufacturing costs
- Efficient solution requires high-bandwidth, low-power SerDes

## Constraints of a Versatile 2.5D Solution

- Want very high throughput using few wires
  - Enable use of low-cost substrates
  - Use 20+ Gb/s per signal wire
  - Need very high bandwidth per mm die-edge
- Requires relatively large distance
  - Need up to 12 mm, scalability to longer distances desirable
- Demands very low power
  - Ideally same as or lower than on-chip links

### 2.5D Options

| Method                 | # wires | Distance   | Power |
|------------------------|---------|------------|-------|
| Silicon interposers    | Many    | Very short | Good  |
| Wafer level Processing | Many    | Very short | Good  |
| SerDes                 | Fewer?  | Longer?    | ?     |

- Need SerDes with
  - Very high throughput per wire
  - Very low power at desired channel lengths

### SerDes: Prior Work

| Reference     | [1]     | [2]      | [3]     | This work |
|---------------|---------|----------|---------|-----------|
| Year          | 2009    | 2012     | 2014    | 2016      |
| pJ/bit        | 1.9     | 0.54     | 1.4     | 0.94      |
| BW/pin (Gb/s) | 8.9     | 20       | 6       | 20.83     |
| Technology    | 45nm    | 28nm     | 32nm    | 28nm      |
| Signaling     | D       | GRSE     | D       | CNRZ-5    |
| Channel loss  | ≤ 20 dB | ≤ 1 dB   | ≤ 3 dB  | ≤ 3 dB    |
| Substrate     | SiCa    | MCM      | Meg6    | MCM       |
| Reach         | ≤ 40 mm | ≤ 4.5 mm | ≤ 19 mm | ≤ 12 mm   |

SiCa = silicon carrier, SE = single-ended, D = differential, GRSE = ground-referenced single-ended, MCM = MCM organic substrate, Meg6 = Megtron 6

## Signaling, Implementation

- Choice of signaling scheme is a major ingredient for realizing performance targets, especially power
  - Signaling needs to have high pin-efficiency and built-in robustness to noise (common mode, SSO, ISI, EMI, etc.)



- Correlated signals on wires, matched comparator network
  - Signals on wires belong to a codebook
- Chordal code = codebook + comparators

# Chord Signaling

- Correct design of code
  - Increases throughput per pin
  - Reduces ISI
  - Eliminates SSO noise
  - Eliminates common mode noise
  - Reduces EMI noise

## CNRZ-5

- Transmits 5 bits on a collection of 6 wires in every UI.
- Codewords are judiciously chosen permutations of ±[+1, -1/3, -1/3, -1, +1/3, +1/3] (one value per wire, summing to zero).
- Five comparators:
  - Two compare one wire value against another,
  - Two compare average of two wire values against a third,
  - One compares average of three wire values against the other three.
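The comparator structure above can be checked numerically. The following is a minimal sketch, not the circuit itself: the assignment of comparators to specific wires and the bit-to-sign mapping are assumptions inferred from the codebook listed in this deck, and the names `BASIS`, `encode`, `comparators`, and `decode` are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Assumed sub-channel vectors over wires w0..w5 (inferred from the codebook,
# not stated on the slides): one 3-vs-3, two 1-vs-1, two 2-vs-1.
BASIS = [
    (1, 1, 1, -1, -1, -1),  # avg(w0, w1, w2) vs avg(w3, w4, w5)
    (1, -1, 0, 0, 0, 0),    # w0 vs w1
    (1, 1, -2, 0, 0, 0),    # avg(w0, w1) vs w2
    (0, 0, 0, 0, 1, -1),    # w4 vs w5
    (0, 0, 0, -2, 1, 1),    # avg(w4, w5) vs w3
]

def encode(bits):
    """Map 5 bits to a 6-wire codeword (assumed mapping: 1 -> +1, 0 -> -1)."""
    signs = [1 if b else -1 for b in bits]
    return tuple(
        sum(F(s * v[i], 3) for s, v in zip(signs, BASIS)) for i in range(6)
    )

def comparators(w):
    """Return the five comparator input differences for wire values w."""
    return (
        (w[0] + w[1] + w[2]) / 3 - (w[3] + w[4] + w[5]) / 3,
        w[0] - w[1],
        (w[0] + w[1]) / 2 - w[2],
        w[4] - w[5],
        (w[4] + w[5]) / 2 - w[3],
    )

def decode(w):
    """Slice each comparator output at zero to recover the 5 bits."""
    return tuple(1 if c > 0 else 0 for c in comparators(w))

codebook = {bits: encode(bits) for bits in product((0, 1), repeat=5)}

# Every codeword is balanced (sums to zero) and decodes back to its bits.
for bits, w in codebook.items():
    assert sum(w) == 0
    assert decode(w) == bits
print(len(set(codebook.values())), "distinct codewords")
```

Under these assumptions, each comparator input takes only two values (±2/3 or ±1), so every comparator behaves like a binary slicer even though each wire carries four levels.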

### **CNRZ-5** Codebook

[1/3. -1/3. -1. -1/3. 1/3. 1] [1, 1/3, -1/3, -1, -1/3, 1/3] [-1/3, -1, 1/3, -1/3, 1/3, 1][1/3, -1/3, 1, -1, -1/3, 1/3] [-1/3, 1/3, -1, -1/3, 1/3, 1] [1/3, 1, -1/3, -1, -1/3, 1/3] [-1, -1/3, 1/3, -1/3, 1/3, 1] [-1/3, 1/3, 1, -1, -1/3, 1/3] [1/3, -1/3, -1, 1, -1/3, 1/3] [1, 1/3, -1/3, 1/3, -1, -1/3] [-1/3, -1, 1/3, 1, -1/3, 1/3] [1/3, -1/3, 1, 1/3, -1, -1/3] [-1/3, 1/3, -1, 1, -1/3, 1/3] [1/3, 1, -1/3, 1/3, -1, -1/3] [-1, -1/3, 1/3, 1, -1/3, 1/3] [-1/3, 1/3, 1, 1/3, -1, -1/3]

[1/3. -1/3. -1. -1/3. 1. 1/3] [1, 1/3, -1/3, -1, 1/3, -1/3] [-1/3, -1, 1/3, -1/3, 1, 1/3] [1/3, -1/3, 1, -1, 1/3, -1/3] [-1/3, 1/3, -1, -1/3, 1, 1/3] [1/3, 1, -1/3, -1, 1/3, -1/3] [-1, -1/3, 1/3, -1/3, 1, 1/3] [-1/3, 1/3, 1, -1, 1/3, -1/3] [1/3, -1/3, -1, 1, 1/3, -1/3] [1, 1/3, -1/3, 1/3, -1/3, -1] [-1/3, -1, 1/3, 1, 1/3, -1/3] [1/3, -1/3, 1, 1/3, -1/3, -1] [-1/3, 1/3, -1, 1, 1/3, -1/3] [1/3, 1, -1/3, 1/3, -1/3, -1] [-1, -1/3, 1/3, 1, 1/3, -1/3] [-1/3, 1/3, 1, 1/3, -1/3, -1]

### **CNRZ-5** Encoder



- The encoder computes two bits Wi[0], Wi[1] for each wire i = 0..5 from the input bits b0,...,b4; the Tx output driver uses them to create the codeword values w0,...,w5 on the wires.
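A rough model of this encoder step is sketched below. It is an illustration under stated assumptions, not the actual encoder logic: the sub-channel vectors are inferred from the codebook in this deck, and the 2-bit level encoding (binary-weighted, with `00` as the lowest level) and the names `encode_wire_bits` and `drive` are hypothetical.

```python
from fractions import Fraction as F

# Assumed sub-channel vectors over wires w0..w5 (inferred from the codebook).
BASIS = [
    (1, 1, 1, -1, -1, -1),
    (1, -1, 0, 0, 0, 0),
    (1, 1, -2, 0, 0, 0),
    (0, 0, 0, 0, 1, -1),
    (0, 0, 0, -2, 1, 1),
]

def encode_wire_bits(b):
    """Map input bits b0..b4 to a list of (Wi[0], Wi[1]) pairs for wires 0..5."""
    signs = [2 * bit - 1 for bit in b]          # assumed: 1 -> +1, 0 -> -1
    wire_bits = []
    for i in range(6):
        n = sum(s * v[i] for s, v in zip(signs, BASIS))  # one of -3, -1, 1, 3
        k = (n + 3) // 2                                  # level index 0..3
        wire_bits.append((k & 1, k >> 1))                 # (Wi[0], Wi[1])
    return wire_bits

def drive(wire_bits):
    """Hypothetical binary-weighted driver: level = (2*W1 + W0) * 2/3 - 1."""
    return [F(2 * (2 * w1 + w0), 3) - 1 for w0, w1 in wire_bits]

bits = (0, 1, 1, 0, 1)
print(encode_wire_bits(bits))
print(drive(encode_wire_bits(bits)))
```

Two bits per wire are enough because each wire only ever takes one of the four levels -1, -1/3, +1/3, +1.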

## **Binary Values at Input to Slicers**

- Core concept is that of ISI-ratio [4], [5].
- This property massively reduces ISI noise

| Average of w0..w2 | Wires w0, w1, w2 | Wires w3, w4, w5 | Average of w3..w5 |
|------|-----------------|-----------------|------|
| -1/3 | [1/3, -1/3, -1, | -1/3, 1/3, 1]   | 1/3  |
| 1/3          | [1, 1/3, -1/3,  | -1, -1/3, 1/3 ] | -1/3 |
| -1/3         | [-1/3, -1, 1/3, | -1/3, 1/3, 1 ]  | 1/3  |
| 1/3          | [1/3, -1/3, 1,  | -1, -1/3, 1/3 ] | -1/3 |
| -1/3         | [-1/3, 1/3, -1, | -1/3, 1/3, 1 ]  | 1/3  |
| 1/3          | [1/3, 1, -1/3,  | -1, -1/3, 1/3 ] | -1/3 |
| -1/3         | [-1, -1/3, 1/3, | -1/3, 1/3, 1 ]  | 1/3  |
| 1/3          | [-1/3, 1/3, 1,  | -1, -1/3, 1/3 ] | -1/3 |
| -1/3         | [1/3, -1/3, -1, | 1, -1/3, 1/3 ]  | 1/3  |
| 1/3          | [1, 1/3, -1/3,  | 1/3, -1, -1/3]  | -1/3 |
| -1/3         | [-1/3, -1, 1/3, | 1, -1/3, 1/3 ]  | 1/3  |
| 1/3          | [1/3, -1/3, 1,  | 1/3, -1, -1/3]  | -1/3 |
| -1/3         | [-1/3, 1/3, -1, | 1, -1/3, 1/3 ]  | 1/3  |
| 1/3          | [1/3, 1, -1/3,  | 1/3, -1, -1/3]  | -1/3 |
| -1/3         | [-1, -1/3, 1/3, | 1, -1/3, 1/3 ]  | 1/3  |
| 1/3          | [-1/3, 1/3, 1,  | 1/3, -1, -1/3]  | -1/3 |

| Average of w0..w2 | Wires w0, w1, w2 | Wires w3, w4, w5 | Average of w3..w5 |
|------|-----------------|----------------|------|
| -1/3 | [1/3, -1/3, -1, | -1/3, 1, 1/3]  | 1/3  |
| 1/3  | [1, 1/3, -1/3,  | -1, 1/3, -1/3] | -1/3 |
| -1/3 | [-1/3, -1, 1/3, | -1/3, 1, 1/3]  | 1/3  |
| 1/3  | [1/3, -1/3, 1,  | -1, 1/3, -1/3] | -1/3 |
| -1/3 | [-1/3, 1/3, -1, | -1/3, 1, 1/3]  | 1/3  |
| 1/3  | [1/3, 1, -1/3,  | -1, 1/3, -1/3] | -1/3 |
| -1/3 | [-1, -1/3, 1/3, | -1/3, 1, 1/3]  | 1/3  |
| 1/3  | [-1/3, 1/3, 1,  | -1, 1/3, -1/3] | -1/3 |
| -1/3 | [1/3, -1/3, -1, | 1, 1/3, -1/3]  | 1/3  |
| 1/3  | [1, 1/3, -1/3,  | 1/3, -1/3, -1] | -1/3 |
| -1/3 | [-1/3, -1, 1/3, | 1, 1/3, -1/3]  | 1/3  |
| 1/3  | [1/3, -1/3, 1,  | 1/3, -1/3, -1] | -1/3 |
| -1/3 | [-1/3, 1/3, -1, | 1, 1/3, -1/3]  | 1/3  |
| 1/3  | [1/3, 1, -1/3,  | 1/3, -1/3, -1] | -1/3 |
| -1/3 | [-1, -1/3, 1/3, | 1, 1/3, -1/3]  | 1/3  |
| 1/3  | [-1/3, 1/3, 1,  | 1/3, -1/3, -1] | -1/3 |
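The binary-valued slicer inputs tabulated above can be tied to the ISI-ratio idea with a short calculation. This sketch takes the ISI-ratio of a comparator to be the ratio of the largest to the smallest input magnitude it sees over all codewords, which is an assumed working definition; the comparator wire assignments are likewise inferred from the codebook, and `isi_ratio` is a hypothetical helper.

```python
from fractions import Fraction as F
from itertools import product

# Assumed sub-channel vectors (inferred from the codebook in this deck).
BASIS = [
    (1, 1, 1, -1, -1, -1),
    (1, -1, 0, 0, 0, 0),
    (1, 1, -2, 0, 0, 0),
    (0, 0, 0, 0, 1, -1),
    (0, 0, 0, -2, 1, 1),
]
CODEBOOK = [
    tuple(sum(F(s * v[i], 3) for s, v in zip(signs, BASIS)) for i in range(6))
    for signs in product((-1, 1), repeat=5)
]

# Comparator weights applied to (w0..w5); wire assignment is an assumption.
COMPARATORS = [
    (F(1, 3), F(1, 3), F(1, 3), F(-1, 3), F(-1, 3), F(-1, 3)),
    (1, -1, 0, 0, 0, 0),
    (F(1, 2), F(1, 2), -1, 0, 0, 0),
    (0, 0, 0, 0, 1, -1),
    (0, 0, 0, -1, F(1, 2), F(1, 2)),
]

def isi_ratio(weights, codewords):
    """Largest over smallest comparator input magnitude across codewords."""
    mags = {abs(sum(a * w for a, w in zip(weights, cw))) for cw in codewords}
    return max(mags) / min(mags)

cnrz5 = [isi_ratio(c, CODEBOOK) for c in COMPARATORS]

# For contrast: a single 4-level signal sliced at zero.
four_level = [(F(-1),), (F(-1, 3),), (F(1, 3),), (F(1),)]
pam4_like = isi_ratio((1,), four_level)
print(cnrz5, pam4_like)
```

Under this definition all five comparators come out at a ratio of 1, while the plain four-level comparison comes out at 3.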

# Signal Integrity through Coding

- CM noise
  - Comparators designed to tolerate common-mode noise
  - Balanced values across wires
- EMI
  - Codewords designed to minimize EM field
- Crosstalk
  - Code designed to tolerate some crosstalk
  - Complemented by sensible design
- SSO noise
- ISI
- Reference-less







© 2016 IEEE International Solid-State Circuits Conference




## Common Block

- CMIP consists of:
  - Main PLL
    - Ring oscillator, produces 2-phase 8UI clock (3.125GHz)
    - Programmable to cover half-rate to full-rate links (12.5 Gbaud to 25 Gbaud)
  - Bandgap
  - Temperature sensor
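The clock and throughput figures in this deck follow from simple arithmetic, sketched here. The inputs (5 bits per UI, 6 wires, an 8UI clock period) come from the slides; the function name `link_rates` and its parameter names are hypothetical.

```python
def link_rates(baud_hz, bits_per_ui=5, wires=6, clock_period_ui=8):
    """Derive clock frequency and throughput from the symbol rate."""
    aggregate = baud_hz * bits_per_ui
    return {
        "clock_hz": baud_hz / clock_period_ui,  # 8UI clock
        "aggregate_bps": aggregate,             # total over the 6-wire group
        "per_wire_bps": aggregate / wires,      # pin-efficiency figure
    }

full = link_rates(25e9)    # full rate
half = link_rates(12.5e9)  # half rate

for name, r in (("full", full), ("half", half)):
    print(
        f"{name}: clock {r['clock_hz'] / 1e9:.4f} GHz, "
        f"aggregate {r['aggregate_bps'] / 1e9:.1f} Gb/s, "
        f"per wire {r['per_wire_bps'] / 1e9:.2f} Gb/s"
    )
```

At 25 Gbaud this gives the 3.125 GHz clock, 125 Gb/s aggregate, and 20.83 Gb/s/wire quoted in the title and results.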

#### Transmitter











# **Transmitter Blocks**

- SST output driver
  - 4 levels, 300mVpp
  - Balanced, SSO free
  - 75 ohms
    - 125Ω || 190Ω
  - Vcm=Vdd/2
  - ¼ rate arch
- CMOS output in JTAG/test mode



# **Receiver Blocks**

- Continuous Time Front End:
  - DC coupled, with level shifter
  - T-Coils for passive equalization
  - Multi-input gain stage (decoder) that performs linear combination of incoming signal
- Discrete Time Front End:
  - 4-ph data sampling system, followed by 4:32 demux
- Clock data alignment block (CDA) is used to align sampling clock edge to center of data eye
- Wide-band PLL to track correlated jitter (data/clk)
  - Uses 8UI fwd clock as ref clock
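The data path through the discrete-time front end can be modeled behaviorally as follows. This is a sketch under assumptions: it reads "4:32 demux" as each of the four sampling phases being deserialized 1:8, and the bit ordering within the 32-sample word is a guess; `four_phase_sample` and `demux_4_to_32` are hypothetical names.

```python
def four_phase_sample(stream):
    """Split a serial sample stream into 4 quarter-rate phases."""
    return [stream[p::4] for p in range(4)]

def demux_4_to_32(phases):
    """Collect 8 samples per phase into 32-sample words in time order."""
    n_words = min(len(p) for p in phases) // 8
    words = []
    for k in range(n_words):
        word = []
        for j in range(8):          # 8 consecutive samples from each phase
            for p in range(4):      # interleave the 4 phases back together
                word.append(phases[p][8 * k + j])
        words.append(word)
    return words

stream = list(range(64))            # stand-in for 64 consecutive UIs
words = demux_4_to_32(four_phase_sample(stream))
print(words[0][:8], len(words))
```

With this reading, each phase runs at a quarter of the symbol rate (6.25 GHz at 25 Gbaud) and the 32-wide words update at 25 Gbaud / 32 = 781.25 MHz.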

## **Receiver Block Diagrams**




Paper 10.1: A Pin- and Power-Efficient 20.83Gb/s/wire 0.94pJ/bit Forwarded Clock CNRZ-5 Coded SerDes up to 12mm for MCM Packages in 28nm CMOS
































# Bump Map and Technology

- vss: ground
- vdda: Rx/PLL analog power (1.0 V)
- vddh: PLL power supply (1.5V-1.8V)
- vddd: macro digital power supply (1.0V)

- Anabus signal
- Tx FCLK
- Tx signals
- Rx FCLK
- Rx signals
- Unused

### Process:

- TSMC 28nm HPM
- Metal stack is 9 layer: 6X2R + 1 UTALRDL
- DGO process (dual gate oxide, 1.0V and 1.8V devices).
- Devices used: HVT, LVT and nominal VT (3 types)

# Testchip



- Architecture:
  - One instance of CNRZ-5 IP (Tx + Rx)
  - One common block
  - 62.5 Gb/s and 125 Gb/s over six wires
- Die size (actual, without scribe): 2138.4µm x 1386.9µm (2.96 mm²)

# Chip Micrograph



## Features

| Technology                  | 28nm CMOS HPM, VDD=1.0V, 9M, DGO                                                                                                                                |
|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
| MCM Channels                | Losses (s21) of ~0.6dB, ~1.25dB and ~2.5dB                                                                                                                      |
| Data Rate                   | 10.42-20.83 Gb/s/wire (12.5-25 Gbaud)                                                                                                                           |
| Power and Energy Efficiency | 117.5 mW at 125 Gbps                                                                                                                                            |
| BER                         | < 1e-15 at 25 Gbaud                                                                                                                                             |
| Testability                 | IP: internal loopback; Rx eyescope; PRBS31 pattern generation and verification; analog test bus. Testchip: pattern generators; noise generators.                |

# Demo MCM



- Organic GX13 substrate, 222 stack up.
- 19mm square, 18 x 18 BGA, 1mm pitch.
- Populate up to 4 GW die interconnected in pairs.
- Provides up to 3 channels of 5mm, 12mm and 24mm.
- Channel trace length matched to ~1 µm.
- Channel losses (s21) of ~0.6dB, ~1.25dB and ~2.5dB.
- CNRZ-5 channels, 50 Ω:
  - Trace width = 19.5µm
  - Space = 55.5µm
- Two dies provide breakout to Rx and Tx 6-wire interfaces plus FCLK, respectively.

# **Evaluation Board**

- H+S connectors, type H&S MXP
- MCM is solder-ball assembled onto a host PCB
- Solder ball array, 1mm pitch, fully populated
- Organic substrate
- MCM body size 19mm, 18 x 18 BGA
- 25 Gb/s HS signals off-MCM to PCB and H+S connectors from TC2 and TC3
- Single PCB design capable of accepting a socket and soldered MCMs

## Results




# Measured Power at 125 Gb/s

| Block | Supply | V     | mA     | mW      | Share of total | Block share |
|-------|--------|-------|--------|---------|----------------|-------------|
| CMIP  | VDDA   | 0.925 | 2.253  | 2.084   | 1.77%          | 8.48%       |
| CMIP  | VDDH   | 1.400 | 5.616  | 7.862   | 6.69%          |             |
| CMIP  | VDDD   | 0.800 | 0.032  | 0.026   | 0.02%          |             |
| TXIP  | VDDA   | 0.925 | 54.927 | 50.807  | 43.20%         | 54.17%      |
| TXIP  | VDDH   | 1.400 | 3.996  | 5.594   | 4.76%          |             |
| TXIP  | VDDD   | 0.800 | 9.125  | 7.300   | 6.21%          |             |
| RXIP  | VDDA   | 0.925 | 31.152 | 28.816  | 24.50%         | 37.35%      |
| RXIP  | VDDH   | 1.400 | 8.177  | 11.448  | 9.73%          |             |
| RXIP  | VDDD   | 0.800 | 4.584  | 3.667   | 3.12%          |             |

| Quantity                | Value   |
|-------------------------|---------|
| P<sub>total</sub> [mW]  | 117.605 |
| Rate [Gb/s]             | 125     |
| E<sub>bit</sub> [pJ/b]  | 0.941   |
### Results




[Figure: measured results plot; surviving axis label: "CDA probability"]

## Results (Continued)

- Target BER of 1e-15 achieved at 25 Gbaud
  - Specified 12mm MCM channel across power supply corners and temperature stress
- Target BER of 1e-15 achieved at half data rate 12.5 Gbaud on 24 mm channel
- Target BER achieved under stress test (±5% supply tolerance, temperature range from 0 to 100 °C)
- Target BER achieved at 30 Gbaud on 12mm channel under nominal power supplies and temperature

## Conclusion

- Demonstrated an implementation of a SerDes based on CNRZ-5 coding, transmitting 5 bits on 6 correlated wires in every UI
- Coding has built-in resilience to various types of noise which helps lower the power consumption at high speeds
- Implementation uses a forwarded clock architecture and transmits up to 125 Gb/s on 6 wires at 0.94 pJ/bit

## References

- [1] Kim et al., "A 10 Gb/s compact low-power serial I/O with DFE-IIR equalization in 65 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 44, pp. 3526-3538, 2009.
- [2] Poulton et al., "A 0.54pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Haul Serial Link in 28nm CMOS for Advanced Packaging Applications," IEEE Journal of Solid-State Circuits, vol. 48, pp. 3207-3218, 2013.
- [3] Dickson et al., "A 1.4 pJ/bit, power scalable 16x12 Gb/s source-synchronous I/O with DFE receiver in 32 nm SOI CMOS technology," in *Proc. IEEE Custom Integrated Circuits Conf.*, 2014, pp. 10-5.
- [4] Hormati et al., "Method and Apparatus for Low-Power Chip-to-Chip Communications with Constrained ISI-Ratio," U.S. Patent 9,100,232.
- [5] A. Hormati and A. Shokrollahi, "ISI tolerant signaling: a comparative study of PAM4 and ENRZ," DesignCon 2016.

## Thank you