

The University of Manchester





Wei Song and Doug Edwards The Advanced Processor Technologies Group (APT) School of Computer Science The University of Manchester {songw, doug}@cs.man.ac.uk

Advanced Processor Technologies Group The School of Computer Science

#### Outline

Introduction


- Network-on-Chips (NoCs)
- Flow control: wormhole, virtual channel (VC) and spatial division multiplexing (SDM)
- SDM router
  - Implementation
  - Area and speed model
- Speculation of a VC router
  - Area and speed model
- Performance analysis
  - Latency accurate SystemC models



#### **Network-on-Chips**




#### Virtual Channel (VC)




June 30th 2010


#### **Spatial Division Multiplexing (SDM)**

[Figure: SDM router with input ports $IP_0$ to $IP_{(P-1)}$ and output ports $OP_0$ to $OP_{(P-1)}$, each split into $M$ virtual circuits of width $W/M$, an $MP \times MP$ crossbar and a switch allocator.]

[Figure: saturation throughput against virtual circuits per port ($N$) for wormhole ($P \times P$) and SDM ($NP \times NP$).]

- $th_{wormhole} = 0.67$
- $th_{SDM}(M=4) = 0.83$




#### VC vs. SDM

- VC
  - Extra virtual channels (buffer)
  - An extra VC allocator
  - Increased crossbar
  - ANoC, QoS NoC, MANGO, QNoC
- SDM
  - Increased crossbar plus extra control logic
  - No asynchronous implementation

#### **Input/Output Buffer**






 $A = 2.5WA_{C}L$ 




#### Crossbar




#### **Switch Allocator**



- S. Golubcovs, D. Shang, F. Xia, A. Mokhov, and A. Yakovlev, "Modular approach to multi-resource arbiter design," ASYNC 2009.


#### Area Consumption

- Wormhole

$$A_{IB,WH} = L(2.5WA_{C} + A_{EOF}) + A_{RC} + A_{CTL}$$

$$A_{OB,WH} = 2.5WA_{C} + A_{EOF}$$

$$A_{CB,WH} = (2W + 2)(2P^{2} - P)A_{g}$$

$$A_{A,WH} = P^{2}A_{arb}$$

- SDM

$$A_{IB,SDM} = M[L(2.5\frac{W}{M}A_{C} + A_{EOF}) + A_{RC} + A_{CTL}]$$

$$A_{OB,SDM} = 2.5WA_{C} + MA_{EOF}$$

$$A_{CB,SDM} = (\frac{2W}{M} + 2)(2M^{2}P^{2} - MP)A_{g}$$

$$A_{A,SDM} = M^{2}P^{2}A_{arb}$$
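
The area expressions above can be evaluated with a short script. This is a minimal sketch, not the authors' tool: the unit areas ($A_C$, $A_{EOF}$, $A_{RC}$, $A_{CTL}$, $A_g$, $A_{arb}$) are set to placeholder values of 1.0 because the slides do not give them, and the function and field names are hypothetical. The parameters $W=32$, $P=5$, $L=2$, $M=4$ are taken from the simulation setup later in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitAreas:
    """Unit areas named on the slide; every value here is a placeholder."""
    a_c: float = 1.0
    a_eof: float = 1.0
    a_rc: float = 1.0
    a_ctl: float = 1.0
    a_g: float = 1.0
    a_arb: float = 1.0


def wormhole_area(W, P, L, u=UnitAreas()):
    """Evaluate the wormhole area expressions as written on the slide."""
    return {
        "input_buffer": L * (2.5 * W * u.a_c + u.a_eof) + u.a_rc + u.a_ctl,
        "output_buffer": 2.5 * W * u.a_c + u.a_eof,
        "crossbar": (2 * W + 2) * (2 * P**2 - P) * u.a_g,
        "allocator": P**2 * u.a_arb,
    }


def sdm_area(W, P, L, M, u=UnitAreas()):
    """Evaluate the SDM area expressions for M virtual circuits per port."""
    return {
        "input_buffer": M * (L * (2.5 * W / M * u.a_c + u.a_eof) + u.a_rc + u.a_ctl),
        "output_buffer": 2.5 * W * u.a_c + M * u.a_eof,
        "crossbar": (2 * W / M + 2) * (2 * M**2 * P**2 - M * P) * u.a_g,
        "allocator": M**2 * P**2 * u.a_arb,
    }


wh = wormhole_area(W=32, P=5, L=2)
sdm = sdm_area(W=32, P=5, L=2, M=4)

# These two ratios do not depend on the placeholder unit areas,
# because A_g and A_arb cancel out.
for name in ("crossbar", "allocator"):
    print(f"{name}: SDM / wormhole = {sdm[name] / wh[name]:.2f}")
```

With $M=1$ the SDM expressions reduce to the wormhole ones. For the parameters used here, the model gives a crossbar about 4.7 times larger and a switch allocator 16 times larger ($M^2$) than in the wormhole router.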


#### **Area Consumption**

|                       | WH     | err(%) | SDM    | err(%) |
|-----------------------|--------|--------|--------|--------|
| Input Buffers         | 14,303 | 0.0    | 21,995 | -0.4   |
| Output Buffers        | 5,935  | 0.0    | 6,000  | 1.7    |
| Crossbar              | 4,356  | 0.0    | 21,744 | -0.2   |
| Switch Allocator      | 772    | 78.2   | 22,208 | -0.9   |
| Total                 | 25,366 | 2.4    | 71,956 | -0.3   |








#### **Critical Cycle**



$$T = 4t_{C} + 4t_{CB} + 2t_{CD} + 2t_{AD} + t_{CTL}$$

$$t_{C} = \begin{cases} l_{C} + k_{C}(P+1) & \text{wormhole,} \\ l_{C} + k_{C}(MP+1) & \text{SDM.} \end{cases}$$

$$t_{CB} = \begin{cases} l_{CB} + k_{CB}\log_{2}(P) & \text{wormhole,} \\ l_{CB} + k_{CB}\log_{2}(MP) & \text{SDM.} \end{cases}$$

$$t_{AD} = \begin{cases} l_{AD} + k_{AD}(2W+1) & \text{wormhole,} \\ l_{AD} + k_{AD}(\frac{2W}{M} + 1) & \text{SDM.} \end{cases}$$

$$t_{CD} = \begin{cases} l_{CD} + l_{C}\log_{2}(\frac{W}{2}) + k_{CD}P & \text{wormhole,} \\ l_{CD} + l_{C}\log_{2}(\frac{W}{2M}) + k_{CD}MP & \text{SDM.} \end{cases}$$
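
As a quick sanity check on the cycle-period formula, the sketch below sums the four component terms using the component delays (in ns) reported in this deck's Critical Cycle table. The function name is hypothetical, and $t_{CTL}$ is left at zero because the slides do not report it.

```python
def cycle_period(t_c, t_cb, t_cd, t_ad, t_ctl=0.0):
    """T = 4*t_C + 4*t_CB + 2*t_CD + 2*t_AD + t_CTL, all delays in ns."""
    return 4 * t_c + 4 * t_cb + 2 * t_cd + 2 * t_ad + t_ctl


# Component delays (ns) copied from the Critical Cycle table.
wormhole = dict(t_c=0.22, t_cb=0.16, t_cd=0.79, t_ad=0.57)
sdm = dict(t_c=0.34, t_cb=0.26, t_cd=0.57, t_ad=0.27)

t_wh = cycle_period(**wormhole)   # t_CTL not included
t_sdm = cycle_period(**sdm)       # t_CTL not included
print(f"wormhole: {t_wh:.2f} ns, SDM: {t_sdm:.2f} ns")
```

The sums come to 4.24 ns (wormhole) and 4.08 ns (SDM), against reported cycle periods of 4.25 ns and 4.15 ns. The small remainder would be what the $t_{CTL}$ term and the model error account for.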




#### **Critical Cycle**

|                     | WH (ns) | err(%) | SDM (ns) | err(%) |
|---------------------|---------|--------|----------|--------|
| cycle period        | 4.25 | 2.6  | 4.15 | -3.4   |
| router latency      | 2.29 |      | 2.49 |        |
| routing calculation | 0.44 |      | 0.51 |        |
| switch allocation   | 0.78 |      | 3.21 |        |
| $t_C$               | 0.22 | -9.1 | 0.34 | -5.9   |
| $t_{CB}$            | 0.16 | 1.3  | 0.26 | -3.8   |
| $t_{CD}$            | 0.79 | 7.6  | 0.57 | 4.2    |
| $t_{AD}$            | 0.57 | 6.1  | 0.27 | -0.4   |






#### **VC Router**

$$t_{C,VC} = t_{C,WH} < t_{C,SDM}$$

$$t_{CD,VC} = l_{CD} + l_C \log_2(W/2) + k_{CD}MP > t_{CD,WH} > t_{CD,SDM}$$

$$t_{AD,VC} = t_{AD,WH} > t_{AD,SDM}$$

$$t_{CB,VC} = t_{CB,WH} < t_{CB,SDM}$$

- cycle period = 5.23 ns
- routing calculation = 0.44 ns
- VC allocation = 3.21 ns
- switch allocation = 0.78 ns

#### SystemC model

- Latency accurate SystemC models
- Wormhole, SDM, VC


- 8x8, 5 ports, XY routing
- 32-bit, 4 VCs/virtual circuits
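
The XY routing used in the simulated network can be sketched as follows. This is an illustrative Python sketch rather than the authors' SystemC code. It assumes a mesh topology, a coordinate convention in which x grows to the east and y to the north, and port names `E`, `W`, `N`, `S`, `L` (local) for the 5 ports; none of these details are stated on the slide.

```python
# Coordinate change for each non-local output port (assumed convention).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_next_port(cur, dst):
    """Return the output port for one hop of dimension-ordered XY routing."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                      # finish the X dimension first
        return "E" if dx > cx else "W"
    if cy != dy:                      # then the Y dimension
        return "N" if dy > cy else "S"
    return "L"                        # arrived: deliver to the local port


def xy_route(src, dst, size=8):
    """List the output ports taken from src to dst on a size x size mesh."""
    assert all(0 <= c < size for c in (*src, *dst)), "node outside the mesh"
    ports, cur = [], src
    while True:
        port = xy_next_port(cur, dst)
        ports.append(port)
        if port == "L":
            return ports
        cur = (cur[0] + STEP[port][0], cur[1] + STEP[port][1])


print(xy_route((0, 0), (2, 1)))  # ['E', 'E', 'N', 'L']
```

Because every frame completes its X hops before any Y hop, the route between two nodes is fixed and the hop count equals the Manhattan distance between them.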


#### **Average Frame Latency**



L=2, W=32, FL=64

VC router with L=2 suffers from credit loop stall.

Both SDM and SDMCS outperform VC.

Wormhole, SDM and SDMCS have constant data transmission latency.


#### **Payload Size and Distance**



- All routers approach the maximal throughput as the payload length increases.
- FL=64 bytes achieves 90% of the maximal throughput.
- Throughput decreases as the hop count increases.
- SDM shows better throughput even in the 8-hop case.




#### Number of VCs





- Both VC and SDM improve throughput.
- SDM achieves better throughput and a better area-to-throughput gain than VC.
- SDM has the potential to support hard delay-guaranteed services.


#### Thanks!

#### **Questions?**
