

#### Asynchronous SDM Router - results from ongoing research

#### Wei Song

Advanced Processor Technologies Group, The School of Computer Science



### VC and SDM router

Figure panels: VC, SDM.

2010/4/8



#### 2-stage Clos Network









## Questions from ASYNC'10

- Router Area: SDM vs. VC
- The area consumption of switch allocator
- Throughput: SDM vs. VC
- QoS support of SDM
- Wire efficiency on ports

Area and latency models of the wormhole, SDM, and VC routers are extracted to answer these questions.



### Area Model (1)



4-phase 1-of-4 QDI with common ACK

 $A_{pipe} = 2.5WA_C$ 




### Area Model (2)



$$A_{IB,WH} = L(2.5WA_{C} + A_{EOF}) + A_{RC} + A_{CTL} \qquad A_{OB,WH} = 2.5WA_{C} + A_{EOF}$$
$$A_{IB,SDM} = M[L(2.5\frac{W}{M}A_{C} + A_{EOF}) + A_{RC} + A_{CTL}] \qquad A_{OB,SDM} = 2.5WA_{C} + MA_{EOF}$$
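
Expanding the SDM expressions makes the source of the area overhead explicit. The short derivation below uses only the symbols that appear in the formulas above and assumes no numeric values:

$$A_{IB,SDM} = 2.5LWA_{C} + M(LA_{EOF} + A_{RC} + A_{CTL}) = A_{IB,WH} + (M-1)(LA_{EOF} + A_{RC} + A_{CTL})$$
$$A_{OB,SDM} = A_{OB,WH} + (M-1)A_{EOF}$$

Under this model the $2.5LWA_{C}$ data-storage term is identical for wormhole and SDM; the overhead comes only from replicating the $A_{EOF}$, $A_{RC}$, and $A_{CTL}$ terms once per virtual circuit.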




### Area Model (4)



$$A_{A,WH} = P^2 A_{arb}$$

$$A_{A,SDM} = M^2 P^2 A_{arb}$$

An $M \times N$ multi-resource arbiter **includes $MN$ tiles**.
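
The allocator formulas can be evaluated directly to see how quickly the arbiter-tile count grows. The sketch below is illustrative only: the function name is hypothetical, and `a_arb` is left at an arbitrary unit of 1.0 because the slides give no value for $A_{arb}$.

```python
def allocator_area(ports: int, circuits: int = 1, a_arb: float = 1.0) -> float:
    """Switch allocator area from the model above: (M * P)^2 * A_arb.

    circuits=1 reduces to the wormhole case, P^2 * A_arb.
    a_arb is an arbitrary unit (assumption); the slides give no value.
    """
    return (circuits * ports) ** 2 * a_arb


P, M = 5, 4  # configuration used for the area results
wormhole = allocator_area(P)
sdm = allocator_area(P, M)
print(f"WH tiles: {wormhole:.0f}, SDM tiles: {sdm:.0f}, ratio: {sdm / wormhole:.0f}")
```

For any $P$, the model predicts an $M^2$-fold growth of the allocator when moving from wormhole to SDM, which is 16 times for $M = 4$.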

### Area Results

Table 1: Area consumption ($\mu m^2$)

|                  | WH     | err (%) | SDM    | err (%) | SDMCS  | err (%) |
|------------------|--------|---------|--------|---------|--------|---------|
| Input Buffers    | 14,303 | 0.0     | 21,995 | -0.4    | 25,953 | -0.1    |
| Output Buffers   | 5,935  | 0.0     | 6,000  | 1.7     | 6,540  | 3.4     |
| Crossbar         | 4,356  | 0.0     | 21,744 | -0.2    | 28,992 | -0.2    |
| Switch Allocator | 772    | 78.2    | 22,208 | -0.9    | 22,122 | -0.5    |
| Total            | 25,366 | 2.4     | 71,956 | -0.3    | 83,615 | 0.0     |

Configuration: P=5 (ports), L=2 (buffer length), W=32 (data width in bits), M=4 (virtual circuits)
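
To put the Table 1 numbers in proportion, the sketch below computes each router's total area relative to wormhole and the share taken by the switch allocator and the crossbar. The values are copied from the table; the dictionary and function names are illustrative only.

```python
# Totals and selected components copied from Table 1 (um^2).
total = {"WH": 25_366, "SDM": 71_956, "SDMCS": 83_615}
switch_allocator = {"WH": 772, "SDM": 22_208, "SDMCS": 22_122}
crossbar = {"WH": 4_356, "SDM": 21_744, "SDMCS": 28_992}


def relative_to_wormhole(router: str) -> float:
    """Total router area as a multiple of the wormhole router area."""
    return total[router] / total["WH"]


def share(component: dict, router: str) -> float:
    """Fraction of a router's total area taken by one component."""
    return component[router] / total[router]


for router in total:
    print(
        f"{router}: {relative_to_wormhole(router):.2f}x WH, "
        f"allocator {share(switch_allocator, router):.1%}, "
        f"crossbar {share(crossbar, router):.1%}"
    )
```

With these figures, SDM comes out at roughly 2.8 times the wormhole area and SDMCS at roughly 3.3 times, with the switch allocator and the crossbar each taking about 30% of the SDM router.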


## **Critical Cycle Analysis**




The University of Manchester



### **Speed Performance**

|            | WH   | err (%) | SDM  | err (%) | SDMCS | err (%) | VC   |
|------------|------|---------|------|---------|-------|---------|------|
| Period     | 4.25 | 2.6     | 4.15 | -3.4    | 3.12  | 3.8     | 5.23 |
| Latency    | 2.29 |         | 2.49 |         | 2.66  |         | N/A  |
| Routing    | 0.44 |         | 0.51 |         | 0.50  |         | 0.44 |
| Allocation | 0.78 |         | 3.21 |         | 3.28  |         | 3.21 |
| $t_{c}$    | 0.22 | -9.1    | 0.34 | -5.9    | 0.29  | 10.3    | 0.20 |
| $t_{CB}$   | 0.16 | 1.3     | 0.26 | -3.8    | 0.24  | 4.2     | 0.16 |
| $t_{CD}$   | 0.79 | 7.6     | 0.57 | 4.2     | 0.30  | -2.0    | 0.89 |
| $t_{AD}$   | 0.57 | 6.1     | 0.27 | -0.4    | 0.17  | 8.8     | 0.61 |
| $t_{CTL}$  | 0.00 |         | 0.00 |         | 0.00  |         | 0.78 |

Unit: ns
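
The Period row can be turned into a rough peak bandwidth per port. The sketch below assumes that every port moves W = 32 data bits per period in total (for SDM, summed over its virtual circuits); this is an assumption made for illustration, not a figure from the slides, and it ignores allocation and contention.

```python
W_BITS = 32  # data width from the configuration slides

# Cycle periods in ns, copied from the Speed Performance table.
period_ns = {"WH": 4.25, "SDM": 4.15, "SDMCS": 3.12, "VC": 5.23}


def peak_port_bandwidth_gbps(router: str) -> float:
    """Peak bandwidth of one port in Gbit/s (bits per ns).

    Assumes W_BITS are transferred per period, which is an
    illustrative simplification.
    """
    return W_BITS / period_ns[router]


def speedup_over_vc(router: str) -> float:
    """Ratio of the VC period to this router's period."""
    return period_ns["VC"] / period_ns[router]


for router in period_ns:
    print(
        f"{router}: {peak_port_bandwidth_gbps(router):.2f} Gbit/s, "
        f"{speedup_over_vc(router):.2f}x VC"
    )
```

Only the ratio between periods is independent of the bandwidth assumption; the absolute Gbit/s values should be read as an upper bound under that assumption.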




## Simulation Configuration

- Latency-accurate SystemC models
- Wormhole, SDM, SDM+CS, VC
- 8x8, 5 ports, XY routing
- 32-bit, 4 VCs/virtual circuits
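
XY routing, as listed in the configuration, can be sketched in a few lines. The example below assumes a 2D mesh in which each router has four neighbour ports plus one local port; the port names `E`, `W`, `N`, `S`, `L` and the convention that north means increasing y are hypothetical choices, not taken from the SystemC models.

```python
# Coordinate change for each neighbour port (assumed naming convention).
STEP = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def xy_route(current, destination):
    """Pick the output port: finish the X dimension first, then Y."""
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "L"  # arrived: deliver to the local port


def route_path(source, destination):
    """List the output ports taken from source to destination."""
    ports = []
    node = source
    while True:
        port = xy_route(node, destination)
        ports.append(port)
        if port == "L":
            return ports
        step_x, step_y = STEP[port]
        node = (node[0] + step_x, node[1] + step_y)


print(route_path((0, 0), (2, 1)))
```

Because a frame never turns from the Y dimension back into X, this dimension-ordered scheme is deterministic, and the hop count equals the Manhattan distance between source and destination.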




#### Injected Traffic vs. Latency



L=2, W=32, FL=64

The VC router with L=2 suffers from credit-loop stalls.

Both SDM and SDMCS outperform VC.

Wormhole, SDM and SDMCS have constant data transmission latency.




### **Payload and Hop Count**



All routers approach the maximal throughput as the payload length increases; a frame length of FL = 64 bytes reaches 90% of the maximal throughput. Throughput decreases as the hop count increases. SDM shows better throughput even in the 8-hop case.




### **Buffer Length**








#### Number of VCs/Virtual Circuits




Throughput increase from 2 to 4 VCs/virtual circuits: VC 20%, SDM 22.5%, SDMCS 6.7%.




#### Reduce the latency



- The frame latency can be reduced significantly if a frame is divided and delivered by two virtual circuits concurrently.
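
A first-order estimate shows why splitting a frame helps. The sketch below counts only the cycles needed to serialize a frame over one or more virtual circuits, each W/M bits wide; it is a simplification that ignores head latency, allocation delay, and contention, and the function name is hypothetical.

```python
import math


def serialization_cycles(frame_bytes, width_bits, circuits_per_port, circuits_used=1):
    """Cycles to push one frame through `circuits_used` virtual circuits.

    Each virtual circuit carries width_bits / circuits_per_port bits per
    cycle. Head latency and contention are ignored (simplifying assumption).
    """
    bits_per_cycle = (width_bits // circuits_per_port) * circuits_used
    return math.ceil(frame_bytes * 8 / bits_per_cycle)


# FL = 64 bytes, W = 32 bits, M = 4 virtual circuits per port.
one_circuit = serialization_cycles(64, 32, 4, circuits_used=1)
two_circuits = serialization_cycles(64, 32, 4, circuits_used=2)
print(one_circuit, two_circuits)
```

Under these assumptions the serialization part of the latency halves when two circuits are used; the total frame latency shrinks by less than half, because the head latency is not reduced by splitting.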






## Conclusion

- SDM+CS achieves the best performance and the best gain (except for wormhole) for best-effort traffic.
- SDM+CS has smaller area than VC with the same configuration.
- SDM has the potential to support services with hard delay guarantees.

## Ongoing Work

- The VC router designed by Tomaz Felicijan (QoS Router) has been re-implemented.
- Prove the throughput improvement using probability theory.
- Estimate the theoretical throughput bound of the 2-stage Clos network and optimize it.
- Reduce the area consumption of the Clos allocator.




## Thanks! Questions?
