





## An Asynchronous Routing Algorithm for Clos Networks

Wei Song and Doug Edwards

The Advanced Processor Technologies Group (APT) School of Computer Science The University of Manchester {songw, doug}@cs.man.ac.uk

Advanced Processor Technologies Group The School of Computer Science

## Outline

- Introduction of Clos networks and asynchronous circuits
- Classic Synchronous dispatching algorithms
  - Random dispatching (RD)
  - Concurrent Round-Robin Dispatching (CRRD)
- Asynchronous Dispatching (AD) algorithm
- Implementation
- Outcome

MANCHESTER

The University of Mancheste

– 32-port Clos network. Setup 6.2 ns, release 3.9 ns.

# **Switching Networks**

Telephone networks

MANCHESTEF

The University of Mancheste

- Asynchronous transfer mode (ATM) and IP networks
  - ATLANTA chip (622Mb/s/port, 1997)
  - Petastar Optical Switch (160Gb/s/port, 2003)
- Intra/inter chip interconnect
  - High-radix router with many narrow ports
    SDM: delay guaranteed services

## Asynchronous Circuit

- Asynchronous vs. synchronous
  - -Low power (clockless)
  - Possible performance boost (average latency)
  - Tolerance to process variation
- Implementation style

MANCHESTER

The Universit<sup>i</sup> of Mancheste

- 2-phase vs. 4-phase
- Bundled-data vs. quasi-delay insensitive (QDI)

#### -4-phase QDI



- An area-efficient and high-speed switching network for on-chip networks
  - High-radix router
  - Low-power consumption
  - Tolerance to process variation
  - Area efficient

MANCHESTER

The University of Manchestel The University of Manchester MANCHESTER 1824

## **Clos Switching Network**



3 stages: IMs, CMs and OMs

C(n, k, m): k IM/OMs n ports per IM/OMm CMs

LI links between IM/CM LO links between CM/OM

SMS: buffer CM MSM: buffer IM/OM *S*<sup>3</sup>: no buffer

# **Non-Blocking**

Blocking

MANCHESTER

The University of Mancheste

- An available input/output pairs may not be connected due to internal blocking.
- Strict Non-Blocking (SNB)
  - All available input/output pairs are connectable.
  - $-m \ge 2n-1$
- Rearrangeable Non-Blocking (RNB)
  - Connecting available input/output pairs may require internal permutation.
  - $-n \le m < 2n 1$

### **Area Consumption**



# **Routing Algorithms**

Optimal algorithms

MANCHESTER

The University of Mancheste

- Algorithms provide guaranteed results for all matches but with a higher complexity in time and implementation.
- Heuristic algorithms
  - Algorithms provide all or partial connections in much lower time complexity.
- Most dynamically reconfigurable Clos networks use heuristic algorithms

MANCHESTER 1824

# **Dispatching Algorithms**



Every Input/output pair has *m* paths. Packets are dispatched evenly to all CMs.

#### **Dispatching algorithm**:

the algorithm used in IM to CM packet dispatching.

Advanced Processor Technologies Group The School of Computer Science

# **Random Dispatching (RD)**

Phase 1

MANCHESTER 1824

- IPs send
   requests to
   LIs
- LIs select IPs
- Phase 2
  - CMs select
     LIs
  - Configure granted IPs



Advanced Processor Technologies Group The School of Computer Science

# **Random Dispatching (RD)**

Phase 1

MANCHESTER 1824

- IPs send
   requests to
   LIs
- LIs select IPs
- Phase 2
  - CMs select
     LIs
  - Configure granted IPs



Advanced Processor Technologies Group The School of Computer Science

#### The University of Manchester

# **Random Dispatching (RD)**

Phase 1

MANCHESTER

- IPs send
   requests to
   LIs
- LIs select IPs
- Phase 2
  - CMs select
     LIs
  - Configure granted IPs



RD cannot resolve contention in IMs.

Phase 1

The University of Manchester

- IPs request
- LIs select
- IPs select
- Go back
   (iteration)
- Phase 2
  - Same as RD



Phase 1

The University of Manchester

- IPs request
- LIs select
- IPs select
- Go back
   (iteration)
- Phase 2
  - Same as RD



Phase 1

The University of Manchester

- IPs request
- LIs select
- IPs select
- Go back
   (iteration)
- Phase 2
  - Same as RD



Advanced Processor Technologies Group The School of Computer Science

Phase 1

The University of Manchester

- IPs request
- LIs select
- IPs select
- Go back (iteration)
- Phase 2
  - Same as RD



Phase 1

The University of Manchester

- IPs request
- LIs select
- IPs select
- Go back (iteration)
- Phase 2
  - Same as RD



Phase 1

The University of Mancheste

- IPs request
- LIs select
- IPs select
- Go back
   (iteration)
- Phase 2
  - Same as RD



#### CRRD cannot resolve contention in CMs.

Advanced Processor Technologies Group The School of Computer Science

Difficulties

MANCHESTER

The University of Manchester

- Packets arrive asynchronously
- Modules are eventdriven
- Orphans are generated by failed requests

- Solution
  - Independent algorithms (allocators) run in IMs and CMs
  - Failed requests use status feedback from CMs as acknowledge.

IM alg.

MANCHESTER

The University of Manchester

- IPs request
- LIs select
- OK? Send data
- Fail? Go back
- CM alg.
  - CMs grant LIs
  - Update status feedback

The School of Computer Science



• IM alg.

MANCHESTER

The University of Mancheste

- IPs request
- LIs select
- OK? Send data
- Fail? Go back
- CM alg.
  - CMs grant LIs
  - Update status feedback



• IM alg.

MANCHESTER 1824

The University of Manchester

- IPs request
- LIs select
- OK? Send data
- Fail? Go back
- CM alg.
  - CMs grant LIs
  - Update status feedback



Advanced Processor Technologies Group The School of Computer Science

# **Comparison of Algorithms**

- Random Dispatching (AD)
   Cannot resolve contention in IMs or CMs.
- Concurrent Round-Robin Dispatching (CRRD)
  - Handle contention in IMs using iterations.
  - Cannot resolve contention in CMs.
- Asynchronous Dispatching (AD)
  - Contention in IMs is handled by asynchronous arbiter naturally.
  - Handle contention in CMs using status feedback.

Advanced Processor Technologies Group The School of Computer Science

MANCHESTER

The University of Manchester

#### MANCHESTER 1824

### **Scheduler** Architecture



Advanced Processor Technologies Group The School of Computer Science

## **CM Dispatcher/OM Scheduler**



IM Sch(0) CM\_Sch(0) InpG0 OM\_Sch RIM RCM (0) InpG1 InpG2 IM Dis CM Dis InpG3 OM\_Sch (j) req(i,1) + IM\_Sch(i) CM\_Sch(r reg(i,2) : rea(i.3) OM\_Sch reg(7,0) • rea(7.1) + IM\_Sch(7) CM\_Sch(3 (7) req(7,2) +

C(4,8,4)

CMa: CM ack CMs: CM status feedback CMa and CMs are mutually exclusive

Advanced Processor Technologies Group The School of Computer Science

MANCHESTER 1824

The University of Mancheste

### **IM Dispatcher**



Advanced Processor Technologies Group The School of Computer Science June 23rd 2010

MANCHESTER 1824



## IM/CM Match Arbiter (Multi-resource Arbiter) ((4,8,4)



• S. Golubcovs, D. Shang, F. Xia, A. Mokhov, and A. Yakovlev, "Modular approach to multi-resource arbiter design," ASYNC 2009.

Advanced Processor Technologies Group The School of Computer Science

MANCHESTER 1824

### **Request Generation**

**C**(4,8,4)



Advanced Processor Technologies Group The School of Computer Science

#### **Configuration Generation** C(4,8,4)



Advanced Processor Technologies Group The School of Computer Science

MANCHESTER 1824

The University of Manchester



## **Area and Speed**

|           | area              | gate count | percent |
|-----------|-------------------|------------|---------|
|           | $(\mathrm{um}^2)$ | (NAND2X1)  |         |
| InpG      | 6.26K             | 1.6K       | 2.5%    |
| IM_Dis    | 175.06K           | 43.8K      | 67.2%   |
| CM_Dis    | 33.75K            | 8.4K       | 12.9%   |
| OM_Sch    | 10.06K            | 2.5K       | 3.8%    |
| RIM & RCM | 13.46K            | 3.4K       | 5.2%    |
| other     | 22.15K            | 5.5K       | 8.4%    |
| Total     | 260.74K           | 65.2K      |         |

Dispatching:

setup 4.6 ns release 2.9 ns OM Scheduler:

setup 1.6 ns release 1.0 ns Total:

```
setup 6.2 ns release 3.9 ns
```

Advanced Processor Technologies Group The School of Computer Science



C(4,8,4)

The University of Manchester MANCHESTER 1824

## **MATLAB Simulation Model**



Advanced Processor Technologies Group The School of Computer Science

MANCHESTER 1824

## **Throughput Analysis**



C(4,8,4) non-blocking load RD 55% CRRD (i=4) 70% AD (i=3) 76%

Advanced Processor Technologies Group The School of Computer Science



## Conclusions

- The first asynchronous routing algorithm for general 3-stage Clos networks.
- Better throughput performance than RD and CRRD
- Future works

MANCHESTER

The University of Manchestel

- Optimize the Clos structure
- Reduce area and latency

MANCHESTER 1824

## Thank you!

#### **Questions?**

Advanced Processor Technologies Group The School of Computer Science



#### Iterations

MANCHESTER 1824

The University of Manchester

12

#### **Uniform Traffic**



MANCHESTER 1824

Advanced Processor Technologies Group The School of Computer Science

## **Results after Optimization**

• Async

MANCHESTER

- Setup 3.7 ns
- Reset 3.2 ns
- Period ~ 7 ns (140M)
- 28K Gates
- Sync
  - 350 MHz
  - Cell time = Iter+1 =5 (70M)
  - 22K Gates

- Optimization
  - Replace multiresource arbiter
  - Eager request (InpG)
  - MUTEX arbiter
  - Optimized tree arbiter