# Synthesis and Performance Optimization of a Switching Nano-crossbar Computer

Marie Skłodowska-Curie grant agreement No 691178 (European Union's Horizon 2020 research and innovation programme)

#### Dan Alexandrescu<sup>1</sup>, Mustafa Altun<sup>2</sup>, Lorena Anghel<sup>3</sup>, Anna Bernasconi<sup>4</sup>, Valentina Ciriani<sup>5</sup>, *Luca Frontini*<sup>5</sup>, Mehdi Tahoori<sup>6</sup>

<sup>1</sup>IROC Technologies, Grenoble, France, dan.alexandrescu@iroctech.com

<sup>2</sup>Dept. of Electronics and Communication Engineering, Istanbul Technical University, Turkey, altunmus@itu.edu.tr

<sup>3</sup>TIMA laboratory Grenoble-Alpes University, France, lorena.anghel@imag.fr

<sup>4</sup>Dipartimento di Informatica, Università di Pisa, Italy, anna.bernasconi@unipi.it

<sup>5</sup>Dipartimento di Informatica, Università degli Studi di Milano, Italy, {valentina.ciriani, luca.frontini}@unimi.it

<sup>6</sup>Karlsruhe Institute of Technology, Karlsruhe, Germany, tahoori@ira.uka.de

### Introduction

#### CMOS technology

- Transistor size has shrunk for decades
- The trend reached a critical point

The Moore's Law era is coming  $\$  to an end

#### New emerging technologies

- Biotechnologies, molecular-scale self-assembled systems
- Graphene structures
- Switching lattices arrays

These technologies are in an early state

→ Ξ →

A novel approach to synthesis, design and testing is necessary, focused on the properties of the devices that are the main factors for a future technology choice

#### Our main goal is to design an emerging nanocomputer

Luca Frontini

Models based on diodes, FETs, and four-terminal switches



- < ∃ →

# The Switching Lattices

Switching Lattices are two-dimensional array of four-terminal switches

- When switches are ON all terminals are connected, when OFF all terminals are disconnected
- Each switch is controlled by a boolean literal, 1 or 0
- The boolean function *f* is the SOP of the literals along each path from **top** to **bottom**
- The function synthesized by the lattice is:

$$f = x_1 x_2 x_3 + x_1 x_2 x_5 x_6 +$$

 $+x_4x_5x_2x_3 + x_4x_5x_6$ 



# **Research Objectives**

Main objective: realizing a SSM with switching nanoarrays Long term vision: being a foundational methodology for future computing devices replacing CMOS



**①** Crossbar model to build arithmetic and memory elements, with focus on:

- size and power consumption
- delay and reliability

**2** Arithmetic and memory elements as building blocks of a computer:

- adders and multiplexers
- flip-flop & registers
- 3 Realize a nano-crossbar based synchronous state machine (SSM)
  - implementing arithmetic and logic elements
  - using technology parameters (area, delay, power, reliability)
  - the SSM uses a complete logic flow and clocked control over\_state\_registration

Our model target: nanowire/nanotubes crossbar arrays, magnetic switch-based structures and crossbar memories

#### 1 Models for switching nano-crossbars

- the previous models are developed for basic logic operations
- we generalize the model to be applicable to any Boolean function
- we improve the model for diode/transistor-based crossbar
- we develop a new model based on four-terminal switches

### **2** Performance optimization

- previous crossbar circuit implementation considers only area and reliability
- we consider also delay and power dissipation
- **3** Switching nano-crossbars circuit implementation
  - previous implementation performs simple arithmetic operations
  - this project will implement complex arithmetic synthesis and memory elements

伺 ト イヨト イヨト

### Research Methodology and Approach

- **Technological part:** model electrical and physical characteristics of the technologies
- **Computational part:** use of logic synthesis, graph theory, probability theory, CAD
- 1 Finding optimal crossbar sizes, modeling and optimization
  - graph theory and circuit complexity techniques
  - the problem is likely NP-complete
  - identify the problem as a Boolean satisfiability problem and try heuristic approaches

#### **2** Implementing arithmetic and memory elements

- compare results with CMOS
- determine the parameters of the technologies
- design a comprehensive optimization software package
- **3** Realizing a **Synchronous State Machine** 
  - programmable multi-array architecture
  - · composed only by four-terminal switches

A = A A =
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A
A

# Logic Synthesis: diode, FET, 4-terminal switches



- **Diode:** (number of products in f) × (number of literals in f + 1)
- **FET:** (number of literals in *f*) × (number of products in *f* + number of products in *f<sup>D</sup>*)
- 4-terminal:(number of products in f) × (number of products in  $f^D$ )

For example,  $f = x_1 \overline{x_2} + \overline{x_1} x_2$  size is:

- Diode:  $2 \times 5$
- FET: 4 × 4
- 4-terminal: 2 × 2

## Logic synthesis of a 4-terminal array

#### • the 4-switch dimension:

(number of products in f) × (number of products in  $f^D$ ) is not necessarily optimal.

- to find the optimal solution is necessary to enumerate all top-to-bottom paths
- this task grows exponentially with the array size.
- for these reasons we propose this, preliminary, brute-force **algorithm**
- we are now studying more efficient heuristics



- each output of benchmark circuit is considered as a separate function
- the number of products of f and f<sup>D</sup> are obtained using Espresso
- there is always the same sequence:  $A_{CMOS} \ge A_{diode} \ge A_{4-terminal} \ge A_{opt.4-terminal}$

| Benchmark | CMOS | Diode | 4-<br>Terminal | Optimal 4-<br>Terminal |
|-----------|------|-------|----------------|------------------------|
| Alu 0     | 30   | 18    | 6              | 6                      |
| Alu 1     | 30   | 18    | 6              | 6                      |
| Alu 2     | 30   | 18    | 6              | 6                      |
| Alu 3     | 30   | 18    | 6              | 6                      |
| B12 0     | 80   | 32    | 24             | 12                     |
| B12 1     | 120  | 70    | 35             | 16                     |
| B12 3     | 30   | 20    | 8              | 8                      |
| B12 4     | 42   | 28    | 8              | 8                      |
| B12 6     | 132  | 77    | 35             | 18                     |
| B12 7     | 110  | 66    | 24             | 18                     |
| B12 8     | 90   | 70    | 14             | 14                     |
| C17 0     | 36   | 18    | 9              | 6                      |
| C17 1     | 30   | 20    | 8              | 8                      |
| Clpl 0    | 64   | 32    | 16             | 12                     |
| Clpl 1    | 36   | 18    | 9              | 9                      |
| Clpl 2    | 16   | 8     | 4              | 4                      |
| Clpl 3    | 144  | 72    | 36             | 18                     |
| Clpl 4    | 100  | 50    | 25             | 15                     |
| Dc1 1     | 25   | 10    | 6              | 6                      |

### Lattices and decomposition methods

- decompose a function into some sub-functions
- implement the decomposed blocks with
  - separate lattices
  - separate regions in a single lattice
- we focus on an extended form of Shannon co-factoring: P-circuits

A *P*-circuit of a completely specified function f is the circuit P(f) denoted by the expression:

$$Pcircuit(f) = (\overline{x}_i \oplus p) f^{=} + (x_i \oplus p) f^{\neq} + f^{\downarrow}$$

where:

### **P-circuits**

The sub-functions  $f^{=}$ ,  $f^{\neq}$  and  $f^{I}$ :

- depends on n-1 variables
- have a smaller on-set
- the synthesized lattice can be smaller

the lattice for f built composing the lattices of  $f^{=}$ ,  $f^{\neq}$  and  $f^{I}$  could be smaller.

The experiments shows that:

- in the 30% of cases the decomposed lattice area is **smaller**
- the average gain is at least 20%

#### Future works:

- more complex types of decomposition
- with more expressive projection *p*
- using *D-reducible* functions
  - $f = \chi_A \cdot f_A$
  - A is the affine space
  - $\chi_A$  is the charateristic function
  - *f*<sub>A</sub> is the projection of *f* onto *A*

> < = >

- Nano-wire construction methods are high sensitive to design variation, defects and environmental factors
- structures with higher defect rates than CMOS (15% faulty components)

State-of-art defect and fault-tolerant techniques are not suitable because are based on small defect rates

We will investigate on reconfiguration and redundancy approaches

- · combination of temporal and structural redundancy
- built-in repair circuitry
- system-level adaptation techniques

> < = >

### Built-in variation, defect and fault tolerance

Self assembly process reduce costs but at the expense of the control of the process

We propose to integrate:

- defect tolerance  $\rightarrow$  improve the yield
- fault tolerance  $\rightarrow$  lifetime reliability
- variation tolerance  $\rightarrow$  predictability & performance

Application-dependent test and diagnosis techniques:

- BIST: built-in-self-test
- BISD: built-in-self-diagnosis
- BISM: built-in-self-mapping
- BISR: built-in-self-repair

The goal is to bypass defective resources using test and diagnosis information.



## Application-Independent Defect Tolerant Flow

### application-dependent

- Defective elements are stored in defect map and avoided
- Not suitable for high-volume production of nano-chips

### defect-unaware design flow

- defect tolerance is performed once
- same set of resources are used for all applications
- all design steps are independent from the defects location
- universal defect-free subset of resources
- defect-free chip subset called design view
- defect-free subset are application independent
- final mapping phase with very low complexity

### The goal

Find **defect-free**  $k \times k$  crossbars inside the original **partially-defective**  $n \times n$ 

- **1** Choose size of k: is correlated to the the manufacturing yield
  - if a  $n \times n$  crossbar contains a  $k \times k$  defect-free crossbar is usable
- **2** during **physical-design** the design is mapped in  $k \times k$  crossbars
- **3** A **defect-aware mapping** step re-map the used resources within  $k \times k$  crossbar inside the  $n \times n$  crossbar
- **(**) Is stored only **the location** of the  $k \times k$  crossbar, not the information about the single crosspoint
- **5** The size of defect map is reduced from  $O(n^2)$  to O(n)

Integrating a **new technology** into semiconductor industry is a long road, it is necessary:

- device performances & manufacturability
- advanced research, technology development and industry-compliant implementation

Emerging nano-technologies have:

- ultimate integration density
- manufacturing and integration  $\rightarrow$  cost-reduction
- reduction of **power consumption**

This project is focused on filling the gap of the emerging technologies in:

- extending the electronic design automation (EDA) flow
- novel computer architecture systems

くぼう くほう くほう