Universal analog computation on programmable nanophotonic integrated circuits

Sunil Pai

PhD Defense, Electrical Engineering

Advisor: Olav Solgaard

Co-advisors: David AB Miller, Shanhui Fan

March 28, 2022

Integrated photonics

based on measurements from experiment

Photonics: "engineering light"

  1. Generation: laser, LED
  2. Detection: photodetector, camera
  3. Control: modulator, waveguide

Fiber optics

Our photonic integrated circuit

Integrated photonics:

  • Integrate one or more of the above at scale on a chip
  • Ex.: silicon (Si) waveguide "wires" guide light (1550 nm)
  • Energy-efficient and mass-manufactured

Si waveguides

 made in blender

guides light

(like a wire guides electricity)

Programmable photonics

Programmable photonics allows us to shape and manipulate light programmatically.

Programmable bulk optics

Programmable chip-scale photonics

Some projectors use these!

Chip-scale light manipulation

3D holography / augmented reality

Also does math (matrix-vector multiply)

8 mm long

Bulk optical computing

Spatial light modulator

miniaturize

Electronoobs, Youtube

Holoeye Photonics

Programmable photonics

H tree circuit

A photonic binary tree in an H-tree configuration can generate a 2D image

N = \frac{A_T A_R}{L^2 \lambda^2}

(\(A_T\), \(A_R\): transmitter/receiver aperture areas; \(L\): separation distance)

Miller, Attojoule Optoelectronics..., JLT, 2016

Programmable photonics model

Machine learning / AI (this talk)

Cryptography / blockchain

Quantum computing

LIDAR (self-driving cars)

Telecom (LiFi, optical phased arrays)

Imaging and biochemical sensing

Augmented / virtual reality

All of these applications follow this model:

Programmable photonics applications

 This talk: how can we perform these applications in a scalable manner in the presence of error?

Photonic matrix multiply

Sensing and communications

Machine learning / AI * (this talk)

Cryptography / blockchain *

Quantum computing

Our experimental focus *

LIDAR (self-driving cars)

Telecom (LiFi, optical phased arrays)

Imaging and biochemical sensing

Augmented / virtual reality

Outline

  1. Nanophotonic processors
     
  2. Error tolerance in binary tree meshes
     
  3. Photonic machine learning
     
  4. Conclusion
     
  5. Questions
     
  6. Reflections and acknowledgements


Nanophotonic processors

How does this tech work?

What do we have in the lab?

Waveguides

Waveguides guide / confine light

Light is a wave

1D

2D

source

Oxide

Silicon

2D waveguide simulation solving Maxwell's equations

(using 2D FDFD)

Field value

Waveguides and modes

Waveguides guide / confine light

Oxide

Silicon

Degrees of freedom:

  1. Power \(p \in \mathbb{R}^+\)
  2. Phase \(\theta \in [0, 2\pi)\)

\(\theta(x, t) = k_x x - \omega t\)

 

Field and power:

  1. Field:
    \(h = \sqrt{p}e^{i\theta(x, t)}\Psi(y, z)\)
  2. Power density:
    \(s = p|\Psi(y, z)|^2\)

 

Phasor: \(\sqrt{p}e^{i\theta}\)

\(\Psi(y, z)\)

\(\sqrt{p}e^{i\theta}\)

Waveguide mode \(\to\) complex number

Degrees of freedom:

  1. Power \(p \in \mathbb{R}^+\)
  2. Phase \(\theta \in [0, 2\pi)\)

 

Phasor field: \(x = \sqrt{p}e^{i\theta}\)

 

Assumptions:

Coherent, single wavelength (\(\lambda\))

Single mode waveguide

3D

\(x = \sqrt{p}e^{i\theta}\)

monitor

source

Oxide

Silicon

  1. Light (an electromagnetic field) represented using complex numbers \(x\) (phase, power).
     
  2. Drop the waveguide mode (electric or magnetic "field profile") of the light field.
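The phasor picture above can be sketched numerically. A minimal NumPy example (illustrative only) encodes a mode as a complex number from its power and phase, dropping the transverse mode profile \(\Psi(y, z)\):

```python
import numpy as np

# Encode a waveguide mode as a phasor x = sqrt(p) * exp(i*theta),
# dropping the transverse mode profile Psi(y, z).
def phasor(power, phase):
    return np.sqrt(power) * np.exp(1j * phase)

x = phasor(power=2.0, phase=np.pi / 4)

# Power and phase are recovered directly from the complex representation.
recovered_power = np.abs(x) ** 2
recovered_phase = np.angle(x)
```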

"real part"

A 50/50 coupler/splitter

50/50 coupler and splitter

Oxide

Silicon

Differential phase \(\theta\)

Interaction

Coherent light interference or coupling depends on differential phase \(\theta\)

Same power for modes \(\bm{x}\) and \(\bm{y}\).

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

Interaction

A 50/50 coupler/splitter

50/50 coupler and splitter

Oxide

Silicon

Differential phase \(\theta\)

Interaction

\(\color{darkblue}\begin{bmatrix}y_1 \\ y_2 \end{bmatrix} \color{black} = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & i \\ i & 1 \end{bmatrix}\color{darkred}\begin{bmatrix}x_1 \\ x_2 \end{bmatrix}\)

\(\color{darkred} \begin{bmatrix}x_1 \\ x_2 \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} e^{i\theta} \\ 1 \end{bmatrix}\)
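The coupler operator and input phasor above can be checked numerically; this sketch (plain NumPy) verifies unitarity and shows how the output powers depend only on the differential phase \(\theta\):

```python
import numpy as np

# 50/50 coupler operator from the slide: y = C x
C = np.array([[1, 1j], [1j, 1]]) / np.sqrt(2)

# The coupler is unitary, so optical power is conserved: ||y|| = ||x||.
assert np.allclose(C.conj().T @ C, np.eye(2))

# Interference depends on the differential phase theta of the two inputs.
def output_powers(theta):
    x = np.array([np.exp(1j * theta), 1]) / np.sqrt(2)
    y = C @ x
    return np.abs(y) ** 2

# theta = pi/2 routes all power to the top port.
assert np.allclose(output_powers(np.pi / 2), [1, 0])
```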

Phasor:

Device operator:

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

Coupled modes

Split ratio: \(\sin^2(\Delta \beta L_{\mathrm{int}})\)

\(L_{\mathrm{int}}\)

Thermal (resistive heating)

Oxide

Silicon

TiN

Heater electrodes (+ / −) generate heat (~50 mW)

Silicon index increases with heat

\(\sqrt{p} e^{i\theta} \to \sqrt{p} e^{i(\theta + \color{darkgreen}\delta \theta \color{black})}\)

\(\color{darkgreen}\delta \theta \color{black} \propto \Delta T L_{\mathrm{PS}}\)

Programmable phase shifter

Apply voltage, achieve phase shift

Phase changes but power stays the same

MZI generator

Set \(\theta, \phi\)

\(\boldsymbol{y} = \left(e^{i\phi}\sin \frac{\theta}{2}, \cos \frac{\theta}{2}\right)\)

Assume \(\bm{x} = (0, 1)\)

(MZI = Mach-Zehnder interferometer)

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

\(\theta\)

\(\phi\)

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

\(\theta\)

\(\phi\)

\(\bm{y} = U_2 \bm{x}\)

\(2 \times 2\)

\(2\)-vector

\(2\)-vector

An MZI controls where light goes.

Sweep \(\theta, \phi\) to "nullify" top power

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

\(\theta\)

\(\phi\)

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

\(\theta\)

\(\phi\)

We can deduce \(\boldsymbol{x} = \left(e^{-i\phi}\sin \frac{\theta}{2}, \cos \frac{\theta}{2}\right)\)

\(2 \times 2\)

\(2\)-vector

\(2\)-vector

\(\bm{y} = U_2 \bm{x}\)

MZI analyzer

An MZI controls where light goes.

MZI operator

\(\color{darkblue}\begin{bmatrix}y_1 \\ y_2 \end{bmatrix} \color{black} = i \begin{bmatrix}e^{i\phi}\sin \frac{\theta}{2} & \cos \frac{\theta}{2} \\ e^{i\phi}\cos \frac{\theta}{2} & -\sin \frac{\theta}{2} \end{bmatrix}\color{darkred}\begin{bmatrix}x_1 \\ x_2 \end{bmatrix}\)

Any MZI (analyzer orientation) can be represented as performing the following operation:
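The operator above can be verified numerically. With the slide's parameterization, the analyzer steers the input \(\boldsymbol{x} = (e^{-i\phi}\sin\frac{\theta}{2}, \cos\frac{\theta}{2})\) entirely into one port, nullifying the other (a minimal NumPy sketch):

```python
import numpy as np

# MZI operator in analyzer orientation, parameterized by theta, phi (slide notation).
def mzi(theta, phi):
    return 1j * np.array([
        [np.exp(1j * phi) * np.sin(theta / 2), np.cos(theta / 2)],
        [np.exp(1j * phi) * np.cos(theta / 2), -np.sin(theta / 2)],
    ])

theta, phi = 0.7, 1.3
U = mzi(theta, phi)
assert np.allclose(U.conj().T @ U, np.eye(2))  # unitary: power is conserved

# x = (e^{-i phi} sin(theta/2), cos(theta/2)) exits fully from the top port;
# the bottom port is nullified.
x = np.array([np.exp(-1j * phi) * np.sin(theta / 2), np.cos(theta / 2)])
y = U @ x
assert np.isclose(np.abs(y[0]) ** 2, 1) and np.isclose(np.abs(y[1]) ** 2, 0)
```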

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

\(\theta\)

\(\phi\)

Optical interconnect (grating)

Light needs to get from a laser to the chip

This can be achieved using an optical fiber and focusing grating.

Optical I/O

10\({}^\circ\)

to chip

to chip

Fiber outputs Gaussian beam with mode field diameter (MFD) \(\sim 10 \mu \mathrm{m}\)

Building blocks

Unit cell MZI

MZI "mesh"

Build a nanophotonic processor

Interfere modes

Guide light

Control phase

\(\boldsymbol{y}\)

\(\boldsymbol{x}\)

\(\theta\)

\(\phi\)

Setup or measure mode pair

Arbitrarily reshape light

Optical I/O

Mesh 3D with gratings

Mesh GDS (KLayout)

Photonic matrix multiply

Energy conservation:

\(\|\bm{y}\| = \|U\bm{x}\| = \|\bm{x}\|\)

\(P = \|\bm{y}\|^2 = \|\bm{x}\|^2\)

Why is this useful?

  1. Universality: program arbitrary \(U\) to transform optical fields
  2. Low energy consumption: photons interact weakly with matter
  3. Fast: information travels the fastest physically possible in silicon

\(U\) is a unitary matrix

MZI binary trees

Generate any \(N\)-mode "image"

Analyze any \(N\)-mode "image"

Binary tree: all devices connected to one "root" MZI node

Recursive definition of a generator

\boldsymbol{x} := \begin{bmatrix} \color{darkred}\cos\left(\frac{\theta}{2}\right) \color{darkgreen} e^{i \phi} \color{darkblue} \boldsymbol{v}_{N_1} \\ \color{darkred}\sin\left(\frac{\theta}{2}\right) \color{darkorange}\boldsymbol{v}_{N_2} \end{bmatrix}

node


Recursive definition

Example \(N = 8\)

Generalize from 2 to \(N\) modes

Self-configuration is model-free

Balanced (depth 3)

Unbalanced (depth 7)

Model free = self-corrects any component errors

\(N - 1\) nullifications

Balanced: \(\log_2 N\) steps

Unbalanced: \(N\) steps

Self-configuration of a triangular mesh

"Self-configure" any orthogonal basis set or unitary \(U\) in a universal mesh.

Result:

Mode orthogonality: a mode cannot be a linear combination (weighted sum) of other orthogonal modes.
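Self-configuration by nullification can be sketched at the single-node level: sweep \((\theta, \phi)\) using only power measurements, no model of the input (a brute-force grid sweep stands in here for the experimental sweep):

```python
import numpy as np

def mzi(theta, phi):
    return 1j * np.array([
        [np.exp(1j * phi) * np.sin(theta / 2), np.cos(theta / 2)],
        [np.exp(1j * phi) * np.cos(theta / 2), -np.sin(theta / 2)],
    ])

# Model-free nullification: sweep (theta, phi) to minimize the power at one
# output port, using only power measurements (no knowledge of the input x).
def nullify(x, steps=200):
    thetas = np.linspace(0, np.pi, steps)
    phis = np.linspace(0, 2 * np.pi, steps)
    return min(((t, p) for t in thetas for p in phis),
               key=lambda tp: np.abs(mzi(*tp) @ x)[1] ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=2) + 1j * rng.normal(size=2)
x /= np.linalg.norm(x)
theta, phi = nullify(x)
y = mzi(theta, phi) @ x
# After nullification, nearly all power is routed to the monitored port.
assert np.abs(y[1]) ** 2 < 1e-2 and np.abs(y[0]) ** 2 > 0.99
```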

Unbalanced tree cascade

Unitary matrix

\boldsymbol{u}_n^*
U \boldsymbol{u}_n^*

Cascaded binary trees: cascade analyzers to construct any unitary matrix

Key attributes:

  1. New unitary factorization!
  2. A change-of-basis for optical modes
  3. Model-free configuration
U

Generalization: binary tree cascade

\bm{u}_2
\bm{u}_8
\vdots

Why is this useful?

In some cases, we can express a problem in terms of a subset of \(M < N\) modes.

Defining fewer modes

Low rank SVD has connections to principal components analysis (PCA) / dimensionality reduction

We can express an arbitrary matrix (not just unitary) using a singular value decomposition (SVD):

Full rank SVD

Low rank SVD
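The low-rank SVD idea can be shown in a few lines of NumPy: truncating to the top \(M\) singular values gives the best rank-\(M\) approximation (Eckart-Young), with error equal to the first discarded singular value:

```python
import numpy as np

# SVD expresses an arbitrary matrix as A = U diag(s) V^dagger.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))
U, s, Vh = np.linalg.svd(A)

M = 3  # keep a subset of M < N modes
A_lowrank = U[:, :M] @ np.diag(s[:M]) @ Vh[:M, :]
assert np.linalg.matrix_rank(A_lowrank) == M

# Spectral-norm truncation error equals the first discarded singular value.
err = np.linalg.norm(A - A_lowrank, ord=2)
assert np.isclose(err, s[M])
```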

Optical SVD

  • Control: heat material to change speed of light (> 100 Hz)
  • Generation: tunable laser (around 1550 nm), fiber array to focusing grating I/O
  • Detection: grating tap monitors and a camera (100 Hz)
  • Package: wirebonds to PCB / DAC for phase shifter control

Our \(6 \times 6\) triangular mesh

Our optical rig

Key features:

  1. Optical I/O:
    • Polarization controllers
    • 6-axis stage for fiber array
    • Fiber switch (bidirectional operation)
  2. Measurement:
    • Vis/IR microscope (currently used)
    • Output single photodetector
    • Movable stage to image grating spots over large area
  3. Control: NI PXIe unit for phase shifters
  4. Thermal stabilizer via thermoelectric cooler / thermistor sensor on PCB

General "feedforward" mesh definition

Pai et al. Parallel programming... IEEE JSTQE, 2020

"Breadth-first search" to arrange MZIs into columns

Outline

  1. Nanophotonic processors
     
  2. Error tolerance in binary tree meshes
     
  3. Photonic machine learning
     
  4. Conclusion
     
  5. Questions
     
  6. Reflections and acknowledgements

Error tolerance in binary tree meshes

In preparation

My theoretical contribution

Error model of a binary tree "vector unit"

\epsilon^2 = \|\boldsymbol{x} - \widehat{\boldsymbol{x}}\|^2

Device

Ideal

Linear optical error:

Types of systematic error:

  • Constant wavelength "error" (\(\delta\lambda \to \) \(\delta\theta\)), i.e. bandwidth
  • Random coupling error \(\delta \sim \mathcal{N}(0, \sigma)\), phase error \(\delta\theta \sim \mathcal{N}(0, \sigma_\theta)\)
  • We assume lossless device for this analysis

Goal: describe how component errors relate to overall error \(\epsilon^2\).

Unbalanced

Balanced

Architecture dependence

vs

Causes include: calibration error, environmental perturbations (humidity, thermal drift, etc.)

\epsilon^2(\boldsymbol{\Delta}) = \epsilon^2(\boldsymbol{0}) + \boldsymbol{\Delta}^T \frac{\partial \epsilon^2}{\partial \boldsymbol{\Delta}} + \frac{1}{2}\boldsymbol{\Delta}^T \mathcal{H}_{\epsilon^2} \boldsymbol{\Delta} + \cdots
\boldsymbol{\Delta} := [\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}, \boldsymbol{\phi} - \widehat{\boldsymbol{\phi}}] := \boldsymbol{\eta} - \boldsymbol{\eta}'

Phase error \(\Delta\) centered at 0:

\epsilon^2(\boldsymbol{\Delta}) = \|\boldsymbol{x}(\boldsymbol{\theta}, \boldsymbol{\phi}) - \widehat{\boldsymbol{x}}(\widehat{\boldsymbol{\theta}}, \widehat{\boldsymbol{\phi}})\|^2

Device

Ideal

Error function:

Error model of a vector unit

\(\mathcal{H}_{\epsilon^2}\) is known as the Hessian.

Hessian: \(\mathcal{H}_{\epsilon^2, ij} = \frac{\partial^2 \langle \epsilon^2 \rangle}{\partial \eta_i \partial \eta_j}\)

\mathcal{H}_{\epsilon^2}

Diagonal terms: affect uncorrelated/random errors

Off-diagonal terms: affect mostly constant errors

Set up the foundation for new Hessian error theory of photonic mesh networks.

Error model of a vector unit

Uncorrelated / random errors (e.g. phase error):

Only the diagonal terms \(\mathcal{H}_{\theta\theta}\) contribute.

\(E[\delta_i\delta_j] = 0\) and \(E[\delta_i^2] := \sigma_i^2\). Errors add in quadrature.

 

Correlated errors (e.g. bandwidth):

Affected by the entire Hessian (\(E[\delta_i\delta_j] \neq E[\delta_i]E[\delta_j]\), i.e. the cross terms \(\delta_i\delta_j\) do not vanish).

\epsilon^2(\boldsymbol{\Delta}) \approx \frac{1}{2}\boldsymbol{\Delta}^T \mathcal{H}_{\epsilon^2} \boldsymbol{\Delta} = \sum_{ij}\mathcal{H}_{ij} \delta_i\delta_j
\mathcal{H}_{\epsilon^2}
\epsilon^2(\boldsymbol{\Delta}) = \epsilon^2(\boldsymbol{0}) + \boldsymbol{\Delta}^T \frac{\partial \epsilon^2}{\partial \boldsymbol{\Delta}} + \frac{1}{2}\boldsymbol{\Delta}^T \mathcal{H}_{\epsilon^2} \boldsymbol{\Delta} + \cdots
\boldsymbol{\Delta} := [\boldsymbol{\theta} - \widehat{\boldsymbol{\theta}}, \boldsymbol{\phi} - \widehat{\boldsymbol{\phi}}] := \boldsymbol{\eta} - \boldsymbol{\eta}'

Phase perturbation \(\Delta\) centered at 0:

\epsilon^2(\boldsymbol{\Delta}) = \|\boldsymbol{v}(\boldsymbol{\theta}, \boldsymbol{\phi}) - \widehat{\boldsymbol{v}}(\widehat{\boldsymbol{\theta}}, \widehat{\boldsymbol{\phi}})\|^2

Device

Ideal

Error function

Error model of a vector unit

Correlation analysis

Correlation diagram

0th order

1st order

2nd order

Sensitivity \(\mathcal{H}_{\theta\theta} = p_\theta\), power through phase shift \(\theta\)

Sensitivity analysis (in general)

\epsilon^2(\delta \theta) = \|\boldsymbol{y}(\theta) - \widehat{\boldsymbol{y}}(\widehat{\theta})\|^2

Perturb one phase shift \(\theta\):

\(\mathcal{H}_{\theta\theta} = \frac{\epsilon^2(\delta\theta)}{\delta \theta^2} \Bigg|_{\delta\theta = 0}\)

Define sensitivity:

For any "feedforward" optical device with I/O \(\bm{x}, \bm{y}\):

Sensitivity diagrammatic proof

\begin{aligned} \epsilon^2(\delta \theta) &= 2 - 2 \mathcal{R}(\boldsymbol{y}^\dagger \hat{\boldsymbol{y}}) \\ &= 2 - 2 \mathcal{R}(\boldsymbol{y}_\theta^\dagger P_{\delta \theta} \boldsymbol{y}_\theta) \\ &= 2 p_\theta (1 - \cos\delta \theta) \approx p_\theta \delta \theta^2 \end{aligned}
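The result \(\epsilon^2 \approx p_\theta \delta\theta^2\) can be checked numerically: phase-shift a fraction \(p_\theta\) of the total power by \(\delta\theta\) and compare (a minimal sketch):

```python
import numpy as np

# Perturb one phase shifter carrying a fraction p_theta of the total power:
# eps^2 = 2 p_theta (1 - cos(dtheta)) ~ p_theta * dtheta^2 for small dtheta.
def eps2(p_theta, dtheta):
    y = np.array([np.sqrt(p_theta), np.sqrt(1 - p_theta)])
    y_hat = y * np.array([np.exp(1j * dtheta), 1])
    return np.linalg.norm(y - y_hat) ** 2

p, d = 0.3, 1e-3
assert np.isclose(eps2(p, d), p * d**2, rtol=1e-5)
```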

Sensitivity analysis in our system

Balanced power distribution \(\to\) more robust

Assume average: \(N\) uniform powers

Total power \(P = 1\)

Unbalanced \(Np_\theta\)

Balanced \(Np_\theta\)

Balanced trees are more robust while having the same number of components.

Gamma distribution

Assume Gaussian random inputs

x_n \sim \mathcal{N}(0, 0.5) + i \mathcal{N}(0, 0.5)
|x_n|^2 \sim \mathrm{Gamma}(1)
\mathcal{P}_{\mathrm{\Gamma}_{N}}(p) = \frac{p^{N - 1}e^{-p}}{\Gamma(N)}

Power in the waveguide spanning \(N\) outputs in its subtree follows this distribution assuming standard random input.

\mathcal{P}(x_n) = \frac{e^{-a_n^2 - b_n^2}}{\pi}
\mathcal{P}(y_n, \varphi) = e^{-y_n}
x_n = a_n + i b_n

Beta distribution

\mathcal{P}_{\mathrm{B}_{N_2}^{N_1}}(s) = \frac{s^{N_1 - 1}(1 - s)^{N_2 - 1}}{\mathrm{B}(N_1, N_2)}

Relative power in waveguide spanning \(N_1\) outputs in its subtree for  \(N_2 = N - N_1\).

Random variables flow like light

Outline of robustness proof

Unbalanced \(Np_\theta\)

Balanced \(Np_\theta\)

\langle \epsilon^2 \rangle \propto \sigma^2 \sum\limits_{\theta} p_\theta, \qquad \sum\limits_{\theta} p_\theta = \begin{cases} \log_2 N & \mathrm{balanced}\\ N & \mathrm{unbalanced}\\ \end{cases}
\epsilon^2(\boldsymbol{\Delta}) = \|\boldsymbol{x}(\boldsymbol{\theta}, \boldsymbol{\phi}) - \widehat{\boldsymbol{x}}(\widehat{\boldsymbol{\theta}}, \widehat{\boldsymbol{\phi}})\|^2

\(\mathcal{H}_{\theta\theta} = \frac{\epsilon^2(\delta\theta)}{\delta \theta^2} \Bigg|_{\delta\theta = 0} = p_\theta \)

Error function scales with sum of powers through waveguide segments

Total error:

Sensitivity:

sum over all phase shifts

Balanced = more robust to phase, coupling

\langle \epsilon^2 \rangle \propto \sigma^2 \sum\limits_{\theta} p_\theta, \qquad \sum\limits_{\theta} p_\theta = \begin{cases} \log_2 N & \mathrm{balanced}\\ N & \mathrm{unbalanced}\\ \end{cases}

Self-configuration corrects coupling error

Balanced trees: \(\propto \log N \sigma^2 \to \log N \sigma^4\)

Unbalanced trees: \(\propto N \sigma^2 \to N \sigma^4\)

Note: \(2^{16} = 65536\)

After self-configuration

Tens of thousands of modes feasible!

Balanced = more robust to phase, coupling

\langle \epsilon^2 \rangle \propto \begin{cases} \log_2 N \delta^2 & \mathrm{balanced}\\ N \delta^2 & \mathrm{unbalanced, coupling}\\ N^2 \delta^2 & \mathrm{unbalanced, phase}\\ \end{cases}
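The \(\log N\) vs \(N\) gap can be seen with a back-of-the-envelope sketch (assuming uniform output powers \(1/N\), a simplification of the slides' Gaussian-input analysis): sum the power carried by every phase shifter in each architecture:

```python
import numpy as np

# Sum of powers through phase shifters, assuming each of N outputs carries
# uniform power 1/N (illustrative sketch, not the full Monte Carlo proof).
def balanced_power_sum(N):  # N a power of 2
    depth = int(np.log2(N))
    # Level l has 2^l shifters, each carrying power 2^-l: each level sums to 1.
    return sum(2**l * 2.0**-l for l in range(depth))

def unbalanced_power_sum(N):
    # Shifter k in the chain carries the power of the k outputs in its subtree.
    return sum(k / N for k in range(1, N))

N = 64
assert np.isclose(balanced_power_sum(N), np.log2(N))       # ~ log2 N
assert np.isclose(unbalanced_power_sum(N), (N - 1) / 2)    # ~ N/2, i.e. O(N)
```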

Balanced vs unbalanced Hessian

Balanced trees are affected less by correlated error compared to unbalanced trees.

Unbalanced \(\mathcal{H}_{\epsilon^2}\)

Balanced \(\mathcal{H}_{\epsilon^2}\)

Balanced trees more robust (Hessian)

Binary tree cascade error scaling

Number of trees (\(M\), rank) reduces the performance gap.

Balanced architectures go from \(\log N \to N\) error scaling as \(M \to N\).

\epsilon_{N, M}^2 = \sum_{m = 1}^M\frac{\|\boldsymbol{x}_m - \hat{\boldsymbol{x}}_m\|^2}{M}

Constant sqrt sensitivity \(\epsilon_{N, M} / \delta\)

Random error sqrt sensitivity \(\epsilon_{N, M} / \sigma\)

Balanced trees

  • \(N - 1\) nodes
  • Self-configurable
  • Broadband
  • Faster to program
    • \(\log N\) steps
  • More error tolerant
    • \(\log N\) depth
  • Poor cascadability

Unbalanced trees

  • \(N - 1\) nodes
  • Self-configurable
  • Narrowband
  • Slower to program
    • \(N\) steps
  • Less error tolerant
    • \(N\) depth
  •  Compact cascadability

Summary

A new sensitivity theory

\langle \epsilon^2 \rangle \propto \sum\limits_{\theta} p_\theta = \begin{cases} \log_2 N & \mathrm{balanced}\\ N & \mathrm{unbalanced}\\ \end{cases}

A new cascade architecture

Balanced cascades outperform unbalanced cascades for small \(M\)

Outline

  1. Nanophotonic processors
     
  2. Error tolerance in binary tree meshes
     
  3. Photonic machine learning
     
  4. Conclusion
     
  5. Questions
     
  6. Reflections and acknowledgements

Photonic machine learning

In preparation

My experimental contribution

Photonic neural networks

Need in commercial AI:

  • Increasing energy budget (3-4 month doubling time, OpenAI 2018)
  • Transistor density limit (Moore's law)

Advantages of photonic mesh

  • Energy-efficient photonic matrix multiply
  • Scalable up to hundreds of modes (limited by error and loss)

Data

Photonic neural network engine

Facial recognition

Self-driving car

Recommendations

Chat-bot

(Photonic neural net = PNN)

Intelligent response

Hybrid PNN inference task

Cost function: \(\mathcal{L}(\bm{y}, \widehat\bm{y})\)

Desired labels: \(\bm{y}\)

Probabilities: \(\widehat\bm{y}\)

Data

\(\color{darkblue}\bm{y}^{(\ell)} \color{black}= \color{darkgreen}U^{(\ell)}\color{darkred}\bm{x}^{(\ell)}\)

On-chip

\(\color{darkred}\bm{x}^{(\ell + 1)} \color{gray}= f^{(\ell)}(\color{darkblue}\bm{y}^{(\ell)}\color{gray})\)

Off-chip
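The hybrid on-chip/off-chip loop above can be sketched in NumPy (illustrative only; random unitaries stand in for the programmed meshes, and the modulus stands in for the electro-optic nonlinearity):

```python
import numpy as np

# Hybrid PNN forward pass: the unitary layer y = U x runs on-chip; the
# nonlinearity f runs off-chip in electronics.
def random_unitary(n, rng):
    q, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
    return q

def forward(x, unitaries, f):
    for U in unitaries:
        y = U @ x          # on-chip photonic matrix multiply
        x = f(y)           # off-chip electronic nonlinearity
    return x

rng = np.random.default_rng(0)
layers = [random_unitary(4, rng) for _ in range(3)]
out = forward(np.ones(4) / 2, layers, f=np.abs)  # modulus nonlinearity
assert out.shape == (4,)
```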

Use photonics to classify handwritten digits from 0 to 9

MNIST dataset

PNN inference simulation

\(8 \times 8\)

Electro-optic nonlinearity

Williamson et al. JSTQE 2019

In just 200 epochs (passes through the dataset) we can achieve near-perfect accuracy (98%) on MNIST

Can this training be done using photonics experiment?

Train / test split: 80 % / 20 %

 

Update method: "Adam" update

Training method: Batch gradient descent

Model training

PNN training simulation

Output measurement task

Experimental inference results on our chip

\(\mathcal{L}(\boldsymbol{x}) = \mathrm{softmax\ cross\ entropy}\left[\left|U_3\left|U_2\left|U_1 \boldsymbol{x}\right|\right|\right|^2\right]\)

\(\boldsymbol{x} = (x_1, x_2, p, p, 0), \|\boldsymbol{x}\| = 3 \)

Classification problem: Sklearn 2D point classification

Calibration protocol

To program unitaries and inputs, we need to calibrate the chip.

Calibration protocol (voltage vs phase)

Phase shifting time

0

\(\pi / 2\)

\(\pi\)

Need a fast PD to measure phase shift time (WIP)

Likely less than 1 kHz switch time

Perturbative gradient-based training

Evaluate the entire cost function twice per parameter (central difference), for \(D\) params

This is highly inefficient especially in hybrid PNNs (our use case)

\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{\mathcal{L}(\theta + \delta\theta / 2) - \mathcal{L}(\theta - \delta \theta / 2)}{\delta \theta}

Perturbative gradients = numerical differentiation

\(D\) can be in the millions in modern neural nets
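The perturbative scheme above amounts to numerical differentiation, \(2D\) cost evaluations for \(D\) parameters (a minimal sketch with an arbitrary toy cost function):

```python
import numpy as np

# Perturbative (finite-difference) gradient: two cost evaluations per
# parameter, so 2*D evaluations total for D parameters.
def numerical_grad(cost, params, dtheta=1e-5):
    grad = np.zeros_like(params)
    for i in range(len(params)):
        dp = np.zeros_like(params)
        dp[i] = dtheta / 2
        grad[i] = (cost(params + dp) - cost(params - dp)) / dtheta
    return grad

cost = lambda p: np.sum(np.sin(p) ** 2)  # toy cost for illustration
p = np.array([0.3, 1.1, 2.0])
g = numerical_grad(cost, p)
assert np.allclose(g, np.sin(2 * p), atol=1e-6)  # analytic: d/dp sin^2 p = sin 2p
```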

To date, backpropagation-based machine learning has not been experimentally demonstrated on a photonic/optical chip.

Backpropagation gradient-based training

Green bars: measure power going through phase shifter

Advantages over perturbative: modular (for hybrid PNN) and efficient (\(D\) times faster)

Backpropagation = widely used for training (fueled 2010s deep learning boom)

Backprop \(\leftrightarrow\) adjoint method inverse design

In situ backpropagation incorporates an experimental implementation of inverse design.

Gradient update: \(\frac{\partial \mathcal{L}}{\partial \epsilon} = -\mathcal{R}\left(\boldsymbol{b}_{\mathrm{aj}}^T \hat A^{-1} \frac{\partial \hat A}{\partial \epsilon} \hat A^{-1} \boldsymbol{b}\right)\)

Similarity:

Phase shifts \(\bm{\theta}\) are related to permittivity \(\bm{\epsilon}\) of the inverse design problem.

Cost \(\mathcal{L}\) is a desired mode overlap

Hughes et al., Shanhui Fan. Optica 2018.

Operator \(\hat A = (\nabla \times \nabla \times - k_0^2 \epsilon)\)

Freq-domain equation: \(\bm{b} = \hat A(\omega) \bm{e}\)

input source

field

NQP lab, Stanford

Data

Backpropagation can be applied to a hybrid multilayer system

Leverage the energy of photonics for both directions:

"Linear in optics, nonlinear in electronics"

Overall backpropagation approach

Experiment

Circle dataset (easy)

80-20 train-test split

Adam update (i.e., not SGD)

250 data points total

Autodiff powered by JAX/Haiku

Photonic backpropagation results: circle

This is the first demonstration of backpropagation in an optical chip to our knowledge

96%/93% model test/train accuracy after training

Experiment

Moons dataset (medium)

80-20 train-test split

Adam update

500 data points total

Autodiff powered by JAX/Haiku

 

Note: we use the expected (correct) phase instead of the measured phase, which carries an order of magnitude more error

Photonic backpropagation results: moon

Digital and in situ (on chip) training agree

We observe excellent agreement between simulated digital training and in situ training despite gradient error.

96%/93% model test/train accuracy after training

Phase measurement is critical for accurate gradient

Summary

We have a functioning experimental prototype of a photonic mesh

This can be used for many applications.

We choose a machine learning application evaluated on standard 2d classification problems.

Inference task: high accuracy (98% on moons dataset)

Training task: First demo of analog (in-situ, on-chip) backpropagation on an optical chip

96%/93% model test/train accuracy on noisy circle dataset

Analog update approach

Key changes:

  1. Only monitor power in final step
  2. Introduce backprop unit, \(\zeta\)
  3. Extract AC component @ \(\zeta = 0\)

Parallelize over all layers in the network.

In our setup: we "simulate" this step without a backprop unit.

Analog backprop problem

The fractional phase gradient error becomes more of an issue as the gradient itself shrinks, due to active thermal noise / drift.

Phase error means "distance from optimal value."

Mean square fractional error: \(1 - \boldsymbol{g} \cdot \hat{\boldsymbol{g}}\), where \(\boldsymbol{g}\) and \(\hat{\boldsymbol{g}}\) are normalized

\(\mathcal{L}_m = 1 - |\widehat{\bm{u}}_m^T \bm{u}^*_m|^2\), \(\bm{u}_m\) is row \(m\) of \(U\)

  1. Program \(U\)
  2. Send \(\color{darkred}\bm{u_m}^*\), measure field \(\color{darkblue}p_m\) at port \(m\).
  3. Adjoint field: send just \(\color{darkred}p_m\) back, measure \(\color{darkblue}\bm{u}_\mathrm{aj}^*\).
  4. Program \(\color{darkorange}\boldsymbol{u} + e^{i \zeta_{\mathrm{aj}}} \boldsymbol{u}_{\mathrm{aj}}^*\)

Analog vs digital

Digital update

Analog update

backprop unit

Photonic advantage

Digital: scaling, nonlinearities, elementwise ops that are \(O(N)\)*

Analog: Only linear optics \(O(N^2)\)

Do we beat digital with photonic hybrid solution?

Off-chip

On-chip

Note: Most of the energy is in analog-digital conversion

Problem: digital subtraction is costly due to A/D conversion.

Future work

  • Minibatch training:
    • training with single examples uses accurate gradients
    • but the gradients become inaccurate when training on multiple examples at a time (a minibatch)
       
  • Phase measurement:
    • We use our photonic mesh to measure relative phases.
    • This can be achieved more straightforwardly (and possibly more accurately) using a homodyne detection scheme.
       
  • Complete analog demo
    • We can implement backprop unit for a full analog demo.

Outline

  1. Nanophotonic processors
     
  2. Error tolerance in binary tree meshes
     
  3. Photonic machine learning
     
  4. Conclusion
     
  5. Questions
     
  6. Reflections and acknowledgements

Overall summary

We experimentally and theoretically explored programmable linear operations in optics.

Theory: new error theory for universal binary tree circuits

Experiment: first backprop demo

\frac{\langle \epsilon^2 \rangle}{\sigma^2} \propto \begin{cases} \log_2 N & \mathrm{balanced}\\ N & \mathrm{unbalanced}\\ \end{cases}

Key learnings

based on measurements from experiment

Our photonic integrated circuit

  • Collaboration is critical
  • Try to be a master of everything but know when to dive deep
  • Designing a photonic circuit takes three times longer than what you promise
  • In case there is a worldwide pandemic, make sure you can run experiments remotely
  • Overengineering can sometimes yield good results
    • We moved a motorized stage > 100000 times for backprop
  • Your final experiment may not be what you intend

Si waveguides

 made in blender

PhD accomplishments

The first demonstration of backpropagation in an optical chip.

New theory of error-tolerant cascaded binary tree optical devices.

In this talk (upcoming papers):

Not in this talk (upcoming and past papers):

Photonic blockchain using photonic meshes

Design/simulation/testing of MEMS phase shifters, couplers

Parallel programming of an arbitrary feedforward photonic network

Matrix optimization on universal unitary photonic devices

Google Scholar

PhD software contributions

Phox framework (work in progress)

  1. neurophox: Photonic neural networks in TF / Pytorch
  2. dphox: Automated photonic design
  3. simphox: Inverse design and photonic meshes in JAX.
  4. phox: (currently private) Lab automated photonic testing

 

Goal of the project: a full stack open source framework for programmable photonics!

 

Related paper: In preparation.

Questions?

Mesh operation

Acknowledgements

Committee

David Miller

Olav Solgaard

Shanhui Fan

Joseph Kahn

Martin Fejer

Collaborators/ Funding

Olav's glorious bike ride

"DAB Miller" (found in lab)

Coworkers

OSA

Dinner party!

SUPR retreat

Fan lab

  • Ian Williamson
  • Tyler Hughes
  • Momchil Minkov
  • Ben Bartlett
  • Beicheng Lou
  • Lingling Fan

Solgaard lab

  • Nathnael Abebe
  • Zhanghao Sun
  • Rebecca Hwang
  • Taewon Park
  • Dylan Black
  • Yu Miao
  • Annie Kroo
  • Simón Lorenzo
  • Payton Broaddus
  • Carson Valdez
  • Stephen Hamann
  • Andrew Ceballos

Halloween

Grad party

Fan group hike

Friends (decade at Stanford, c/o 2015)

Family

Thank you!

Photonic blockchain

Photonic blockchain applications

Proof of photonic computational work ensures security for:

  1. Cryptocurrency
  2. DDoS attack / malware protection
  3. Email spam filtering
  4. Voting systems

Photonic blockchain: any blockchain application that includes photonic hardware as part of the proof of work (PoW) computation.

Photonic PoW

Blockchain

Crypto

Photonic cryptocurrency

Core question: can we build a photonic blockchain technology using a systematic error-prone analog device?

Photonic crypto experiment / model

Hash error rate: \(1 - (1 - \mathrm{BER})^{256} \approx 256 \cdot \mathrm{BER}\), assuming independent bit errors
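The small-BER approximation above can be checked directly (a minimal sketch):

```python
# A hash is wrong if any of its 256 output bits flips; for independent bit
# errors the hash error rate is 1 - (1 - BER)^256 ~ 256 * BER for small BER.
def hash_error_rate(ber, bits=256):
    return 1 - (1 - ber) ** bits

ber = 1e-6
exact = hash_error_rate(ber)
approx = 256 * ber
assert abs(exact - approx) / exact < 1e-3
```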

Key assumptions

  1. Perfect singular values
  2. Most error in \(U, V^\dagger\)
  3. Systematic error dominates random error

Systematic vs random error

Systematic error examples:

  1. Loss imbalance error
  2. Coupling error
  3. Phase error / instability

 

 

Random error examples

  1. Shot noise
  2. Polarization noise
  3. Thermal noise

Error correction

Hardware-agnostic error correction:

Same energy, more footprint

Expected improvement:

Systematic error reduced by factor of \(\sqrt{R}\)

Random error stays the same
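A toy sketch of the \(\sqrt{R}\) claim, under the assumption that each redundant hardware copy carries an independent fixed (systematic) offset that averaging suppresses:

```python
import numpy as np

# Averaging R redundant copies, each with an independent fixed (systematic)
# offset, reduces the systematic error standard deviation by sqrt(R).
rng = np.random.default_rng(0)
R, trials = 16, 100_000
offsets = rng.normal(0, 1.0, (trials, R))   # per-copy systematic errors
averaged = offsets.mean(axis=1)             # hardware-redundant average

ratio = offsets[:, 0].std() / averaged.std()
assert abs(ratio - np.sqrt(R)) < 0.1
```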

Thresholding

Note: Bulk of the energy in photonic computing is in analog to digital conversion

Photonic crypto experiment

Error correction results in less output error and smaller hash error rate

Tick marks are 2 apart

Photonic crypto experiment

Dispersion analysis:

Improve time-efficiency by parallelizing computation over many wavelengths.

Only works up to some acceptable error.

Error correction

Error correction reduces the output error standard deviation.

Photonic crypto experiment (cont.)

This is only for \(N = 4\), but what about \(N > 4\)?

Photonic crypto simulation

Photonic crypto simulation

Key conclusion: Sharp boundary between feasible/infeasible

Key finding:

\(\sigma_{\mathrm{out}} \propto NK\sigma\) for phase, coupling

\(\sigma_{\mathrm{out}} \propto N^{3 / 2}K\sigma\) for loss

 

Why to increase \(N\):

Photonic advantage is higher

 

Why to increase \(K\):

Smaller footprint, more output bits