








(Figure from Stanford UFLDL Tutorial)
(Figure from Yann LeCun)




An online interactive CNN visualization example:

But not all data reside in a regular, array-like space. In particular, networks (graphs) do not.

Examples of networks, or graph-based data structures:



Generalized convolution
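A minimal sketch of one common way to generalize convolution to graphs (GCN-style neighborhood aggregation); the adjacency matrix `A`, features `H`, weights `W`, and the function name `graph_conv` below are illustrative assumptions, not a specific implementation from the references:

```python
import numpy as np

def graph_conv(A, H, W):
    """One generalized-convolution layer on a graph.

    A : (n, n) adjacency matrix
    H : (n, d_in) node features
    W : (d_in, d_out) shared learnable weights

    Each node aggregates its neighbors (and itself) with degree
    normalization, then applies a shared linear map -- the graph
    analogue of sliding a shared filter over a regular grid.
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)    # aggregate, transform, ReLU

# Toy usage: a 4-node path graph with 3-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 3)
W = np.random.randn(3, 2)
print(graph_conv(A, H, W).shape)  # (4, 2)
```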














(Figure from Alex Graves)
Mathematical details: assume $\norm{\ppf{\vh_{\tau+1}}{\vh_\tau}}\approx \alpha$ for every step $\tau$. Then
$$ \begin{align} \norm{\ppf{\cL}{\vh_t}} &\propto \norm{\ppf{\vh_{t+1}}{\vh_t}\ppf{\cL}{\vh_{t+1}}} \propto \norm{ \left(\prod_{\tau=t}^{T-1}\ppf{\vh_{\tau+1}}{\vh_\tau}\right) \ppf{\cL}{\vh_T}} \leq \prod_{\tau=t}^{T-1}\norm{\ppf{\vh_{\tau+1}}{\vh_\tau}} \norm{\ppf{\cL}{\vh_T}} \\ \Rightarrow \norm{\ppf{\cL}{\vh_t}} &\approx \alpha^{T-t} \norm{\ppf{\cL}{\vh_T}} \end{align} $$
At very old steps, i.e. when $T\gg t$, the factor $\alpha^{T-t}$ either vanishes exponentially ($\alpha<1$) or explodes exponentially ($\alpha>1$); for instance, $\alpha=0.9$ and $T-t=100$ give $\alpha^{T-t}\approx 2.7\times 10^{-5}$. This is the vanishing/exploding gradient problem of RNNs.
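A toy numerical check of this estimate (everything below is illustrative, assuming a constant Jacobian $\ppf{\vh_{\tau+1}}{\vh_\tau} = \alpha I$):

```python
import numpy as np

def backprop_norm(alpha, steps, dim=8, seed=0):
    """Back-propagate a gradient through `steps` identical Jacobians
    with norm `alpha`; its norm scales like alpha**steps."""
    rng = np.random.default_rng(seed)
    J = alpha * np.eye(dim)          # Jacobian dh_{tau+1}/dh_tau (assumed constant)
    g = rng.standard_normal(dim)     # gradient dL/dh_T at the last step
    for _ in range(steps):
        g = J.T @ g                  # chain rule, one step back in time
    return np.linalg.norm(g)

g0 = backprop_norm(alpha=0.9, steps=0)
print(backprop_norm(0.9, 100) / g0)  # ~ 0.9**100 ≈ 2.7e-5  (vanishing)
print(backprop_norm(1.1, 100) / g0)  # ~ 1.1**100 ≈ 1.4e+4  (exploding)
```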
(Figure from Alex Graves)
The vanishing gradient problem turns out to be universal in deep learning, e.g. in CNN architectures.
This leads to the skip connection technique (which is now standard) and the Residual Network (ResNet).

Ref: Deep Residual Learning for Image Recognition, arXiv 1512.03385 (100k+ citations now...)
The ResNet essentially makes the following change: $$ \vx^{(j+1)} = \vF(\vx^{(j)}) \quad\Rightarrow\quad \vx^{(j+1)} = \vx^{(j)} + \vF(\vx^{(j)}) $$ to provide a "bypass" for the back-propagation.
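A minimal sketch of this change in code; here the residual branch $\vF$ is taken to be a small two-layer MLP for illustration (the actual ResNet uses convolutional blocks), and the weights are hypothetical:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """One residual block: x_{j+1} = x_j + F(x_j).

    F is a small two-layer MLP here for illustration. The essential
    change is the skip connection x + F(x): gradients can always flow
    back through the identity path, bypassing F.
    """
    F_x = relu(x @ W1) @ W2   # the residual branch F(x)
    return x + F_x            # the "bypass" / skip connection

# Toy usage: stack a few blocks with (hypothetical) random weights.
d = 16
x = np.random.randn(d)
for _ in range(5):
    W1, W2 = np.random.randn(d, d), 0.1 * np.random.randn(d, d)
    x = residual_block(x, W1, W2)
print(x.shape)  # (16,)
```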

Recall a first-order ordinary differential equation (ODE) $$ \dot\vx = \vf(\vx),\quad \vx(0)=\vx_0,\quad t\in[0,T]. $$ The forward Euler method with step size $\Delta t$ and $\vx^{j}\approx\vx(j\Delta t)$ reads $$ \vx^{j+1} = \vx^{j} + \Delta t\,\vf(\vx^{j}). $$ The update $\Delta t\,\vf(\vx^{j})$ plays exactly the role of the ResNet block $\vF(\vx^{(j)})$ on the previous slide: a ResNet can be read as a forward Euler discretization of an ODE.
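A short sketch of this correspondence with an illustrative hand-picked vector field $\vf$ (the names `f` and `forward_euler` are assumptions for this example): each Euler step below has exactly the form of a residual block with $\vF = \Delta t\,\vf$.

```python
import numpy as np

def f(x):
    """A hand-picked vector field dx/dt = f(x): slow decay plus rotation."""
    A = np.array([[-0.1, -1.0],
                  [ 1.0, -0.1]])
    return A @ x

def forward_euler(x0, dt, n_steps):
    """Integrate dx/dt = f(x); every step is a 'residual block' x + dt*f(x)."""
    x = x0
    for _ in range(n_steps):
        x = x + dt * f(x)   # x_{j+1} = x_j + F(x_j) with F = dt * f
    return x

x0 = np.array([1.0, 0.0])
print(forward_euler(x0, dt=0.01, n_steps=500))  # approximates x(T) at T = 5.0
```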

Ref: Neural Ordinary Differential Equations, arXiv 1806.07366
But ...
There are, of course, many ... For example:
And in the bigger picture: