# Lecture 15: Deep CNN architectures

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/1kD3_vXFwesra2AhY-_SribsZ2a1bI82A)

```python
import datetime

# Print a timestamp recording when this notebook was last run.
now = datetime.datetime.now()
print("Version: " + now.strftime("%Y-%m-%d %H:%M:%S"))
```

Version: 2024-01-10 00:30:27


## Classical CNN architecture

### General CNN architecture

- (convolution, activation, pooling) $\times N_1$
- (fully connected layer) $\times N_2$




<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/aCNN.jpeg" alt="Drawing"  width="1100px" style="display:block; margin:auto"/>

[Credit: Geron]

### Decreasing resolution and increasing number of channels

Typical architectures decrease image resolution and increase the number of channels as we progress deeper into the network.

Decreasing image resolution (with the same convolutional kernel size) increases the size of the receptive field of neurons as we progress deeper into the network.

Increasing the number of channels (i.e. filters) provides a larger feature set (and is computationally feasible since the image resolution has decreased).
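
As a rough illustration of the receptive field argument, the sketch below tracks the receptive field size through a stack of layers using the standard recurrence (the field grows by `(kernel - 1)` times the current spacing between neighbouring outputs, measured in input pixels). The helper `receptive_field` and the layer stacks are hypothetical choices for illustration, not taken from any particular architecture.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1  # jump: spacing between neighbouring outputs, in input pixels
    sizes = []
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes


# Three blocks of (3x3 convolution, 2x2 pooling with stride 2).
with_pooling = [(3, 1), (2, 2)] * 3
# The same three 3x3 convolutions with no pooling in between.
without_pooling = [(3, 1)] * 3

print(receptive_field(with_pooling))     # [3, 4, 8, 10, 18, 22]
print(receptive_field(without_pooling))  # [3, 5, 7]
```

With the same 3x3 kernels, reducing resolution between convolutions gives a 22-pixel receptive field after three blocks, compared with 7 pixels when the resolution is kept fixed.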





#### For example VGG-16

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/vgg16.png"  width="900px" style="display:block; margin:auto"/>



Networks are becoming very deep and large, e.g. VGG-16 has 16 weight layers and approximately 138 million parameters.
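
The parameter count can be checked with a short calculation. The sketch below assumes the standard VGG-16 layout (3x3 convolutions with biases, 2x2 max pooling, a 224x224 RGB input and 1000 output classes); the layer list is written out here as an assumption rather than read from the figure above.

```python
# Assumed VGG-16 layout: numbers are output channels of 3x3 convolutions,
# "M" is a 2x2 max pooling layer that halves the resolution.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_DENSE = [4096, 4096, 1000]  # fully connected layers (1000 classes)


def count_vgg16_parameters(input_size=224, input_channels=3):
    """Return (convolutional, fully connected) parameter counts."""
    conv_params, dense_params = 0, 0
    channels, size = input_channels, input_size
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2
        else:
            conv_params += 3 * 3 * channels * layer + layer  # weights + biases
            channels = layer
    features = size * size * channels  # flattened final feature map
    for units in VGG16_DENSE:
        dense_params += features * units + units
        features = units
    return conv_params, dense_params


conv_params, dense_params = count_vgg16_parameters()
print(f"convolutional layers:   {conv_params:,}")
print(f"fully connected layers: {dense_params:,}")
print(f"total:                  {conv_params + dense_params:,}")
```

Under these assumptions most of the parameters sit in the fully connected layers rather than in the convolutions.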

Even the techniques we have discussed for training deep networks can begin to struggle.

## ResNet

ResNets (residual networks) were introduced to mitigate problems of training deep networks.

Introduce skip connections, which are a common feature of many of the latest cutting-edge deep learning architectures being developed today.

### Standard neural network block

Recall the classical neural network block.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/resnet_skip_connection_skip_removed.png" width="500px" style="display:block; margin:auto"/>

[[Credit (modified)](https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)]

### Residual block

ResNets introduce a residual block, with a connection that skips a layer and connects the activations of one layer to another layer deeper in the network.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/resnet_skip_connection.png" width="500px" style="display:block; margin:auto"/>

[[Credit](https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)]

Information can flow directly from $a^{[l]}$ to $a^{[l+2]}$, and so can more easily
flow deeper into the network.

The connection is drawn as connecting *into* the subsequent layer (rather than after it) since the connection is typically made *before* the non-linear activation function, e.g. ReLU.

### Residual connection

The residual connection involves *adding* the earlier activation, typically before the non-linear activation function is applied.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/resnet_skip_connection_internal.png" width="900px" style="display:block; margin:auto"/>

[[Credit](https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)]

### Residual network architecture

ResNet architectures are constructed by stacking residual blocks.

#### Standard architecture
<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/resnet_plain.jpg" width="700px" style="display:block; margin:auto"/>

[[Credit: He et al.](https://arxiv.org/abs/1512.03385)]

#### ResNet architecture
<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/resnet_residual.jpg" width="700px" style="display:block; margin:auto"/>

[[Credit: He et al.](https://arxiv.org/abs/1512.03385)]

In order to add activations at different levels, they must have compatible shapes.

ResNet architectures often include operations that preserve the shape of activations.  When shapes are not preserved, a suitable adjustment is made in the skip connection, e.g. downsampling.

### Why are ResNets effective?

ResNets revise the computation of the next activation as follows:
\begin{align*}
a^{[l+2]} = g(z^{[l+2]}) \rightarrow
a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) .
\end{align*}

Expanding this out in terms of the intermediate activation:
\begin{align*}
a^{[l+2]} &= g(z^{[l+2]} + a^{[l]}) \\
&= g(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}) .
\end{align*}

It is relatively easy for the network to learn $W^{[l+2]}=0$ and $b^{[l+2]}=0$ (particularly with small weight initialisation and regularisation).

Then, for ReLU, $a^{[l+2]} = g(a^{[l]}) = a^{[l]}$, since $a^{[l]}$ is itself the output of a ReLU and is therefore non-negative.

So adding additional blocks generally shouldn't hinder performance and has the potential to further improve performance (each block can learn a residual to improve performance or leave essentially unchanged).
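
The identity argument can be checked numerically. The following is a minimal sketch using fully connected layers in NumPy (an assumption made for brevity; the same reasoning applies to convolutional layers). The function `residual_block` and the layer width are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def residual_block(a_l, W1, b1, W2, b2):
    """Two-layer residual block: the skip is added before the final ReLU."""
    a_l1 = relu(W1 @ a_l + b1)   # a^{[l+1]}
    z_l2 = W2 @ a_l1 + b2        # z^{[l+2]}
    return relu(z_l2 + a_l)      # a^{[l+2]} = g(z^{[l+2]} + a^{[l]})


rng = np.random.default_rng(0)
n = 4
a_l = relu(rng.normal(size=n))   # output of an earlier ReLU, so non-negative
W1, b1 = rng.normal(size=(n, n)), rng.normal(size=n)

# With zero weights and bias in the second layer the block is the identity.
W2, b2 = np.zeros((n, n)), np.zeros(n)
a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(np.allclose(a_l2, a_l))
```

The first layer's weights are arbitrary here; they have no effect once the second layer's weights and bias are zero.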

### Performance for deep networks

Due to issues with training deep standard networks, performance generally starts to decrease if the network gets too deep.

For ResNets, increasing depth generally continues to lead to improving performance.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/resnet_training_loss.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)]

### Skip connections

The residual connection is an example of a skip connection.

In ResNets, the connection is made by *adding* activations.  Alternatively, one could also concatenate layers to allow information to flow deeper into the network more easily.
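
The difference between the two options is easiest to see in the resulting shapes. A small sketch with hypothetical feature maps of shape (height, width, channels):

```python
import numpy as np

# Hypothetical feature maps with shape (height, width, channels).
earlier = np.ones((8, 8, 16))
later = np.full((8, 8, 16), 2.0)

added = earlier + later                                    # addition
concatenated = np.concatenate([earlier, later], axis=-1)   # channel concatenation

print(added.shape)         # (8, 8, 16)
print(concatenated.shape)  # (8, 8, 32)
```

Addition requires identical shapes and keeps the channel count fixed, whereas concatenation only requires matching spatial resolution and increases the number of channels seen by the next layer.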

Skip connections are a useful concept used widely in many cutting-edge architectures.

We will see this concept again later in this lecture when we look at UNets.

**Exercises:** *You can now complete Exercise 1 in the exercises associated with this lecture.*

## Inception 

### Motivation for Inception

What size convolutional kernel should we use?

Can decide empirically by cross-validation but could also use many at once and let network decide how to combine.

This is the general idea behind the Inception module.  

Leverages 1x1 convolutions.

### 1x1 convolution

1x1 convolution is a powerful layer used in many cutting-edge deep learning architectures.

But isn't a 1x1 convolution just multiplying by a number?

It is for a single channel but not when considering multiple input and output channels.

#### Graphical illustration of 1x1 convolution

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/1x1-convolution1.png" width="700px" style="display:block; margin:auto"/>

When there are multiple input channels, the weighting, summation and activation are applied across channels.

Acts like a fully-connected neural network across channels.  Sometimes called *network in a network*.

Then repeat with multiple filters to create multiple output channels.
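
To make the "fully connected across channels" view concrete, here is a minimal NumPy sketch. The function `conv1x1`, the channels-last layout and the sizes are assumptions chosen for illustration.

```python
import numpy as np


def conv1x1(x, W, b):
    """1x1 convolution followed by ReLU.

    x: input of shape (height, width, channels_in)
    W: weights of shape (channels_in, channels_out)
    b: biases of shape (channels_out,)
    """
    z = np.einsum("hwc,cd->hwd", x, W) + b  # mix channels at every pixel
    return np.maximum(z, 0.0)


rng = np.random.default_rng(0)
x = rng.normal(size=(28, 28, 192))
W = rng.normal(size=(192, 16))
b = np.zeros(16)

y = conv1x1(x, W, b)
print(y.shape)  # (28, 28, 16)

# The same dense layer applied to the channel vector of a single pixel.
pixel = np.maximum(x[3, 5] @ W + b, 0.0)
print(np.allclose(y[3, 5], pixel))
```

Each pixel is processed independently by the same dense layer, so the spatial resolution is unchanged while the number of channels goes from 192 to 16.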

#### 1x1 convolution to control channel size

1x1 convolutions are often used to control the number of channels at intermediate points in a network, e.g. as a channel bottleneck (as we'll see shortly in the Inception module).




### Inception module

We saw how 1x1 convolutions can be considered as a *network in a network*.

Need to go deeper!

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/we_need_to_go_deeper.jpeg" width="700px" style="display:block; margin:auto"/>

#### General Inception module

Consider multiple kernel sizes at once and a pooling layer.  Then concatenate outputs.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/inception_module_szegedy_1.png" width="700px" style="display:block; margin:auto"/>

[[Credit: Szegedy et al.](https://arxiv.org/abs/1409.4842)]

This architecture can quickly become computationally demanding.

#### Inception module with 1x1 convolutions

Include 1x1 convolutions as a channel bottleneck to reduce computational cost.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/inception_module_szegedy_2.png" width="700px" style="display:block; margin:auto"/>

[[Credit: Szegedy et al.](https://arxiv.org/abs/1409.4842)]

#### Example of Inception module computational costs

Consider a 28x28 input feature map, with 192 channels.  Require an output map with resolution 28x28 and 32 channels ("same" convolution so input and output resolutions the same).

##### Standard convolutional layer

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/inception_no_bottleneck.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)]

Number of flops = (28 x 28 x 32) x (5 x 5 x 192) = 120,422,400 ≈ 120 million

##### 1x1 convolution bottleneck

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/inception_bottleneck.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://medium.com/machine-learning-bites/deeplearning-series-convolutional-neural-networks-a9c2f2ee1524)]

Number of flops = (28 x 28 x 16) x (1 x 1 x 192) + (28 x 28 x 32) x (5 x 5 x 16) = 2,408,448 + 10,035,200 = 12,443,648 ≈ 12 million

Generally, performance is not significantly degraded (within reason).
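
The two flop counts above can be reproduced with a small helper. This is a sketch that counts multiplications only, following the convention used in the calculations above; the helper name `conv_flops` is hypothetical.

```python
def conv_flops(out_height, out_width, out_channels, kernel, in_channels):
    """Multiplications in a convolution: one kernel application per output value."""
    return (out_height * out_width * out_channels) * (kernel * kernel * in_channels)


# Direct 5x5 convolution from 192 to 32 channels.
direct = conv_flops(28, 28, 32, kernel=5, in_channels=192)

# 1x1 bottleneck down to 16 channels, then 5x5 convolution up to 32 channels.
reduce_cost = conv_flops(28, 28, 16, kernel=1, in_channels=192)
conv_cost = conv_flops(28, 28, 32, kernel=5, in_channels=16)
bottleneck = reduce_cost + conv_cost

print(f"direct 5x5:       {direct:,}")
print(f"with bottleneck:  {bottleneck:,}")
print(f"reduction factor: {direct / bottleneck:.1f}")
```

Changing the number of bottleneck channels (16 in this example) trades computational cost against the capacity of the intermediate representation.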

### Inception network / GoogLeNet architecture

Overall Inception network architecture (also called GoogLeNet, cf. LeNet) involves combining multiple Inception modules.

#### Inception module (from above but drawn sideways)

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/inception-module1.png" width="500px" style="display:block; margin:auto"/>

#### Inception network

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/googlenet_diagram1.png" width="1000px" style="display:block; margin:auto"/>

[[Credit: Szegedy et al.](https://arxiv.org/abs/1409.4842)]

## MobileNet

### Motivation for MobileNet

The architectures we've seen above are computationally demanding (even when leveraging 1x1 convolution channel bottlenecks).

The MobileNet architecture is more computationally efficient, for example for low-cost deployment on mobile devices (hence its name).

Based on *depthwise separable convolution*, which includes a *depthwise convolution*, followed by a *pointwise convolution*.

### Recap standard convolution

#### Multiple input channels, single output channel

Consider 5x5 kernel with no padding and stride of one.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/dsc_normal_conv_1_channel.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)]

Number of flops = (8 x 8) x (5 x 5 x 3) = 4,800 (no padding)

#### Multiple input channels, multiple output channels

Repeat the above for each output channel.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/dsc_normal_conv_256_channels_annotated.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)]

Number of flops = (8 x 8 x 256) x (5 x 5 x 3) = 1,228,800 ≈ 1.2 million

### Depthwise convolution

One filter for each channel (no summation over channels).  Number of input and output channels are the same.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/dsc_depthwise_conv.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)]

Number of flops = (8 x 8) x (5 x 5) x 3 = 4,800

Computational cost is reduced to that of the single output channel setting.

But no mixing across channels and number of output channels must be identical to input.
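
A depthwise convolution can be sketched directly in NumPy. The implementation below is a simple loop-based illustration (not an efficient one) with no padding and stride one; the 12x12 input size is inferred from the 8x8 output and 5x5 kernel used above, and the function name `depthwise_conv` is hypothetical.

```python
import numpy as np


def depthwise_conv(x, kernels):
    """Depthwise convolution with no padding and stride one.

    x: input of shape (height, width, channels)
    kernels: one k x k filter per channel, shape (k, k, channels)
    """
    k = kernels.shape[0]
    out_h, out_w = x.shape[0] - k + 1, x.shape[1] - k + 1
    out = np.zeros((out_h, out_w, x.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + k, j:j + k, :]
            # Sum over the spatial window only; channels are kept separate.
            out[i, j, :] = np.sum(patch * kernels, axis=(0, 1))
    return out


rng = np.random.default_rng(0)
x = rng.normal(size=(12, 12, 3))
kernels = rng.normal(size=(5, 5, 3))

y = depthwise_conv(x, kernels)
print(y.shape)  # (8, 8, 3)
```

Because the sum runs over the spatial window only, changing one input channel affects only the corresponding output channel.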

### Pointwise convolution

Introduce mixing across channels using pointwise convolutions (1x1 convolutions).  

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/dsc_pointwise_conv_1_channel.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)]

Number of flops = (8 x 8) x (1 x 1 x 3) = 192

Can also control number of output channels.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/dsc_pointwise_conv_256_channels_annotated.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728)]

Number of flops = (8 x 8 x 256) x (1 x 1 x 3) = 49,152

### Depthwise separable convolutions

Depthwise separable convolutions include depthwise convolution, followed by pointwise convolution.  Separable since we separate the spatial and channel mixing.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/Depthwise-separable-convolution-block.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://www.researchgate.net/figure/Depthwise-separable-convolution-block_fig1_343943234)]

In the example considered above, 1.2 million flops (standard convolution) $\rightarrow$ 4,800 (depthwise convolutions) + 49,152 (pointwise convolutions) = 53,952 (same output resolution and number of channels).

Generally, performance is not significantly degraded (within reason).
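
The costs in this example can be reproduced as follows, again counting multiplications only. The variable names are chosen for illustration.

```python
out_h, out_w = 8, 8       # output resolution
k = 5                     # kernel size
c_in, c_out = 3, 256      # input and output channels

standard = (out_h * out_w * c_out) * (k * k * c_in)
depthwise = (out_h * out_w) * (k * k) * c_in
pointwise = (out_h * out_w * c_out) * (1 * 1 * c_in)
separable = depthwise + pointwise

print(standard)   # 1228800
print(separable)  # 53952

# Ratio of costs; algebraically equal to 1 / c_out + 1 / k**2.
print(separable / standard)
```

The ratio $1/c_\text{out} + 1/k^2$ shows that, when there are many output channels, the saving is governed mainly by the kernel size.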

### MobileNet architectures

#### MobileNet v1

Use depthwise separable convolutions as building block for architecture.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/mobilenet_v1.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://www.coursera.org/learn/convolutional-neural-networks/)]

#### MobileNet v2

Add a pointwise 1x1 convolution before the depthwise separable convolution to increase number of channels in the intermediate stage (results in inverted channel bottleneck).

Also include residual connection (same as ResNet).

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/mobilenet_v2.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://www.coursera.org/learn/convolutional-neural-networks/)]

## UNet

### Motivation for UNet

So far we've considered problems where the outputs of the machine learning model are very low resolution, e.g. classification, low-dimensional regression.

For many problems we require high-resolution outputs for dense predictions.

We need to modify architectures to support dense predictions.

#### Semantic segmentation

Semantic segmentation is a common type of problem where we require a high-resolution output with dense predictions.

Goal is to predict a class for every single pixel in an image.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/semantic_segmentation.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://www.jeremyjordan.me/semantic-segmentation/)]

#### Segmentation of medical images

The UNet architecture was initially proposed for segmentation of medical images but has proven widely useful.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/semantic_segmentation_unet.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://arxiv.org/abs/1701.08816)]

### General approach

Naive approach is to adopt standard architectures and stay at high resolution throughout.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/cnn_high_res_naive.png" width="900px" style="display:block; margin:auto"/>

[[Credit](http://cs231n.stanford.edu/)]

If we don't reduce image resolution through the network, then we need very large kernels deeper into the network to have larger receptive fields.  It is also difficult to increase the number of channels due to computational costs.

Becomes extremely computationally demanding.

Alternative approach is to reduce image resolution through network as usual but then include subsequent layers to increase resolution.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/cnn_high_res_down_up.png" width="900px" style="display:block; margin:auto"/>

[[Credit](http://cs231n.stanford.edu/)]

Require an upsampling layer.

### Transpose convolution

Transpose convolutional layer provides a way to upsample images.

#### Mathematical representation

Recall the convolution output is given by
\begin{align*}
z_{i,j} = \sum_{u,v} w_{u-i,v-j} x_{u,v} ,
\end{align*}
where $x$ is the input image, $w$ is the filter (kernel) and $i$ ($u$) and $j$ ($v$) denote row and column indices, respectively.

Can represent convolution in matrix form:

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/conv_eq.png" width="900px" style="display:block; margin:auto"/>

Convolution applied by matrix multiplication with $\mathsf{W}$.

Consider multiplication of transpose of  $\mathsf{W}$:

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/transpose_conv_eq.png" width="900px" style="display:block; margin:auto"/>

Can see transpose convolution involves placing shifted kernels down on *output*, weighting by input at shifted position and summing.

Contrast with convolution, which involves placing shifted kernels down on *input*, weighting by inputs that overlap and summing.
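
The matrix view can be verified numerically. The sketch below builds $\mathsf{W}$ explicitly for a small single-channel example (4x4 image, 3x3 kernel, no padding, stride one; all sizes are assumptions for illustration), following the indexing convention of the convolution formula above, and checks that multiplying by the transpose matches placing shifted kernels on the output. The helper `conv_matrix` is hypothetical.

```python
import numpy as np


def conv_matrix(kernel, n):
    """Matrix W such that W @ x.ravel() convolves an n x n image x with `kernel`
    (no padding, stride one), using z[i,j] = sum_{u,v} w[u-i,v-j] x[u,v]."""
    k = kernel.shape[0]
    m = n - k + 1  # output is m x m
    W = np.zeros((m * m, n * n))
    for i in range(m):
        for j in range(m):
            for a in range(k):
                for b in range(k):
                    W[i * m + j, (i + a) * n + (j + b)] = kernel[a, b]
    return W


rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
W = conv_matrix(kernel, n=4)           # shape (4, 16)

x = rng.normal(size=(4, 4))
z = (W @ x.ravel()).reshape(2, 2)      # convolution: 4x4 -> 2x2

y = rng.normal(size=(2, 2))
up = (W.T @ y.ravel()).reshape(4, 4)   # transpose convolution: 2x2 -> 4x4

# Same result by placing shifted kernels on the output, weighted by the input.
scatter = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        scatter[i:i + 3, j:j + 3] += y[i, j] * kernel

print(z.shape, up.shape)  # (2, 2) (4, 4)
print(np.allclose(up, scatter))
```

Note that the transpose convolution restores the shape of the original image, not its values: $\mathsf{W}^\top$ is not the inverse of $\mathsf{W}$.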

#### Graphical representation

Consider input, kernel and output shape.

<!--
<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/transpose_conv_input.png" width="150px" style="display:block; margin:auto"/>
<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/transpose_conv_kernel.png" width="150px" style="display:block; margin:auto"/>
<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/transpose_conv_output.png" width="200px" style="display:block; margin:auto"/>

[[Credit](https://towardsdatascience.com/transposed-convolution-demystified-84ca81b4baba#)]
-->

Transpose convolution is given by placing shifted kernels down on *output*, weighting by input at shifted position and summing.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/transpose_conv_result.png" width="1000px" style="display:block; margin:auto"/>

[[Credit](https://towardsdatascience.com/transposed-convolution-demystified-84ca81b4baba#)]
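
The procedure can be written as a short loop. The following is a minimal single-channel sketch for square inputs and kernels with no padding; the `stride` argument and the example values are assumptions added for illustration.

```python
import numpy as np


def transpose_conv2d(x, kernel, stride=1):
    """Transpose convolution of a square single-channel input (no padding)."""
    n, k = x.shape[0], kernel.shape[0]
    size = (n - 1) * stride + k
    out = np.zeros((size, size))
    for i in range(n):
        for j in range(n):
            r, c = i * stride, j * stride
            # Place the kernel at the shifted position, weighted by the input.
            out[r:r + k, c:c + k] += x[i, j] * kernel
    return out


x = np.array([[0.0, 1.0],
              [2.0, 3.0]])
kernel = np.array([[0.0, 1.0],
                   [2.0, 3.0]])

print(transpose_conv2d(x, kernel, stride=1))
# rows: [0, 0, 1], [0, 4, 6], [4, 12, 9]
print(transpose_conv2d(x, kernel, stride=2).shape)  # (4, 4)
```

The output size is `(n - 1) * stride + k`, so a stride of two with a 2x2 kernel doubles the resolution, which is the behaviour needed for an upsampling stage.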

### UNet architecture

UNet architecture leverages standard convolutions and pooling for downsampling stages.

Then adopts transpose convolution for upsampling stages.

Also includes skip connections to copy higher resolution feature maps from the downsampling path to the upsampling path.

<img src="https://raw.githubusercontent.com/astro-informatics/course_mlbd_images/master/Lecture15_Images/unet_paper.png" width="700px" style="display:block; margin:auto"/>

[[Credit](https://arxiv.org/abs/1505.04597)]