Summary of - SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
Paper by: Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla (arXiv 2015; IEEE TPAMI 2017)
Some videos I used to play around with SegNet
For more info on replicating this visit my repo here
TLDR:
- Uses a novel technique to upsample the encoder output: the max-pooling indices from each pooling layer are stored and reused in the decoder. This gives reasonably good performance and is space efficient.
- VGG16 with only forward connections and non-trainable layers is used as the encoder. This leads to far fewer parameters.
Problem
- Semantic pixel-wise labelling, i.e. labelling each pixel of an image as belonging to some class (tree, road, sky, etc.), as shown in the image.
Applications include autonomous driving, scene understanding, etc. Directly adopting classification networks for pixel-wise segmentation yields poor results, mainly because max-pooling and subsampling reduce the feature map resolution and hence the output resolution. Even if the output is extrapolated back to the original resolution, the resulting image is lossy.
SegNet - Challenges
- Trained on road scene datasets, so the classes represent macro objects and the segmentations should be smooth
- Boundary information is critical for road markings and other small objects (boundary delineation)
- The major use cases are embedded systems, so it must be computationally efficient
SegNet - Architecture
Encoder-decoder pairs are used to create feature maps at different resolutions for pixel-wise classification.
Encoder
- 13 VGG16 Conv layers
- The fully connected layers are dropped, which reduces the encoder's parameters from 134M to 14.7M
- Good initial (pretrained) weights are available, so these layers are made non-trainable
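The 14.7M figure can be checked with a little arithmetic. The sketch below counts weights and biases for the 13 conv layers of the standard VGG16 configuration (3x3 kernels). It is an assumption on my part that the 134M figure corresponds to those conv layers plus the two 4096-unit fully connected layers at a 224x224 input; the names `VGG16_CONVS`, `conv_params` and `fc_params` are made up for illustration.

```python
# (in_channels, out_channels) for each of the 13 VGG16 conv layers (3x3 kernels).
VGG16_CONVS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

encoder_params = sum(conv_params(i, o) for i, o in VGG16_CONVS)

# Assumption: 224x224 input, so the last conv feature map is 7x7x512,
# followed by two 4096-unit fully connected layers.
fc_params_total = fc_params(7 * 7 * 512, 4096) + fc_params(4096, 4096)

print(encoder_params)                    # 14714688, about 14.7M
print(encoder_params + fc_params_total)  # 134260544, about 134M
```

Almost all of the saving comes from the first fully connected layer, which alone holds over 100M weights.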
Each encoder is like Fig 3. The novelty is in the subsampling stage. Max-pooling is used to achieve translation invariance over small spatial shifts in the image; combined with subsampling, it gives each pixel in the feature map a larger input image context (spatial window). These operations improve classification accuracy but reduce the feature map size, which leads to a lossy image representation with blurred boundaries. That is not ideal for segmentation.
The output resolution should be the same as the input resolution. To achieve this, SegNet upsamples in its decoder, and to do that it needs some stored information: boundary information has to be captured from the encoder feature maps before subsampling. To do this space-efficiently, SegNet stores only the max-pooling indices, i.e. the location of the maximum feature value in each pooling window is memorised for each encoder feature map. Only 2 bits are needed for each 2x2 window. This costs a slight loss of precision, which is the tradeoff.
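A minimal pure-Python sketch of the idea, not the paper's implementation: 2x2 max-pooling with stride 2 on a single-channel feature map, remembering which of the four positions in each window held the maximum. The function name, the row-major index encoding (0 to 3) and the toy values are my own choices for illustration.

```python
def max_pool_2x2_with_indices(fmap):
    """2x2, stride-2 max-pooling over a list of rows (even height and width).

    Returns (pooled, indices). Each index is 0..3 and encodes the position
    of the maximum inside its window in row-major order, so it fits in 2 bits.
    """
    h, w = len(fmap), len(fmap[0])
    pooled, indices = [], []
    for r in range(0, h, 2):
        pooled_row, index_row = [], []
        for c in range(0, w, 2):
            window = [fmap[r][c], fmap[r][c + 1],
                      fmap[r + 1][c], fmap[r + 1][c + 1]]
            best = max(range(4), key=lambda i: window[i])
            pooled_row.append(window[best])
            index_row.append(best)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices

# Toy 4x4 feature map (made-up values).
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_2x2_with_indices(feature_map)
print(pooled)   # [[4, 5], [6, 7]]
print(indices)  # [[2, 3], [2, 0]]

# Storage for the decoder: 2 bits per window instead of the whole map.
index_bits = 2 * sum(len(row) for row in indices)
print(index_bits)  # 8
```

For comparison, keeping the full 4x4 map in 32-bit floats would take 512 bits, so the indices are a much cheaper way to remember where the boundaries were.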
- Advantages
- Improved boundary delineation
- Fewer parameters
This form of upsampling can be incorporated into any encoder-decoder architecture.
Decoder
- For each of the 13 encoders there is a corresponding decoder, which upsamples the feature map using the memorised max-pooling indices
- This produces sparse feature maps of higher resolution
- The sparse maps are fed through a trainable filter bank to produce dense feature maps
- The last decoder is connected to a softmax classifier, which classifies each pixel
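The decoder steps above can be sketched in pure Python on a single channel. This is a simplified illustration under stated assumptions, not the real network: the two 3x3 kernels are hypothetical stand-ins for learned filters (one per class), the pooled values and indices are toy numbers, and multi-channel maps and normalisation are left out.

```python
import math

def unpool_2x2(pooled, indices):
    """Place each pooled value at its remembered position; the rest stay 0."""
    h, w = len(pooled), len(pooled[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in range(h):
        for c in range(w):
            dr, dc = divmod(indices[r][c], 2)  # row-major position in window
            out[2 * r + dr][2 * c + dc] = pooled[r][c]
    return out

def conv3x3_same(fmap, kernel):
    """3x3 filtering with zero padding, so the output keeps the input size."""
    h, w = len(fmap), len(fmap[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total = 0.0
            for kr in range(3):
                for kc in range(3):
                    rr, cc = r + kr - 1, c + kc - 1
                    if 0 <= rr < h and 0 <= cc < w:
                        total += kernel[kr][kc] * fmap[rr][cc]
            out[r][c] = total
    return out

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy encoder output: pooled values and their 2-bit max-pooling indices.
pooled = [[4, 5], [6, 7]]
indices = [[2, 3], [2, 0]]

sparse = unpool_2x2(pooled, indices)  # 4x4 map, mostly zeros

# Hypothetical filter bank: one kernel per class (these would be learned).
class_kernels = [
    [[1 / 9] * 3 for _ in range(3)],    # smoothing kernel
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],  # identity kernel
]
score_maps = [conv3x3_same(sparse, k) for k in class_kernels]  # dense maps

# Per-pixel softmax over the class scores, then pick the most likely class.
probs = [[softmax([m[r][c] for m in score_maps]) for c in range(4)]
         for r in range(4)]
labels = [[p.index(max(p)) for p in row] for row in probs]
```

Here `sparse` puts each maximum back exactly where it came from, which is what preserves boundary locations; the filtering step then fills in the zeros around those positions.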
The SegNet paper compares its technique with several other decoders, as shown in Fig 5.