In this post, we would like to review selected papers on image inpainting, the task of filling in a missing region of a picture. The goal is to generate content that looks as realistic as possible, as in the example below from Generative Image Inpainting with Contextual Attention (2018).

In this post, we would like to cover 3 papers to get a glimpse of how the field has evolved.

  • Context Encoders: Feature Learning by Inpainting (2016)
  • Generative Image Inpainting with Contextual Attention (2018)
  • Image Inpainting for Irregular Holes Using Partial Convolutions (2018)

The main idea of the first paper, Context Encoders, is to generate the missing part of the image using an encoder-decoder structure trained with an adversarial loss.

The input is a picture with a missing region in the center. The encoder produces a latent feature representation of that image, and the decoder generates the missing image content. The network is trained so that the generated content matches the ground truth of the missing region.

To train the network, the authors propose the following losses.

  1. Reconstruction loss

    The reconstruction loss is an L2 loss between the ground-truth image and the produced image.
  2. Adversarial loss

    The adversarial loss pushes the generator to produce images that the discriminator cannot distinguish from real ones.
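The two losses above can be sketched numerically. This is an illustrative NumPy sketch, not the authors' code: the tensor shapes, the discriminator score, and the loss weights are stand-in values (the paper weights reconstruction far more heavily than the adversarial term).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tensors: ground-truth missing region, generator output, and
# the discriminator's probability that the generated region is real.
gt = rng.random((64, 64, 3))       # ground-truth content of the hole
pred = rng.random((64, 64, 3))     # generator's prediction for the hole
d_fake = 0.3                       # D(G(x)): discriminator score on the fake

# Reconstruction loss: L2 distance between prediction and ground truth.
rec_loss = np.mean((pred - gt) ** 2)

# Adversarial loss (non-saturating form): the generator is rewarded when
# the discriminator believes the generated patch is real.
adv_loss = -np.log(d_fake)

# The total loss is a weighted sum; the lambdas here are placeholders.
total_loss = 0.999 * rec_loss + 0.001 * adv_loss
print(rec_loss, adv_loss, total_loss)
```

In training, `rec_loss` keeps the output close to the context while `adv_loss` sharpens it; using reconstruction alone tends to produce blurry results.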

The image below shows a result from the paper; as you can see, there is still room to improve the quality of the produced image.

Furthermore, the architecture depends on the missing region’s shape, which could be inconvenient in a real-world application.


This paper uses a two-stage network and a contextual attention module to improve the quality of the generated images.

The architecture is described as a coarse-to-fine network.

  • Coarse network :
    • architecture: convolution + dilated convolution + upsampling + convolution
    • input: input with hole, mask, 1-mask (5 channel image)
    • loss: spatially discounted L1 loss (reconstruction loss)
      • The inpainting task can have many plausible solutions for a given context. So the strong enforcement of the reconstruction loss in those pixels may mislead the training process.
      • Missing pixels near the hole boundaries have much less ambiguity than those pixels closer to the center of the hole.
      • So weight the loss according to the distance from the border.
  • Refinement Network :
    • The architecture contains two parallel paths whose outputs are concatenated.
    • Input: composite of the original image and coarse output, mask, 1-mask (5 channel image)
    • During the forward pass, the network splits into the two paths after the initial convolution + dilated convolution layers.
    • The contextual layer is described in detail below.
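The spatially discounted reconstruction loss used by the coarse network can be sketched as follows. This is an illustrative helper, not the authors' code; it assumes a rectangular hole, and `gamma = 0.99` follows the discount factor reported in the paper.

```python
import numpy as np

def discount_mask(hole_h, hole_w, gamma=0.99):
    """Per-pixel weights for a rectangular hole: gamma ** distance, where
    distance is the pixel's distance (in pixels) to the nearest hole
    boundary. Border pixels get weight 1; the centre is discounted most."""
    ys = np.arange(hole_h)
    xs = np.arange(hole_w)
    # distance of each row / column to its nearest border
    dy = np.minimum(ys, hole_h - 1 - ys)
    dx = np.minimum(xs, hole_w - 1 - xs)
    dist = np.minimum.outer(dy, dx)   # nearest-border distance per pixel
    return gamma ** dist

w = discount_mask(8, 8)
print(w[0, 0], w[4, 4])   # border weight vs. centre weight
```

The weighted loss is then simply `mean(w * |pred - gt|)` over the hole, so ambiguous pixels deep inside the hole contribute less to training.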


  • Contextual Layer
    • The contextual attention layer is a module in the refinement network that computes similarities between regions of the image. The motivation is that when you reconstruct a missing region, you look for the most similar-looking regions in other parts of the same image.
    • To calculate the similarity efficiently, the authors propose the following method: extract patches + convolution + softmax.
    • The pictures below may aid understanding.
      1. Extract patches from the features.
      2. Take one of the extracted patches and use it as a filter to convolve with the feature map itself.
      3. Repeat for all extracted patches and aggregate the results channel-wise.
      4. Repeat for the mask as well and multiply with the features to send the masked channels to zero.
      5. Take the softmax of the output to normalize the contextual attention.
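The steps above can be sketched in NumPy. This is an illustrative toy version, not the authors' implementation: the feature map and patch size are made up, the patch-as-filter convolution is written as explicit loops for clarity, and the mask-multiplication step is noted in comments rather than coded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature map (H, W, C); in the paper these are refinement-network
# features, and patches come from the known (background) region.
feat = rng.random((6, 6, 4))
k = 2  # patch size (the paper extracts 3x3 patches on the feature map)

# Step 1: extract k x k patches from the features.
patches = []
for i in range(feat.shape[0] - k + 1):
    for j in range(feat.shape[1] - k + 1):
        patches.append(feat[i:i + k, j:j + k, :])
patches = np.stack(patches)               # (num_patches, k, k, C)

# Steps 2-3: use each (L2-normalised) patch as a convolution filter over
# the feature map; the dot product at each location is a cosine-style
# similarity. Stacking the per-patch maps gives one "channel" per patch.
flat_norms = np.linalg.norm(patches.reshape(len(patches), -1), axis=1)
norm = patches / (flat_norms[:, None, None, None] + 1e-8)
scores = np.zeros((len(patches), feat.shape[0] - k + 1, feat.shape[1] - k + 1))
for p, patch in enumerate(norm):
    for i in range(scores.shape[1]):
        for j in range(scores.shape[2]):
            scores[p, i, j] = np.sum(patch * feat[i:i + k, j:j + k, :])

# Step 4 (not shown): in the paper, patches from masked regions are
# zeroed out so holes never attend to other holes.

# Step 5: softmax across patches turns scores into attention weights.
attn = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
print(attn.shape)  # one attention weight per patch at each location
```

The attention weights are then used to reconstruct the hole as a weighted combination of the background patches (implemented as a transposed convolution in the paper).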

The paper also proposes visualizing the attention to indirectly gauge where the network pays the most attention when generating the missing region. For a more detailed explanation, check out the paper. Below are some results of the contextual attention network.




This work produces the missing region of the image using an existing U-Net plus partial convolutions. A partial convolution is a convolution operation that uses only the non-masked region.

The operation takes in both the image and the mask and produces a feature map and a slightly shrunken mask. Therefore, as you stack partial convolution layers, the masked region gets smaller and smaller.
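The operation can be sketched as follows. This is an illustrative single-channel, valid-padding toy, not the authors' code (the paper operates on multi-channel feature maps inside a U-Net): the output at each window is computed over known pixels only, re-scaled by the fraction of valid pixels, and the mask entry becomes 1 wherever the window saw any known pixel.

```python
import numpy as np

def partial_conv(img, mask, kernel):
    """Single-channel partial convolution sketch.
    img:    (H, W) image; mask: (H, W) with 1 = known pixel, 0 = hole;
    kernel: (k, k) weights. Valid padding for brevity."""
    k = kernel.shape[0]
    H, W = img.shape
    out = np.zeros((H - k + 1, W - k + 1))
    new_mask = np.zeros_like(out)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            m = mask[i:i + k, j:j + k]
            valid = m.sum()
            if valid > 0:
                # convolve over known pixels only, re-scaled by the
                # fraction of valid pixels under the window
                x = img[i:i + k, j:j + k] * m
                out[i, j] = np.sum(kernel * x) * (k * k / valid)
                new_mask[i, j] = 1.0  # window saw at least one known pixel
    return out, new_mask

img = np.ones((5, 5))
mask = np.ones((5, 5)); mask[2, 2] = 0          # one missing pixel
out, new_mask = partial_conv(img, mask, np.ones((3, 3)) / 9)
print(new_mask)  # the hole is gone after one layer in this tiny example
```

On this constant image the re-scaling makes every output exactly 1.0 even around the hole, which illustrates why the normalization term matters.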

This work is notable because it only adds partial convolutions and a combination of popular losses to the existing U-Net architecture, yet generates images of remarkable quality. To train this network, the authors propose the following losses.

  1. Valid loss (L1 loss on the non-masked region)
  2. Hole loss (L1 loss on the masked region)
  3. Perceptual loss
  4. Style loss
  5. Total variation loss
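Three of the five losses can be sketched directly in NumPy. This is an illustrative sketch, not the authors' code: the shapes are toy values, and the total variation term is taken over the whole composited image for brevity (the paper restricts it to a 1-pixel dilation of the hole region). The perceptual and style losses compare VGG feature maps and their Gram matrices, so they need a pretrained network and are only noted in comments.

```python
import numpy as np

rng = np.random.default_rng(0)

gt = rng.random((16, 16, 3))     # ground-truth image
pred = rng.random((16, 16, 3))   # network output
mask = np.ones((16, 16, 1)); mask[4:12, 4:12] = 0   # 1 = known, 0 = hole

# 1. Valid loss: mean L1 error over the known (non-masked) region.
valid_loss = np.sum(np.abs(mask * (pred - gt))) / mask.sum() / 3

# 2. Hole loss: mean L1 error over the masked region.
hole_loss = np.sum(np.abs((1 - mask) * (pred - gt))) / (1 - mask).sum() / 3

# 5. Total variation loss on the composited image: penalises abrupt
# jumps between neighbouring pixels, smoothing the hole boundary.
comp = mask * gt + (1 - mask) * pred
tv_loss = (np.mean(np.abs(comp[1:, :, :] - comp[:-1, :, :])) +
           np.mean(np.abs(comp[:, 1:, :] - comp[:, :-1, :])))

# 3-4. Perceptual and style losses would compare VGG features of pred,
# comp, and gt (and Gram matrices of those features for style).
print(valid_loss, hole_loss, tv_loss)
```

The final objective is a weighted sum of all five terms; the hole loss is typically weighted more heavily than the valid loss since the hole is where the network must invent content.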


Combining the U-Net and partial convolutions with the above five losses, the network can be trained to carry out the inpainting task as shown below.



Image inpainting is an active field of research. Though not covered in this article, check out the paper “Free-Form Image Inpainting with Gated Convolution” to see how the authors combine ideas from partial convolutions and contextual attention to overcome the limitations of the previous works.

Posted by: Minchul David Kim

Researcher @Lunit
