CycleGAN-and-Pix2Pix Run-on-Ainize(Eng,Kor)

Jan 22, 2020

Overview

GAN (Generative Adversarial Nets)

GANs were introduced in a paper that Goodfellow presented at NIPS in 2014, and the GAN 'boom' took off in earnest in 2016.

A GAN is an artificial intelligence algorithm used in unsupervised learning, implemented as two neural networks that compete with each other within a zero-sum framework.

Conceptually, it learns by making a 'generator', which creates images, compete with a 'discriminator', which judges whether an image is real or fake. The generator produces fake content like a currency forger, and the discriminator, like the police, judges whether it is close to the real thing. As this process repeats, the generator comes to produce content that the discriminator can hardly tell apart from the real thing.

In short, it can be seen as an unsupervised learning model that goes beyond the limits of human-supervised learning and works out for itself what it needs to learn.

Yann LeCun of Facebook has called it the best idea in machine learning of the last 10 to 20 years, and GANs have shown remarkable results not only in image generation but also in other fields such as Natural Language Processing (NLP).

Photo 1 - NVIDIA StyleGAN - Virtual people generated from photos of famous celebrities

The paper discussed here was written by Phillip Isola, who comes from the lab of Professor Alexei A. Efros at UC Berkeley, and it deals with image-to-image translation using Conditional Generative Adversarial Networks (cGANs).

Building on this paper, the pytorch-CycleGAN-and-pix2pix repository implements the CycleGAN and Pix2Pix models using cGANs. Depending on the option, the models are trained and evaluated on roughly 400 to 3,000 images (the pretrained files can be downloaded by following the git repository).

In addition, run-on-ainized-pytorch-cyclegan-and-pix2pix (http://34.85.23.67:80) is provided, which lets you run some of the options with the pretrained models of the above code and check the results without any further configuration.

Model architecture

Pix2Pix

Photo 2 - Turning an input into a realistic image: the real image (ground truth) and the CNN output (L1)

First, let's find out why this architecture was introduced in this paper.
Earlier studies pursued photo-realism with image-to-image mapping networks. The rightmost column (L1) shows the result of using a plain network alone.

Consider first the usual approach to image-to-image translation. When a label map is turned into a real image, as above, the label and the real image (the ground truth) exist as a pair, and a CNN tries to produce the real image from the input label.

The CNN is trained on the L1 loss above. Rather than finding the best possible answer for each pixel, it tries to reduce the loss over all pixels as a whole. As a result, it settles for a 'safe' pixel value instead of estimating the correct one, which tends to make the output blurry.
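The 'safe value' effect can be reproduced with a toy calculation. The sketch below assumes a single pixel with three equally plausible ground-truth values (0.0, 0.4 and 1.0, made-up numbers that do not come from the paper) and searches for the one prediction that minimizes each loss. The helper names are hypothetical.

```python
def best_constant(targets, loss, steps=100):
    """Grid-search the single prediction in [0, 1] with the lowest summed loss."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda p: sum(loss(p, t) for t in targets))


def l1(prediction, target):
    return abs(prediction - target)


def l2(prediction, target):
    return (prediction - target) ** 2


# Assumption: three equally plausible "true" values for the same input pixel.
plausible_values = [0.0, 0.4, 1.0]

l1_best = best_constant(plausible_values, l1)  # L1 is minimized by the median
l2_best = best_constant(plausible_values, l2)  # L2 is minimized by the mean

print("L1 picks", l1_best, "and L2 picks", l2_best)
```

In both cases the minimizer is a compromise between the plausible answers rather than a commitment to one of them, which is the behavior the paragraph above describes.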

This is a natural choice from the network's point of view, but to a human viewer the result does not look photo-realistic.
So, once GANs became popular, GAN-style adversarial training was introduced into image-to-image translation as a way to build models that produce more photo-realistic results.
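One way to picture the combination is as a generator loss with two terms: an adversarial term that rewards fooling the discriminator, and the L1 term from before. The following is a minimal pure-Python sketch, not the repository's implementation; the function names are hypothetical, and the weight `lam=100.0` is taken from the setting reported in [1], so treat it as an assumption here.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one discriminator score in (0, 1)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pix2pix_generator_loss(d_scores_on_fake, fake_pixels, real_pixels, lam=100.0):
    # Adversarial term: the generator wants the discriminator to answer "real" (1).
    adversarial = mean(bce(score, 1.0) for score in d_scores_on_fake)
    # L1 term: keeps the output close to the paired ground truth.
    reconstruction = mean(abs(f - r) for f, r in zip(fake_pixels, real_pixels))
    return adversarial + lam * reconstruction


# Toy numbers: an undecided discriminator (0.5) and a two-pixel "image".
loss_exact = pix2pix_generator_loss([0.5], [0.2, 0.8], [0.2, 0.8])
loss_off = pix2pix_generator_loss([0.5], [0.3, 0.8], [0.2, 0.8])
print(loss_exact, loss_off)
```

The L1 term still anchors the output to the ground truth, while the adversarial term penalizes outputs that the discriminator can recognize as fake.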

Photo 3 - Encoder-decoder and U-Net structures

In Pix2Pix, a U-Net is used as the generator network to improve performance.
A key property of the problems Pix2Pix targets is that, as in colorization, the input and the output have the same resolution. The output also largely keeps the detail and shape of the input.
A plain encoder-decoder structure, as shown in the figure, loses information along the way. The U-Net structure compensates for this by linking encoder and decoder layers with skip connections.
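To make the difference concrete, the sketch below traces feature-map shapes only (channels, height, width) through a small encoder and decoder, with and without skip connections. The image size and channel widths are hypothetical and chosen for illustration; they are not the layer sizes of the actual Pix2Pix generator.

```python
def encoder_shapes(size, channels):
    """Each encoder layer halves the spatial size and sets a new channel count."""
    shapes = []
    for c in channels:
        size //= 2
        shapes.append((c, size, size))
    return shapes


def decoder_shapes(enc, use_skips):
    """Each decoder layer upsamples to the resolution of the mirrored encoder layer.

    With skip connections, the mirrored encoder feature map is concatenated
    along the channel axis, so the decoder sees both sets of channels.
    """
    shapes = []
    for skip_c, skip_size, _ in reversed(enc[:-1]):
        upsampled_c = skip_c  # assumption: decoder mirrors the encoder widths
        merged_c = upsampled_c + skip_c if use_skips else upsampled_c
        shapes.append((merged_c, skip_size, skip_size))
    return shapes


enc = encoder_shapes(256, (64, 128, 256, 512))
plain = decoder_shapes(enc, use_skips=False)
unet = decoder_shapes(enc, use_skips=True)

print("encoder:        ", enc)
print("encoder-decoder:", plain)
print("U-Net:          ", unet)
```

In the plain encoder-decoder, everything the decoder knows has to pass through the smallest feature map. With skip connections, each decoder layer also receives the encoder features at the same resolution, which is how fine detail and shape reach the output.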
For the discriminator, a PatchGAN is used.
In a conventional GAN, the discriminator looks at the whole image and decides whether it is real or fake. PatchGAN makes this decision per overlapping patch of the image instead.
The loss is then back-propagated per patch, so the generator receives feedback on finer details.
In the actual implementation, applying the FCN idea and making the final layers fully convolutional yields a real/fake score map rather than a single real/fake score.
The loss is then defined against a label map of the same size (e.g. 1 for real, 0 for fake) and optimized.
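The sketch below works through the arithmetic of such a score map. It assumes a five-layer stack of 4x4 convolutions with strides 2, 2, 2, 1, 1 and padding 1, which is the commonly described 70x70 PatchGAN layout; this layout and the function names are assumptions, not something stated in this post.

```python
import math

# Assumed layout: (kernel, stride, padding) for each convolution layer.
LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]


def output_size(size, layers=LAYERS):
    """Side length of the score map for a square input of the given size."""
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


def receptive_field(layers=LAYERS):
    """Side length of the input patch that one score in the map looks at."""
    rf = 1
    for kernel, stride, _ in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def patch_loss(score_map, is_real, eps=1e-12):
    """Mean binary cross-entropy between the score map and a same-size label map."""
    target = 1.0 if is_real else 0.0
    total, count = 0.0, 0
    for row in score_map:
        for prob in row:
            prob = min(max(prob, eps), 1.0 - eps)
            total -= target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob)
            count += 1
    return total / count


map_side = output_size(256)
patch_side = receptive_field()
undecided = [[0.5] * map_side for _ in range(map_side)]
print(map_side, patch_side, patch_loss(undecided, is_real=True))
```

Under these assumptions a 256x256 input gives a 30x30 score map, and each score judges one 70x70 patch of the input. Neighboring patches overlap, which matches the description above.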

Photo 3 - Pix2Pix results

The images above show various results of Pix2Pix.
Each training set used up to about 3,000 images, and the architecture-to-photo task used only 400.
Training also uses a small batch size of 1 or 4, with Batch Normalization when the batch size is 4 and Instance Normalization when it is 1.
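The difference between the two normalizations is only which values share a mean and variance. The following pure-Python sketch shows training-time statistics only, with no learned scale or shift and no running averages; the nested-list data layout and the helper names are assumptions made for illustration.

```python
import math


def _normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def instance_norm(batch):
    """Normalize every channel of every sample on its own.

    batch[sample][channel] is a flat list of pixel values.
    """
    return [[_normalize(channel) for channel in sample] for sample in batch]


def batch_norm(batch):
    """Normalize each channel with statistics pooled over the whole batch."""
    n_channels = len(batch[0])
    out = [[None] * n_channels for _ in batch]
    for c in range(n_channels):
        pooled = [v for sample in batch for v in sample[c]]
        normed = _normalize(pooled)
        size = len(batch[0][c])  # assumes every sample has the same size
        for i in range(len(batch)):
            out[i][c] = normed[i * size:(i + 1) * size]
    return out


# Two single-channel "images": one dark, one bright.
pair = [[[0.0, 1.0]], [[10.0, 11.0]]]
inst = instance_norm(pair)
bn = batch_norm(pair)

# With a batch of one, both computations see exactly the same values.
single = [[[0.0, 1.0, 2.0]]]
same_for_batch_of_one = instance_norm(single) == batch_norm(single)
print(inst, bn, same_for_batch_of_one)
```

Instance normalization centers each image on its own, while batch normalization leaves the dark image below zero and the bright one above it. With a single image per batch the two coincide in this sketch.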

Cycle-GAN

CycleGAN is a follow-up paper from the same lab that published Pix2Pix. Its title is Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, and the key word is 'Unpaired'.
Training with a paired dataset, as in Pix2Pix, is ideal.
In practice, however, paired data is rarely available, and for style transfer, which CycleGAN mainly deals with (for example, turning a Monet painting into a photo), paired data naturally did not exist.
This paper therefore sets out to solve the problem with the idea of cycle consistency.
Its largest contribution was to introduce cycle consistency into Pix2Pix so that it also works on unpaired datasets.
ResNet, LSGAN, PatchGAN, and other components were used with high-resolution style transfer in mind.
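Cycle consistency asks that translating an image to the other domain and back returns the original, so no paired target is needed. The sketch below is a toy version with lists of numbers standing in for images and plain functions standing in for the two generators and two discriminators; all names are hypothetical, and the weight `lam=10.0` follows the value reported in [3], so treat it as an assumption here. The adversarial term uses the least-squares (LSGAN) form.

```python
def l1(a, b):
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, x, y):
    """x -> G -> F should give back x, and y -> F -> G should give back y."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)


def lsgan_generator_loss(scores_on_fake):
    """Least-squares adversarial term: push discriminator scores on fakes toward 1."""
    return sum((s - 1.0) ** 2 for s in scores_on_fake) / len(scores_on_fake)


def cyclegan_generator_objective(G, F, D_X, D_Y, x, y, lam=10.0):
    adversarial = lsgan_generator_loss(D_Y(G(x))) + lsgan_generator_loss(D_X(F(y)))
    return adversarial + lam * cycle_consistency_loss(G, F, x, y)


# Toy "generators": G maps domain X to Y, F maps Y back to X.
def G(img):
    return [2 * v for v in img]


def F_good(img):
    return [v / 2 for v in img]  # exact inverse of G


def F_bad(img):
    return [0.0 for _ in img]  # ignores its input


def fooled(img):
    return [1.0 for _ in img]  # toy discriminator that always answers "real"


x, y = [0.25, 0.5], [0.5, 1.0]
cycle_good = cycle_consistency_loss(G, F_good, x, y)
cycle_bad = cycle_consistency_loss(G, F_bad, x, y)
total_bad = cyclegan_generator_objective(G, F_bad, fooled, fooled, x, y)
print(cycle_good, cycle_bad, total_bad)
```

Even when both discriminators are fooled, a generator that throws away the input content is penalized by the cycle term. That is what lets training proceed on unpaired data.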

Photo 4 - Various CycleGAN results

The photos above are results of CycleGAN: it turns paintings into photo-like images, horses into zebras, summer into winter, and apples into oranges, and it can also enhance the quality of a photo.
As for the drawbacks that the authors themselves mention, CycleGAN is slow because the resolution is high and the network is deep. In addition, because of the characteristics of the ResNet structure and the L1 loss, there are several failure cases, as in the figure below: it is good at changing the style while keeping the shape, but it has difficulty changing the shape itself.

For a more detailed explanation of the above, see the references.

Reference

[1] Image-to-Image Translation with Conditional Adversarial Networks, Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros, CVPR 2017 (https://paperswithcode.com/paper/image-to-image-translation-with-conditional)

[2] https://taeoh-kim.github.io/blog/gan%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-image-to-image-translation-pix2pix-cyclegan-discogan/

[3] Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, Jun-Yan Zhu, Taesung Park, Phillip Isola and Alexei A. Efros, ICCV 2017 (https://arxiv.org/abs/1703.10593)

——————————————————————————————————————

개요

GAN(Generative Adversarial Nets)이란?

2014년에 Goodfellow가 NIPS에서 발표한 Paper에서 GAN붐은 2016년에 아주 크게 터지기 시작했다.

GAN은 비지도 학습에 사용 되는 인공지능 알고리즘으로, Zero-Sum 틀 안에서 서로 경쟁하는 두 개의 신경 네트워크 시스템에 의해 구현된다. 개념적으로 이미지를 만드는 '생성자(generator)'와 이미지가 진짜인지 가짜인지 감별하는 '감별자(discriminator)'를 경쟁하게 만드는 방식으로 학습한다. 생성자는 화폐 위조꾼처럼 가짜 콘텐츠를 만들어내고 감별자는 경찰처럼 진짜와 가까운지 아닌지를 판단하는데, 이 과정이 계속 반복되면서 생성자는 감별자가 진짜인지 가짜인지 판단하기 힘든 콘텐츠를 만들어낸다.

쉽게 말하자면, 인간의 지도학습의 한계를 벗어나 스스로 공부해야할 것을 찾아 공부하는 최초의 비지도 학습 모델인 셈인 것이다.

Facebook의 Yann Lecun 교수가 근 10-20년 간에 기계학습 나온 아이디어 중 최고라 할 정도로 Image generation 외에도 Natural Language Processing(NLP) 등 다양한 분야에서 엄청난 성과들을 보여주고 있다.

사진 1 - nvidia StyleGAN - 유명 연예인 사진을 바탕으로 만들어낸 가상의 인물들

본 논문의 저자는 Phillip Isola로서, UC Berkeley의 Alexei A.Efros교수의 연구실 출신이며,

Conditional Generative Adversarial networks(cGANs)을 활용한 image-to-image translation을 다뤘다.

본 논문의 모델을 기반으로 pytorch-CycleGAN-and-pix2pix에선 cGANs model을 이용하여 Cycle-GAN, Pix2Pix 모델을 구현하였다. 그리고 각 옵션에 따라 최소 400개에서 최대 3000개의 데이터를 바탕으로 모델을 훈련하고 평가하였다(pretrained된 file은 git을 참조하면 다운받을 수 있다).

그리고 아래 링크에서 위 코드의 pretrained model로 일부 옵션들을 별 다른 환경설정 없이 바로 실행시키고 확인할 수 있는 run-on-ainized-pytorch-cyclegan-and-pix2pix (http://34.85.42.255/)가 제공되어 있다.

모델 아키텍쳐

Pix2Pix

사진 2 - 입력을 실제 이미지처럼 만든다고 했을 때 실제 이미지(Ground truth)와 CNN 출력(L1)

먼저 본 논문에서 해당 아키텍처를 도입한 이유를 알아보자.

해당 논문 이전에 연구들은 Image to Image Mapping Network에서 Photo-realistic을 추구해서 연구해왔다. 맨 오른쪽 열 (L1)이 단순 Network만 사용한 경우이다.

Image to Image Translation을 할 때 일반적인 방법론을 먼저 보자면 위와 같이 Label을 실제 Image로 만든다고 했을 때, Label과 실제 Image의 Ground Truth가 Pair로 존재하게 되고, Input Label은 CNN Network를 통해 실제 Image를 만들어내려 한다.

CNN은 위 Loss를 기반으로 문제를 풀게 되는데 CNN은 각 픽셀별로 가장 완벽한 답을 찾으려고 하기보다는 전체 픽셀의 관점에서의 Loss를 줄이려 하게 된다. 그래서 픽셀 값이 정확한 값을 추정하기 보단 안전한 값으로 대충 추정하고 넘어간다는 것이다.

이것은 Network의 관점에서는 너무나도 당연한 선택이지만 사람이 보았을 때는 Photo-realistic 하지 않다는 문제가 발생한다.

그래서 GAN이 유행하기 시작한 이후, Image to Image Translation을 할 때, 보다 더 높은 Photo-realistic 을 만들어내는 모델을 구현하기 위해 해결책으로 GAN의 Adversarial Training을 도입하게 되었다.

사진 3 - Encoder - decoder와 U-net 구조

Pix2Pix에서는 해당 아키텍처에 Generator Network와 U-Net을 사용하여 성능을 최적화 하였다.

Pix2Pix의 가장 큰 문제를 보면, Colorization과 마찬가지로 입력과 출력의 Resolution이 동일하다. 그리고 다소 Detail과 Shape를 유지하는 성질을 가지고 있다.

그래서 그림과 같이 Encoder-decoder 구조를 사용하게 되면 정보의 손실이 발생하게 되는 단점을 Skip-Connection을 이용해서 연결해 준 것이 U-Net구조이다.

그리고 Discriminator에선 PatchGAN이라는 것을 사용했다.

PatchGAN은 기존의 GAN에서 Discriminator의 역할은 Image 전체를 보고 진짜인지 가짜인지를 판별하게 되는데 이것을 Image의 Overlap되는 Patch 단위로 하게 해주는 것이다.

그러면 Patch 단위로 Loss가 Back-propagate되어서 좀 더 Detail한 부분에서 Generator가 Feedback을 받는다고 한다.

실제 구현시에는 FCN의 Idea를 적용해서 Fully Convolutional로 마지막을 구성하게 되면 Single Real/Fake Score가 아닌 Real/Fake Score Map이 나오게 된다.

이것을 이와 동일한 Size의 Label (ex. Real이면 1, Fake면 0)과의 Loss를 정의해서 Optimize를 하면 된다고 한다.

사진 3 - Pix2Pix 결과물

위 이미지들은 Pix2Pix의 여러 결과물들이다. 각각의 Training Data에서 최대 3000장 정도가 사용되었으며, Architecture에서 Photo로 바꾸는 경우에는 400장밖에 사용하지 않았다. 또한, training 시 batch size를 1 또는 4로 작게 가져가고 4일때는 Batch Normalization을, 1일때는 Instance Normalization을 사용한다.

Cycle-GAN

CycleGAN은 위 Pix2Pix를 발표한 연구실에서 이어서 나온 논문이다. 논문 제목은 Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks로 핵심은 Unpaired에 있다.

Pix2Pix처럼 Paired Dataset으로 학습을 하면 좋다. 하지만 현실에서는 이런 경우는 거의 없고, CycleGAN에서 주로 다루고 있는 Style Transfer의 경우 (예를 들면 모네의 그림을 사진으로 변환)에는 당연히 Pair로 된 Data가 있을리 없었다.

따라서 본 논문에서는 Cycle Consistency라는 Idea로 이 문제를 해결하고자 하였다.

가장 큰 Contribution은 Pix2Pix에 Cycle Consistency를 도입하여 Unpaired Dataset에도 동작하게 만드는 것이었다.

고해상도의 Style Transfer를 목적으로 ResNet, LSGAN, PatchGAN등을 사용했다.

사진 4 - CycleGAN의 여러 결과물

위 사진은 CycleGAN의 결과물들이다. 그림을 사진처럼, 말을 얼룩말로, 여름을 겨울로, 사과를 오렌지로 바꿔 주고 사진의 화질을 향상시켜주기도 한다.

저자가 직접 언급한 CycleGAN의 단점으로는 우선 해상도가 크고 Network가 깊기 때문에 느리다. 또한 Resnet 구조와 L1 Loss의 특징들 때문 아래 그림과 같이 여러 실패 Case가 존재하는데, 형태를 유지하면서 Style을 바꾸는 것은 잘 하지만, 형태 자체를 바꾸는 것은 어렵다고 한다.

위 내용의 보다 더 자세한 설명은 Reference를 참고하면 자세하게 알 수 있다.

Reference

[2] https://taeoh-kim.github.io/blog/gan%EC%9D%84-%EC%9D%B4%EC%9A%A9%ED%95%9C-image-to-image-translation-pix2pix-cyclegan-discogan/

[3] Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, Jun-Yan Zhu, Taesung Park, Phillip Isola and Alexei A. Efros, ICCV 2017 (https://arxiv.org/abs/1703.10593)

HojunPark’s Newsletter