JavaRoom, posted 2020-08 (views: 4859, replies: 1)

StarGAN v2: Diverse Image Synthesis for Multiple Domains (reading notes)

StarGAN v2: Diverse Image Synthesis for Multiple Domains

Yunjey Choi*, Youngjung Uh*, Jaejun Yoo*, Jung-Woo Ha

In CVPR 2020. (* indicates equal contribution)


Abstract: A good image-to-image translation model should learn a mapping between different visual domains while satisfying the following properties: 1) diversity of generated images and 2) scalability over multiple domains. Existing methods address only one of these issues, offering either limited diversity or a separate model for each pair of domains. We propose StarGAN v2, a single framework that tackles both and shows significantly improved results over the baselines. Experiments on CelebA-HQ and a new animal faces dataset (AFHQ) validate our superiority in terms of visual quality, diversity, and scalability. To better assess image-to-image translation models, we release AFHQ, high-quality animal faces with large inter- and intra-domain variations. The code, pre-trained models, and dataset are available at clovaai/stargan-v2.
1. Introduction
Image-to-image translation aims to learn a mapping between different visual domains [20]. Here, domain implies a set of images that can be grouped as a visually distinctive category, and each image has a unique appearance, which we call style. For example, we can set image domains based on the gender of a person, in which case the style includes makeup, beard, and hairstyle (top half of Figure 1). An ideal image-to-image translation method should be able to synthesize images considering the diverse styles in each domain. However, designing and learning such models becomes complicated, as there can be an arbitrarily large number of styles and domains in the dataset.
To address the style diversity, much work on image-to-image translation has been developed [1, 16, 34, 28, 38, 54]. These methods inject a low-dimensional latent code into the generator, which can be randomly sampled from the standard Gaussian distribution. Their domain-specific decoders interpret the latent codes as recipes for various styles when generating images. However, because these methods have only considered a mapping between two domains, they do not scale to an increasing number of domains. For example, with K domains, these methods need to train K(K-1) generators to handle translations between each and every domain, limiting their practical usage.
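To make the K(K-1) figure concrete, here is a small sketch that counts the generators a two-domain method would need if one generator is trained per ordered (source, target) pair. The function name is made up for illustration.

```python
def pairwise_generators(k: int) -> int:
    """Generators needed when every ordered (source -> target) domain pair has its own model."""
    return k * (k - 1)


# A unified model needs one generator regardless of k; pairwise models grow quadratically.
for k in (2, 3, 10):
    print(f"K={k}: pairwise={pairwise_generators(k)}, unified=1")
```

With the three AFHQ domains mentioned later in the post this already means 6 generators, and 10 domains would need 90.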
To address the scalability, several studies have proposed a unified framework [2, 7, 17, 30]. StarGAN [7] is one of the earliest models, which learns the mappings between all available domains using a single generator. The generator takes a domain label as an additional input, and learns to transform an image into the corresponding domain. However, StarGAN still learns a deterministic mapping for each domain, which does not capture the multi-modal nature of the data distribution. This limitation comes from the fact that each domain is indicated by a predetermined label. Note that the generator receives a fixed label (e.g. a one-hot vector) as input, and thus, given a source image, it inevitably produces the same output for each domain.
To get the best of both worlds, we propose StarGAN v2, a scalable approach that can generate diverse images across multiple domains. In particular, we start from StarGAN and replace its domain label with our proposed domain-specific style code that can represent diverse styles of a specific domain. To this end, we introduce two modules, a mapping network and a style encoder. The mapping network learns to transform random Gaussian noise into a style code, while the encoder learns to extract the style code from a given reference image. Considering multiple domains, both modules have multiple output branches, each of which provides style codes for a specific domain. Finally, utilizing these style codes, our generator learns to successfully synthesize diverse images over multiple domains (Figure 1).
We first investigate the effect of individual components of StarGAN v2 and show that our model indeed benefits from using the style code (Section 3.1). We empirically demonstrate that our proposed method is scalable to multiple domains and gives significantly better results in terms of visual quality and diversity compared to the leading methods (Section 3.2). Last but not least, we present a new dataset of animal faces (AFHQ) with high quality and wide variations (Appendix A) to better evaluate the performance of image-to-image translation models on large inter- and intra-domain differences. We make this dataset publicly available to the research community.
2. StarGAN v2
2.1. Proposed framework
Let $\mathcal{X}$ and $\mathcal{Y}$ be the sets of images and possible domains, respectively. Given an image $x \in \mathcal{X}$ and an arbitrary domain $y \in \mathcal{Y}$, our goal is to train a single generator $G$ that can generate diverse images of each domain $y$ that correspond to the image $x$. We generate domain-specific style vectors in the learned style space of each domain and train $G$ to reflect the style vectors. Figure 2 illustrates an overview of our framework, which consists of four modules described below.
Generator (Figure 2a). Our generator $G$ translates an input image $x$ into an output image $G(x, s)$ reflecting a domain-specific style code $s$, which is provided either by the mapping network $F$ or by the style encoder $E$. We use adaptive instance normalization (AdaIN) [15, 22] to inject $s$ into $G$.
We observe that s is designed to represent a style of a specific domain y, which removes the necessity of providing y to G and allows G to synthesize images of all domains.
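As a rough illustration of how AdaIN can inject a style, here is a minimal pure-Python sketch: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted per channel. It assumes the per-channel scale `gamma` and shift `beta` have already been derived from the style code $s$ (for example by a learned linear layer). That projection, and all names below, are illustrative assumptions and not taken from the post.

```python
import math
from typing import List


def instance_norm(channel: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize one channel's activations to zero mean and (roughly) unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + eps) for v in channel]


def adain(features: List[List[float]], gamma: List[float], beta: List[float]) -> List[List[float]]:
    """features: one flat list of activations per channel.
    gamma, beta: per-channel scale and shift, assumed to be computed from the style code."""
    out = []
    for channel, g, b in zip(features, gamma, beta):
        out.append([g * v + b for v in instance_norm(channel)])
    return out


# Toy feature map with two channels of four activations each (placeholder values).
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.5, 9.5, 10.0]]
styled = adain(features, gamma=[2.0, 0.5], beta=[1.0, -1.0])
```

After the call, each output channel has mean `beta` and standard deviation close to `gamma`, whatever the statistics of the input channel were. That is the sense in which the style code, rather than the input, controls these feature statistics.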
Mapping network (Figure 2b). Given a latent code $z$ and a domain $y$, our mapping network $F$ generates a style code $s = F_y(z)$, where $F_y(\cdot)$ denotes the output of $F$ corresponding to the domain $y$. $F$ consists of an MLP with multiple output branches to provide style codes for all available domains. $F$ can produce diverse style codes by sampling the latent vector $z \in \mathcal{Z}$ and the domain $y \in \mathcal{Y}$ randomly. Our multi-task architecture allows $F$ to efficiently and effectively learn style representations of all domains.
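The "MLP with multiple output branches" can be pictured as a shared trunk followed by one output head per domain, with $F_y(z)$ read from head $y$. The sketch below uses untrained random weights and made-up layer sizes (16, 32, 8, and 3 domains); it only illustrates the branching structure, not the actual architecture or training.

```python
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)


def init_linear(rng: random.Random, n_in: int, n_out: int) -> Layer:
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x: List[float], layer: Layer) -> List[float]:
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


class MappingNetwork:
    """Shared layer followed by one output branch per domain (sizes are hypothetical)."""

    def __init__(self, latent_dim: int, hidden_dim: int, style_dim: int,
                 num_domains: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.shared = init_linear(rng, latent_dim, hidden_dim)
        self.heads = [init_linear(rng, hidden_dim, style_dim) for _ in range(num_domains)]

    def __call__(self, z: List[float], y: int) -> List[float]:
        h = relu(linear(z, self.shared))   # domain-independent features
        return linear(h, self.heads[y])    # s = F_y(z): pick the branch for domain y


rng = random.Random(1)
F = MappingNetwork(latent_dim=16, hidden_dim=32, style_dim=8, num_domains=3)
z = [rng.gauss(0.0, 1.0) for _ in range(16)]  # latent code sampled from a standard Gaussian
y = rng.randrange(3)                          # randomly sampled domain index
s = F(z, y)
```

The same `z` gives a different style code for each domain, and different `z` give different codes within a domain, which is where the diversity comes from.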
Style encoder (Figure 2c). Given an image $x$ and its corresponding domain $y$, our encoder $E$ extracts the style code $s = E_y(x)$ of $x$. Here, $E_y(\cdot)$ denotes the output of $E$ corresponding to the domain $y$. Similar to $F$, our style encoder $E$ benefits from the multi-task learning setup. $E$ can produce diverse style codes using different reference images. This allows $G$ to synthesize an output image reflecting the style $s$ of a reference image $x$.

Discriminator (Figure 2d). Our discriminator $D$ is a multi-task discriminator [30, 35], which consists of multiple output branches. Each branch $D_y$ learns a binary classification determining whether an image $x$ is a real image of its domain $y$ or a fake image $G(x, s)$ produced by $G$.
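One way to read the multi-branch discriminator: $D$ emits one real/fake logit per domain, and only the logit of the relevant domain enters the loss. The sketch below assumes a plain binary cross-entropy on that selected logit; the helper names and the logit values are placeholders for illustration.

```python
import math
from typing import List


def branch_logit(logits: List[float], domain: int) -> float:
    """Select D_y(x): the output branch for the image's (real or target) domain."""
    return logits[domain]


def real_fake_loss(logit: float, is_real: bool) -> float:
    """Binary cross-entropy on a single logit (assumed loss form, for illustration)."""
    p_real = 1.0 / (1.0 + math.exp(-logit))
    return -math.log(p_real) if is_real else -math.log(1.0 - p_real)


logits = [2.0, -1.0, 0.0]  # hypothetical outputs of D for K = 3 domains
loss = real_fake_loss(branch_logit(logits, domain=2), is_real=True)
```

The other branches are ignored for this sample, so each branch only ever has to separate real from fake within its own domain.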

2.2. Training objectives
Given an image $x \in \mathcal{X}$ and its original domain $y \in \mathcal{Y}$, we train our framework using the following objectives.
Adversarial objective. During training, we sample a latent code $z \in \mathcal{Z}$ and a target domain $\tilde{y} \in \mathcal{Y}$ randomly.
Adversarial objective

Style reconstruction

Style diversification

Preserving source characteristics

Full objective
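The equations that belonged under these five headings did not survive in this post. As a reference sketch, the objectives as I recall them from the published StarGAN v2 paper are written out below, with $\tilde{s} = F_{\tilde{y}}(z)$ the target style code, $\hat{s} = E_y(x)$ the style code of the source image, and the $\lambda$ terms as weighting hyperparameters. Please check the exact form against the paper before relying on it.

```latex
\begin{align}
\mathcal{L}_{adv} &= \mathbb{E}_{x,y}\big[\log D_y(x)\big]
  + \mathbb{E}_{x,\tilde{y},z}\big[\log\big(1 - D_{\tilde{y}}(G(x,\tilde{s}))\big)\big] \\
\mathcal{L}_{sty} &= \mathbb{E}_{x,\tilde{y},z}\big[\lVert \tilde{s} - E_{\tilde{y}}(G(x,\tilde{s})) \rVert_1\big] \\
\mathcal{L}_{ds}  &= \mathbb{E}_{x,\tilde{y},z_1,z_2}\big[\lVert G(x,\tilde{s}_1) - G(x,\tilde{s}_2) \rVert_1\big] \\
\mathcal{L}_{cyc} &= \mathbb{E}_{x,y,\tilde{y},z}\big[\lVert x - G(G(x,\tilde{s}),\hat{s}) \rVert_1\big] \\
\min_{G,F,E}\,\max_{D}\quad & \mathcal{L}_{adv} + \lambda_{sty}\,\mathcal{L}_{sty}
  - \lambda_{ds}\,\mathcal{L}_{ds} + \lambda_{cyc}\,\mathcal{L}_{cyc}
\end{align}
```

In the diversification term, $\tilde{s}_1$ and $\tilde{s}_2$ are style codes produced from two different latent codes $z_1$ and $z_2$. It enters the full objective with a minus sign because it is maximized, while the other terms are minimized.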

3. Experiments
In this section, we describe evaluation setups and conduct a set of experiments. We analyze the individual components of StarGAN v2 (Section 3.1) and compare our model with three leading baselines on diverse image synthesis (Section 3.2). All experiments are conducted using images that were unseen during the training phase.
Baselines. We use MUNIT [16], DRIT [28], and MSGAN [34] as our baselines, all of which learn multi-modal mappings between two domains. For multi-domain comparisons, we train these models multiple times, once for every pair of image domains. We also compare our method with StarGAN [7], which learns mappings among multiple domains using a single generator. All the baselines are trained using the implementations provided by the authors.
Datasets. We evaluate StarGAN v2 on CelebA-HQ [21] and our new AFHQ dataset (Appendix A). We separate CelebA-HQ into two domains of male and female, and AFHQ into three domains of cat, dog, and wildlife. Other than the domain labels, we do not use any additional information (e.g. facial attributes of CelebA-HQ or breeds of AFHQ) and let the models learn such information as styles without supervision. For a fair comparison, all images are resized to 256 × 256 resolution for training, which is the highest resolution used in the baselines.
Evaluation metrics. We evaluate both the visual quality and the diversity of generated images using Fréchet inception distance (FID) [14] and learned perceptual image patch similarity (LPIPS) [52]. We compute FID and LPIPS for every pair of image domains within a dataset and report their average values. The details on evaluation metrics and protocols are further described in Appendix C.
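To illustrate the "compute per domain pair, then average" protocol, here is a heavily simplified sketch. Real FID fits multivariate Gaussians to Inception features and needs a matrix square root. The code below instead uses the one-dimensional special case of the Fréchet distance between two Gaussians, $(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2$, on toy scalar "features". The data values, the dictionary layout, and the choice of ordered (source, target) pairs are assumptions made for illustration.

```python
from itertools import permutations
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def frechet_1d(a: List[float], b: List[float]) -> float:
    """Squared Fréchet distance between 1-D Gaussians fitted to samples a and b."""
    return (mean(a) - mean(b)) ** 2 + (pstdev(a) - pstdev(b)) ** 2


def average_over_domain_pairs(real: Dict[str, List[float]],
                              fake: Dict[Tuple[str, str], List[float]]) -> float:
    """Compare images translated src -> tgt against real images of tgt, then average."""
    domains = sorted(real)
    scores = [frechet_1d(fake[(src, tgt)], real[tgt])
              for src, tgt in permutations(domains, 2)]
    return mean(scores)


# Toy scalar features (placeholders, not measurements from any model).
real = {"cat": [0.0, 2.0], "dog": [4.0, 6.0]}
fake = {("cat", "dog"): [5.0, 7.0], ("dog", "cat"): [0.0, 2.0]}
avg = average_over_domain_pairs(real, fake)
```

Lower is better: a translation whose feature statistics match the real target domain contributes 0 to the average.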
3.1. Analysis of individual components

3.2. Comparison on diverse image synthesis
4. Discussion
We discuss several reasons why StarGAN v2 can successfully synthesize images of diverse styles over multiple domains. First, our style code is separately generated per domain by the multi-head mapping network and style encoder. By doing so, our generator can focus solely on using the style code, whose domain-specific information is already taken care of by the mapping network (Section 3.1). Second, following the insight of StyleGAN [22], our style space is produced by learned transformations. This provides more flexibility to our model than the baselines [16, 28, 34], which assume that the style space is a fixed Gaussian distribution (Section 3.2). Last but not least, our modules benefit from fully exploiting training data from multiple domains. By design, the shared part of each module should learn domain-invariant features, which induces a regularization effect and encourages better generalization to unseen samples. To show that our model generalizes to unseen images, we test a few samples from FFHQ [22] with our model trained on CelebA-HQ (Figure 7). Here, StarGAN v2 successfully captures the styles of the references and renders these styles correctly onto the source images.
5. Related work
Generative adversarial networks (GANs) [10] have shown impressive results in many computer vision tasks such as image synthesis [4, 31, 8], colorization [18, 50] and super-resolution [27, 47]. Along with improving the visual quality of generated images, their diversity also has been considered as an important objective which has been tackled by either devoted loss functions [34, 35] or architectural design [4, 22]. StyleGAN [22] introduces a non-linear mapping function that embeds an input latent code into an intermediate style space to better represent the factors of variation. However, this method requires non-trivial effort when transforming a real image, since its generator is not designed to take an image as input.
6. Conclusion
We proposed StarGAN v2, which addresses two major challenges in image-to-image translation: translating an image of one domain to diverse images of a target domain, and supporting multiple target domains. The experimental results showed that our model can generate images with rich styles across multiple domains, remarkably outperforming the previous leading methods [16, 28, 34]. We also released a new dataset of animal faces (AFHQ) for evaluating methods in a large inter- and intra-domain variation setting.
Acknowledgements
We thank the full-time and visiting Clova AI members for an early review: Seongjoon Oh, Junsuk Choe, Muhammad Ferjad Naeem, and Kyungjune Baek. All experiments were conducted based on NAVER Smart Machine Learning (NSML) [23, 43].


