2021 has been a year of unprecedented progress and a furious publication rate in the image synthesis industry, offering a stream of new innovations and improvements in technologies capable of reproducing human identities through neural rendering, deepfakes and a multitude of other new approaches.

However, German researchers now claim that the standard used to automatically judge the realism of synthetic images is fatally flawed; and that the hundreds, if not thousands of researchers around the world who rely on it to reduce the cost of expensive human-based outcome evaluation may be heading towards a dead end.

In order to demonstrate how the standard, Fréchet Inception Distance (FID), fails to match human standards of image evaluation, the researchers trained their own GANs, optimized against FID (now a commonplace metric). They found that FID follows its own obsessions, based on underlying code with a very different remit from image synthesis, and consistently fails to achieve a “human” level of discernment:

FID scores (lower is better) for images generated by various models using standard datasets and architectures. The researchers of the new article ask the question “Would you agree with these rankings?” Source: https://openreview.net/pdf?id=mLG96UpmbYz

In addition to its claim that FID is unsuited to its intended task, the paper further suggests that the “obvious” remedies, such as swapping out its internal engine for competing engines, would simply exchange one set of biases for another. The authors suggest that the onus is now on new research initiatives to develop better measures of the “authenticity” of synthetically generated photos.

The paper is titled Internalized Biases in Fréchet Inception Distance, and comes from Steffen Jung of the Max Planck Institute for Informatics in Saarland, and Margret Keuper, Professor of Visual Informatics at the University of Siegen.

The search for a rating system for image synthesis

As the new research notes, advances in image synthesis frameworks, such as GANs and encoder/decoder architectures, have outstripped methods for evaluating the results of such systems. In addition to being expensive, and therefore difficult to scale, human evaluation of these systems' output does not provide an empirical and reproducible evaluation method.

As a result, a number of metric frameworks have emerged, including the Inception Score (IS), introduced in the 2016 paper Improved Techniques for Training GANs, co-authored by GAN inventor Ian Goodfellow.

The discrediting of the Inception Score as a widely applicable metric across multiple GAN networks in 2018 led to the widespread adoption of FID in the GAN image synthesis community. However, like the Inception Score, FID is based on the Inception v3 image classification network (IV3).

The authors of the new paper argue that Fréchet Inception Distance propagates damaging biases from IV3, leading to an unreliable ranking of image quality.

Since FID can be incorporated into a machine learning framework as a discriminator (a built-in ‘judge’ that decides whether the GAN is performing well or should ‘try again’), it needs to accurately represent the standards a human would apply when evaluating the images.

Fréchet Inception Distance

FID compares the distribution of features across the training dataset used to create a GAN model (or similar framework) with the distribution of features across that system's output.

Therefore, if a GAN framework is trained on 10,000 images of (for example) celebrities, FID compares the original (real) images to the fake images produced by the GAN. The lower the FID score, the closer the GAN has come to producing “photorealistic” images, by FID's own criteria.
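In concrete terms, FID fits a Gaussian to the Inception-v3 feature activations of each image set and measures the Fréchet distance between the two Gaussians. The following is a minimal sketch of that calculation, assuming the features have already been extracted; the function name and array shapes are illustrative rather than taken from the paper:

```python
# A minimal sketch of the FID calculation, assuming Inception-v3 'pool3'
# features (2048-dimensional) have already been extracted for the real and
# generated image sets. Function names and shapes are illustrative only.
import numpy as np
from scipy import linalg

def frechet_inception_distance(real_feats, fake_feats):
    """real_feats, fake_feats: arrays of shape (N, 2048)."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_f = np.cov(fake_feats, rowvar=False)

    # Squared distance between the feature means.
    mean_term = np.sum((mu_r - mu_f) ** 2)

    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard negligible imaginary components

    trace_term = np.trace(sigma_r + sigma_f - 2.0 * covmean)
    return float(mean_term + trace_term)
```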

According to the paper, the results of a GAN trained on FFHQ64, a subset of NVIDIA's very popular FFHQ dataset. Here, although the FID score is wonderfully low at 5.38, the results are neither pleasant nor convincing for the average human.

The problem, the authors argue, is that Inception v3, whose assumptions fuel Fréchet Inception Distance, doesn’t look in the right places – at least, not when considering the task at hand.

Inception V3 is trained on the ImageNet object recognition challenge, a task that is arguably at odds with how the goals of image synthesis have evolved in recent years. IV3's training regime challenges the robustness of the model through data augmentation: it flips images randomly, crops them at a random scale between 8% and 100%, alters the aspect ratio (within a range of 3/4 to 4/3), and randomly injects color distortions affecting brightness, saturation and contrast.
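A rough sketch of such an augmentation pipeline, expressed with torchvision transforms, is shown below; the specific jitter strengths are assumptions, since the article only names the affected properties:

```python
# An illustrative torchvision pipeline approximating the augmentations
# described above for Inception-v3 training. The jitter strengths are
# assumed values, not taken from the paper.
from torchvision import transforms

inception_style_augmentation = transforms.Compose([
    # Random crop covering 8%-100% of the image area, aspect ratio 3/4 to 4/3.
    transforms.RandomResizedCrop(299, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    # Random horizontal flip.
    transforms.RandomHorizontalFlip(),
    # Color distortions to brightness, saturation and contrast; note that no
    # Gaussian blur is applied, so high-frequency edge/texture detail survives.
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.ToTensor(),
])
```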

The Germany-based researchers found that IV3 tends to favor the extraction of edges and textures over color and intensity information, which would be more significant cues of authenticity for synthetic images; and that its original object recognition objective is therefore poorly suited to being repurposed for this task. The authors state*:

‘[Inception v3] has a tendency to extract features based on edges and textures rather than color and intensity information. This aligns with its augmentation pipeline which introduces color distortions, but keeps high frequency information intact (unlike, say, augmentation with Gaussian blur).

‘Therefore, the FID inherits this bias. When used as a ranking metric, generative models that reproduce textures well may be preferred over models that reproduce color distributions well.’

Data and method

To test their hypothesis, the authors trained two GAN architectures, DCGAN and SNGAN, on NVIDIA's FFHQ human face dataset, downsampled to 64×64 resolution, with the derived dataset dubbed FFHQ64.

Three GAN training setups were followed: GAN G+D, a standard discriminator-based network; GAN FID|G+D, where FID acts as an additional discriminator; and GAN FID|G, where the GAN is driven entirely by the rolling FID score.

In theory, the authors note, the FID loss should stabilize training, and potentially even be able to replace the discriminator entirely (as in the third setup, GAN FID|G), while still producing results pleasing to humans.
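The sketch below illustrates the general idea behind the FID-guided setups: the generator is updated against a batch-level, differentiable stand-in for FID rather than (or alongside) a discriminator loss. The names generator, feature_extractor and the batch-level proxy are assumptions for illustration, not the paper's own implementation:

```python
# A heavily simplified sketch of using an FID-style loss to drive the
# generator (as in the GAN FID|G setup). All names here are placeholders;
# the batch-level statistic below is only a crude, differentiable proxy
# for the full FID, not the paper's exact procedure.
import torch

def generator_step(generator, feature_extractor, real_images, optimizer,
                   latent_dim=128):
    optimizer.zero_grad()

    z = torch.randn(real_images.size(0), latent_dim)
    fake_images = generator(z)

    # Inception-v3-style features for the real and generated batches.
    real_feats = feature_extractor(real_images).detach()
    fake_feats = feature_extractor(fake_images)

    # Distance between batch feature means plus a covariance-matching term.
    mu_r, mu_f = real_feats.mean(dim=0), fake_feats.mean(dim=0)
    cov_r = torch.cov(real_feats.T)
    cov_f = torch.cov(fake_feats.T)
    loss = ((mu_r - mu_f) ** 2).sum() + (cov_r - cov_f).pow(2).sum().sqrt()

    loss.backward()
    optimizer.step()
    return loss.item()
```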

In practice, the results are quite different, with the FID-assisted models, the authors hypothesize, ‘overfitting’ on the wrong features. The researchers note:

‘We hypothesize that the generator learns to produce unsuitable features in order to match the feature distribution of the training data. This observation becomes more severe in the case of [GAN FID|G]. Here, we notice that the missing discriminator leads to spatially inconsistent feature distributions. For example, [SNGAN FID|G] mostly adds simple eyes and lines up facial features in an intimidating way.’

Examples of faces produced by SNGAN FID|G.

The authors conclude *:

‘While human annotators would surely prefer the images produced by SNGAN D+G over SNGAN FID|G (in cases where data fidelity is preferred over art), we see that this is not reflected by FID. Therefore, the FID is not aligned with human perception.

“We argue that the discriminating characteristics provided by image classification networks are not sufficient to provide the basis for a meaningful metric.”

No easy alternatives

The authors also found that replacing Inception V3 with a similar engine did not solve the problem. Substituting IV3 with ‘an extended choice of different classification networks’, tested against ImageNet-C (a corrupted variant of ImageNet designed to benchmark the kinds of corruption and perturbation commonly generated in the output of image synthesis frameworks), the researchers were unable to significantly improve the results:

‘[Biases] present in Inception v3 are also widely present in other classification networks. Additionally, we see that different networks would produce different rankings across types of corruption.’

The authors conclude the paper with the hope that future research will develop a ‘humanly aligned and unbiased metric’ capable of enabling a more accurate ranking of image generator architectures.

* Authors' emphasis.


First published on December 20, 2021, 1 p.m. GMT+2.