OpenAI releases Consistency Model for one-step generation (github.com/openai)
83 points by moelf on April 12, 2023 | hide | past | favorite | 27 comments


This is a very significant paper. The first diffusion paper required 1,000 steps to generate an image. By last November, they were down to 50 steps. This paper takes us to one or two steps. What we're talking about here is nothing short of magic. You give random noise to a computer and it generates a realistic image of whatever you want in just one or two iterations.

Correct me if I'm wrong, but I didn't see anything about training time or cost. I would be interested to know whether it is more or less expensive to train this model than it is to train Stable Diffusion.


It had already been done, it seems: https://twitter.com/EMostaque/status/1598131202044866560. But nothing was published AFAIK, so maybe this will finally get the ball rolling on a publicly available distilled model.


Yes. I recall that someone in the FastAI course mentioned an amazing speed up but then I heard nothing further. I suppose maybe it wasn’t quite ready for prime time yet.


1. Consistency models are a new type of generative model designed specifically for efficient one-step or few-step generation. They achieve high sample quality without adversarial training.

2. Consistency models can be trained in two ways: (1) consistency distillation, which distills a pretrained diffusion model into a consistency model and yields high-quality one-step generation; or (2) as standalone generative models, without relying on a pretrained diffusion model, which still achieves strong one-step performance, outperforming other non-adversarial single-step generative models.

3. Consistency models allow trading compute for sample quality by using multistep generation, similar to diffusion models. They also enable zero-shot image editing applications, like diffusion models.

4. Empirically, consistency distillation outperforms existing distillation techniques for diffusion models, such as progressive distillation, achieving state-of-the-art FID scores on CIFAR-10, ImageNet 64x64, and LSUN 256x256 for one-step and multistep generation.

5. As standalone generative models, consistency models outperform other single-step, non-adversarial generative models on CIFAR-10, ImageNet 64x64, and LSUN 256x256, though not GANs.

6. Consistency models share similarities with techniques in deep Q-learning and momentum-based contrastive learning, indicating potential for cross-pollination of ideas.

7. Some limitations and future work: evaluating consistency models on other modalities like audio and video; exploring the connections to deep Q-learning and contrastive learning in more depth; developing more sophisticated training methods; and improving the efficiency and stability of the multistep sampling procedure.
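For anyone who wants the gist of the distillation objective in point 2, here is a rough PyTorch-style sketch of a single consistency-distillation update. The names (f_theta, f_ema, teacher_denoiser) are illustrative, a plain squared error stands in for the LPIPS metric the paper actually uses, and the EMA update and boundary-condition parameterization are omitted; treat it as a sketch of the idea, not the repo's code.

    import torch

    def consistency_distillation_step(f_theta, f_ema, teacher_denoiser, x0, sigmas, opt):
        # f_theta          -- student consistency model f(x, sigma) -> clean image estimate
        # f_ema            -- EMA copy of the student, used as the target network
        # teacher_denoiser -- frozen pretrained diffusion model, D(x, sigma) -> denoised x
        # x0               -- batch of clean training images
        # sigmas           -- 1-D tensor of discretized noise levels, increasing
        n = torch.randint(0, len(sigmas) - 1, (x0.shape[0],))
        s_cur, s_next = sigmas[n], sigmas[n + 1]              # adjacent noise levels

        x_next = x0 + s_next.view(-1, 1, 1, 1) * torch.randn_like(x0)

        with torch.no_grad():
            # One Euler step of the probability-flow ODE with the frozen teacher,
            # moving from sigma_{n+1} down to sigma_n.
            d = (x_next - teacher_denoiser(x_next, s_next)) / s_next.view(-1, 1, 1, 1)
            x_cur = x_next + (s_cur - s_next).view(-1, 1, 1, 1) * d
            target = f_ema(x_cur, s_cur)

        # Pull the student's output at sigma_{n+1} toward the target's output at sigma_n,
        # so all points on one trajectory map to the same clean image.
        loss = torch.mean((f_theta(x_next, s_next) - target) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()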


I guess in a way, OAI chose the right approach in neglecting Dall-E, basically leaving that scene to Midjourney and Stable Diffusion. I think with their capability and resources, OAI could easily make something competitive with the current Midjourney and SD, but they instead focused on the bigger picture that better fits their capabilities.

Small models on the scale of a few billion parameters are probably best left to crowdsourcing efforts, because almost anyone can train them. Dealing with large models like GPT and engineering entirely new approaches like this one is more expensive and not easily done by the community. So I guess it is the most efficient use of resources on both sides.


They actually have the new "experimental" Dall-E model which is used in the Bing image creation tool. I don't know how it compares to recent Midjourney/SD, but it looks quite good, I think.


Maybe there was something strategic about Dall-E, used for some sort of dataset collection/tuning for GPT4.

Like how Whisper unlocks a lot more text data for GPT4.


They’re probably just working toward a new image model powered by GPT-4’s image capabilities.


tl;dr, a faster alternative to diffusion models for image and A/V generation.

Abstract of the paper:

> Diffusion models have made significant breakthroughs in image, audio, and video generation, but they depend on an iterative generation process that causes slow sampling speed and caps their potential for real-time applications. To overcome this limitation, we propose consistency models, a new family of generative models that achieve high sample quality without adversarial training. They support fast one-step generation by design, while still allowing for few-step sampling to trade compute for sample quality. They also support zero-shot data editing, like image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either as a way to distill pre-trained diffusion models, or as standalone generative models. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step generation. For example, we achieve the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained as standalone generative models, consistency models also outperform single-step, non-adversarial generative models on standard benchmarks like CIFAR-10, ImageNet 64x64 and LSUN 256x256.

https://arxiv.org/abs/2303.01469
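If it helps make the "one-step by design, few-step optional" part concrete, here is a minimal sampling sketch in PyTorch-like pseudocode. It is my paraphrase rather than the repo's API, and the re-noising line drops the small minimum-noise correction the paper applies.

    import torch

    @torch.no_grad()
    def consistency_sample(f, shape, sigmas):
        # f      -- trained consistency model f(x, sigma) -> clean image estimate
        # sigmas -- decreasing noise levels; a single entry gives one-step generation
        sigma_max = sigmas[0]
        x = f(torch.randn(shape) * sigma_max, sigma_max)   # one network call from pure noise
        for sigma in sigmas[1:]:                           # optional refinement steps
            x = x + torch.randn(shape) * sigma             # re-noise to a lower level
            x = f(x, sigma)                                # map back toward the data manifold
        return x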


Am I correct that the upshot is the Stable Diffusion community is going to take this and make more weird shit? Because that is a good thing, especially since OpenAI has pretty much neglected Dall-E since launch.


They also neglected Point-E, for 3D object creation.


To be fair, I was never able to create any 3D object with Point-E that was anything more than one step above a blob. Considering the reaction to ChatGPT, I don't blame them for reallocating resources, if that's what happened.


TIL: https://openai.com/research/point-e

Came out back in Dec '22


This looks like important research, but the pre-trained models may not do what you hope. From the model card [1]:

> These models sometimes produce highly unrealistic outputs, particularly when generating images containing human faces. This may stem from ImageNet's emphasis on non-human objects.

I guess we’ll see what other companies do with this research? It would be great to have image generation times that are closer to image search.

[1] https://github.com/openai/consistency_models/blob/main/model...


Can anyone ELI5 how the hell this is possible? I think I have a decent model of how diffusion models work: They're trained by teaching them to denoise a bunch of images, so they eventually learn how to create images out of noise and the prompt biases them as to what they should "see" in the noise. But this just seems like absolute magic.
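For what it's worth, the training loop you describe looks roughly like this (illustrative names, PyTorch-style pseudocode). The consistency-model twist is in the final comment: instead of learning to strip away a little noise per step, the network is trained so that every point on one noising trajectory maps to the same clean image, which is why a single evaluation can jump from pure noise to a sample.

    import torch

    def diffusion_training_step(denoiser, x0, opt):
        # Standard denoising objective: corrupt an image, predict the clean version.
        sigma = torch.exp(torch.randn(x0.shape[0]))               # random noise level per image
        x_noisy = x0 + sigma.view(-1, 1, 1, 1) * torch.randn_like(x0)
        loss = torch.mean((denoiser(x_noisy, sigma) - x0) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Sampling from a model trained this way needs many small denoising steps.
        # A consistency model instead enforces f(x_t, t) == f(x_s, s) for points on
        # the same trajectory, so f(noise, sigma_max) alone already yields an image.
        return loss.item()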


Are they going to make something like a Dalle-3 based on this technology that is trained on like, every picture, and runs on their giant GPU farms? I think this academic paper only uses relatively tiny research data sets to compare with other methods.


Explain this to me like I don’t know this subject


Think of Stable Diffusion, but without the expensive multi-step diffusion, while still achieving a similar result.


Why did they open source it?


It's one thing to come up with the technique; quite another to source the dataset of images on which the model will be trained, train it up, and then run it as a service for millions of users.


Huge computation: they train on red, blue, and yellow colors and generate shapes. I wonder what would come out of 100B-parameter training, and even then two steps require a lot of computation, so it's not efficient (if it would work at all?).


For other researchers. It doesn’t look like the pre-trained models will be good enough to use?


[flagged]


Curious as to why you think it is noise? My feeling, which may be wrong, is that the stuff happening right now is going to have a massive impact, and in this moment of chaos a new order is being formed.

The combination of tech layoffs, dams breaking at corporate labs who tried to control this stuff, random people getting amazing new powers to build weird ideas, and a new programming paradigm really makes this year feel like one for the history books.

I’m just overwhelmed most of the time!


There is a lot of other stuff happening right now that may or may not have a massive impact. Dominating the front page like this is just tilting the attention economy too much for everybody who hasn't bought into the hype bubble.


There are only a few AI/LLM posts per page for me. There is no domination. Maybe you notice it every time, but that doesn't mean it's all you see.


You could use the GPT-4 API to classify posts as likely to be about GPT/LLM/AI and filter them out.

Let me code a PoC and submit it to HN!
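A minimal sketch of that PoC, assuming the pre-1.0 openai Python package; the prompt, threshold, and example titles are placeholders:

    import openai

    openai.api_key = "sk-..."  # your key here

    def is_ai_post(title: str) -> bool:
        # Ask the model for a yes/no verdict on a single HN title.
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "Answer exactly 'yes' or 'no': is this Hacker News title "
                            "about AI, LLMs, GPT, or machine learning?"},
                {"role": "user", "content": title},
            ],
        )
        return resp.choices[0].message.content.strip().lower().startswith("yes")

    titles = [
        "OpenAI releases Consistency Model for one-step generation",
        "Show HN: A mechanical keyboard built from scratch",
    ]
    quiet_front_page = [t for t in titles if not is_ai_post(t)]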


It would only take you a few minutes to write one yourself using ChatGPT.



