Original Content
DALL-E: Finding fake history and imagining real futures with text-to-image generators.
August 29, 2022
Have you ever wondered when the ‘avocado as a display of generational wealth’ thing started?
I recently found these ink and pastel drawings from the 1770s showing a smug French queen lauding her haul.
Except, I didn’t. This is an entirely fake piece of art. Made by a world-leading piece of artificial intelligence.
Using DALL-E 2, you can describe anything you want and get a picture of it.
It’s a new AI research prototype, launched earlier this year, that’s currently being tested by a group of academics and advocates that I’m part of. It’s set to be made publicly available really soon, and you can join the waitlist here.
You can see what a future Australian Prime Minister will look like.
Or what they’d look like while running late for the bus and feeling pretty stressed about it.
You can see what Sally McManus would look like if she was made out of cotton balls or what Kevin Rudd would look like if he finally got a fair shake of the sauce bottle.
But more seriously, how does this even work? What does it tell us about the future, and does it even matter?
We’ve got to go back a few years to properly answer this question.
In 2015, a major development in AI research was automated image captioning.
Machine learning could already label objects in images (identify that this was a cat, for example), but suddenly researchers could teach the algorithm to put those labels into natural language descriptions. That made text searches of image libraries much more useful and made the web more accessible for the vision impaired.
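The 2015 systems aren’t something you’d casually run at home, but their modern descendants are. As a rough, hedged sketch (using BLIP, a publicly available captioning model on Hugging Face, with “cat.jpg” as a placeholder for any photo you have on disk), automated captioning looks like this:

```python
# Minimal sketch of automated image captioning with an off-the-shelf model.
# This is a modern descendant of the 2015 work, not the original system.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("cat.jpg")  # placeholder: any photo on your disk
inputs = processor(images=image, return_tensors="pt")
caption_ids = model.generate(**inputs)

# Prints something like "a cat sitting on a couch" (output will vary).
print(processor.decode(caption_ids[0], skip_special_tokens=True))
```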
This discovery made one group of researchers curious.
What if you flipped that process around?
If we can take an image and create entirely new text, why not take text and try to create an entirely new image?
Not just a new image cut together from existing ones, but an image of something the computer had never seen before. Can it use imagination?
Could it show us, for example, some purple bananas?
And they pulled it off! Creating a 32 by 32 pixel image that is… very technically what we asked for.
Blurry as it was, the 2016 paper proved the concept was possible, even if there was much more work to do.
And then, somehow, just a few years later…
Our purple bananas went from pixelated and blurry to vividly photo-realistic.
The model behind them was built by a US research lab called OpenAI, which is working out how to build AI systems that are both effective and ethical from the start.
If you’re an avid meme-r, you’ve probably seen a flood of things that look similar to DALL-E’s outputs over the last few months.
Free and publicly available tools like Craiyon and MidJourney, to name two, have reverse-engineered DALL-E’s methodology, though at much smaller scales and with much less sophistication.
The challenge here is that, as with most AI, these tools tend to get very good at a single purpose: generating a realistic human portrait, spotting brain cancers, playing music when you tell them to.
Real faces, as shown in “AI-synthesized faces are indistinguishable from real faces and more trustworthy”
DALL-E isn’t like that. It is remarkable because it can be used in a multitude of different ways to create a multitude of different outputs. There is so much more breadth.
For an image generator to respond to such a wide variety of cues, an enormous amount of incredibly diverse data is required.
For example, DALL-E 2 uses hundreds of millions of images scraped from the internet, as well as the text descriptions hidden in their metadata. Those descriptions were traditionally written by the humans uploading images, to make them more accessible and easier to find for conventional search engines, which can’t read images themselves. How antiquated of them.
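This isn’t OpenAI’s actual collection pipeline (that isn’t public), but as a rough sketch, here’s how image-and-caption pairs fall out of ordinary web pages, just by reading the alt text humans have already written:

```python
# Toy illustration of harvesting image-text pairs from a web page's alt text.
# Not OpenAI's pipeline; just the general idea, on any public page you like.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://example.com")  # placeholder URL
soup = BeautifulSoup(page.text, "html.parser")

pairs = []
for img in soup.find_all("img"):
    url, alt = img.get("src"), img.get("alt")
    if url and alt:  # keep only images a human bothered to describe
        pairs.append((url, alt))

# Each pair is (image URL, human-written description): the raw material
# a text-to-image model learns from.
print(pairs[:5])
```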
But once you’ve collected all this stuff, what does the AI do with it?
We might assume that when given a text prompt, like “a golden retriever driving a convertible by the beach”, it skims through the training data, finds some corresponding photos and copies some of those pixels over. Just chucks it in there.
But that’s not what it does.
The newly generated image doesn’t copy elements from the training data; it is a wholly original and new image — emerging from the “latent space” (or multi-dimensional space) of the deep learning model.
And this new image has to be wholly original because of how the model learns.
Have a look at these three images. Which caption goes with which image? If you’ve finished Kindergarten, I’m sure it is a breeze.
But this is what the computer sees. Just a grid of pixel values for red, green and blue. Can you match each image to its caption now? Which bit is the banana?
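You can see this for yourself in a couple of lines of Python (“banana.jpg” is a placeholder for whatever image file you have handy):

```python
# Load an image the way the computer "sees" it: a grid of red, green and
# blue values, with no idea which bit is the banana.
from PIL import Image
import numpy as np

pixels = np.array(Image.open("banana.jpg").convert("RGB"))
print(pixels.shape)   # e.g. (480, 640, 3): height x width x (R, G, B)
print(pixels[0, 0])   # the top-left pixel, e.g. [212 198  87]
```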
You probably need to just guess — and that is what the computer does at the start too.
Maaaybbee… that?
You could do this thousands and thousands of times and never improve.
The computer starts that way too — but it doesn’t get bored and it doesn’t forget when it fails. It just keeps going and going until it starts to get it right.
Eventually, figuring out which combination of pixels is typically associated with the word you’ve used.
That is what deep learning does. It just really fixates on something for a long time until it starts to see a pattern.
It identifies and scores these patterns, looking for qualities that can separate images or concepts mathematically.
For example, if we wanted to tell the difference between an apple and a tennis ball, then perhaps we could start by measuring the amount of red in an image, putting the apple over here and the tennis ball over here on a one-dimensional axis.
Simple.
But colour data alone doesn’t define whether one object is an apple or a tennis ball. We can have green apples too. So we need some more variables.
Maybe we add another axis and score their shape; apples have a fairly distinctive outline, and so do balls. But we run into another problem. Some apples can be shaped a bit like a tennis ball.
So maybe we need to add detailing: tennis balls typically have white stripes on fuzzy skin while apples do not.
And now, we’ve created a three-dimensional space.
And ideally, when we get a new image, we can measure those three variables and see whether it falls in the apple region or the tennis ball region of the space.
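Here’s that three-axis idea as a toy sketch, with completely made-up feature scores. Real models learn their own features rather than being handed “redness” and “stripes”, but the “whichever region you land closest to wins” logic is the same:

```python
# Toy three-dimensional feature space: (redness, roundness, stripy-fuzziness).
# All numbers are invented purely for illustration.
import numpy as np

apples       = np.array([[0.9, 0.7, 0.1],   # a red apple
                         [0.3, 0.8, 0.1]])  # a green apple
tennis_balls = np.array([[0.2, 0.9, 0.9],
                         [0.3, 0.9, 0.8]])

def classify(new_point):
    # Whichever cluster's average the new image sits closest to wins.
    d_apple = np.linalg.norm(new_point - apples.mean(axis=0))
    d_ball  = np.linalg.norm(new_point - tennis_balls.mean(axis=0))
    return "apple" if d_apple < d_ball else "tennis ball"

print(classify(np.array([0.8, 0.75, 0.05])))  # lands in the apple region
```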
But what if we want our model to recognise not just tennis balls and apples but…all sorts of other things?
Redness, shape and lines don’t capture what’s distinct about these objects.
That’s what deep learning algorithms try to spot as they go through all the training data.
They find variables that help improve their judgement, building out a mathematical space with way more than three dimensions.
DALL-E ultimately creates more than 500 axes, representing variables that humans wouldn’t even recognise or have names for. A space like that is hard to visualise, but the result is that it has meaningful clusters:
A region that captures the essence of Appleiness.
A region that represents the textures and colours of photos taken at the beach.
An area for snow and an area for golden retrievers and convertibles somewhere in between.
Any point in this space can be thought of as the recipe for a possible image.
The text prompt is what navigates us to that specific location within that machine’s latent space.
This is the first of the two major technologies behind DALL-E 2; its engineers call it CLIP.
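OpenAI has released CLIP itself publicly, so you can poke at this text-and-image matching directly. A minimal sketch via the Hugging Face transformers library (using one of the public checkpoints; “photo.jpg” is a placeholder for any image on your disk):

```python
# Score how well an image matches a few candidate captions using CLIP.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder: any image on your disk
captions = [
    "a golden retriever driving a convertible by the beach",
    "purple bananas",
    "an apple next to a tennis ball",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Higher score = the image and that caption land closer together in the
# shared latent space.
print(outputs.logits_per_image.softmax(dim=1))
```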
But then there’s one more step, which is just as impressive.
Once we’ve found the spot we want within a mathematical space, we need to translate it into an actual image. Using the second significant technology in DALL-E 2, a process to develop images called ‘diffusion’.
Diffusion takes the recipe the computer has found in latent space and turns it into an image that makes sense to us.
Technically, it starts with just noise and then, over a series of iterations, strips it back and arranges pixels into a composition that looks attractive to humans. It is the artist.
Because of some randomness in the process, it will never return exactly the same image for the same prompt.
And if you enter the prompt into a different model designed by different people and trained on different data, you’ll get a different result.
Because you’re in a different latent space.
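To get a feel for the loop itself, here’s a deliberately toy sketch in PyTorch. The “denoiser” below is a tiny untrained network standing in for DALL-E 2’s enormous trained one, and the update rule is a loose simplification of real diffusion samplers, so it will only ever produce static. The shape of the process is the point: start from pure noise, repeatedly strip away the noise the network predicts, and sprinkle in a little fresh randomness at each step.

```python
# Toy reverse-diffusion loop. Not DALL-E 2's actual sampler: the denoiser is
# untrained and the maths is simplified, so the output stays noisy.
import torch


class TinyDenoiser(torch.nn.Module):
    """Stand-in for the big trained network that predicts the noise in an image."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, noisy_image, timestep, text_embedding):
        return self.net(noisy_image)  # predicted noise, same shape as input


denoiser = TinyDenoiser()
text_embedding = torch.randn(1, 512)   # pretend CLIP-style text embedding
image = torch.randn(1, 3, 64, 64)      # start from pure noise

steps = 50
for t in reversed(range(steps)):
    predicted_noise = denoiser(image, t, text_embedding)
    image = image - predicted_noise / steps          # strip a little noise away
    image = image + 0.01 * torch.randn_like(image)   # the randomness that makes
                                                     # every run slightly different
```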
There are some pretty big technical challenges here too.
It’s pretty crap at determining the relative positions of objects in an image. You can’t reliably ask it to put a red cube on top of a blue cube, for example.
It’s also really, really, really bad at words. The “solution to climate change” is to the right, apparently.
But the risk of misuse is its biggest challenge. This thing could theoretically generate disinformation at a speed you’ve never seen.
Fortunately, the whole reason OpenAI built this was to try and identify what issues might emerge so they could solve them.
It doesn’t include violence, nudity, or specific people in its dataset, so it can’t generate them; those ideas don’t exist in its latent space. It literally doesn’t know what they are.
Although, as we can see from the examples of Sally McManus and Kevin Rudd, it does have SOME sense of who public figures are, even if it isn’t meant to.
More concerning are the often unseen social biases embedded in the dataset.
Every Prime Minister it generated was an old white guy.
If you ask for images of nurses, they’re basically all women.
In one open-sourced dataset, the word “asian” is represented first and foremost by an avalanche of porn.
It’s just an infinitely complex mirror that’s being held up to our society, revealing the patterns and the things we thought were respectable enough to post online in the first place.
It is certainly technically impressive. The technology underpinning it will be the building block for tomorrow’s advancements.
Advancements that will happen more and more quickly as innovations like quantum computing start to support the already break-neck speed of AI technology.
Posing bigger questions:
What is art?
What counts as imagination?
How does this affect people who work in the creative industries?
Will it empower otherwise non-skilled people, and usher in a new age of expression?
What are its risks to society?
What do you think?
References
Video
Open AI (2022), “DALL-E 2 Explained”
https://www.youtube.com/watch?v=qTgPSKKjfVg
Google Developers (2016), “A.I. Experiments: Visualizing High-Dimensional Space”
https://www.youtube.com/watch?v=wvsE8jm1GzE
Vox Media (2022), “The AI that creates any picture you want, explained”
https://www.youtube.com/watch?v=SVcsDDABEkM
Roman De Giuli (2019), “MAGIC FLUIDS HDR”
https://www.youtube.com/watch?v=1MieluM0c6c
Kurzgesagt — In a Nutshell (2015), “Quantum Computers Explained — Limits of Human Technology”
https://www.youtube.com/watch?v=JhHMJCUmq28
Marques Brownlee (2022), “DALLE: AI Made This Thumbnail!”
https://www.youtube.com/watch?v=yCBEumeXY4A&t=226s
The Studio (2022), “Can AI Replace Our Graphic Designer?”
https://www.youtube.com/watch?v=MwAAH9tBoMg&t=0s
Simplilearn (2022), “Dall E 2 Explained In 5 Minutes!”
https://youtu.be/O_j_7Zdt7hg
Dr Ben Miles (2022), “What is Dalle 2? The Dark Side of Ai Art Breakthrough Explained”
https://youtu.be/PdfFRlabohg
Design Theory (2022), “Will Artificial Intelligence End Human Creativity?”
https://www.youtube.com/watch?v=oqamdXxdfSA
Academic
Birhane, A., Prabhu, V.U. and Kahembwe, E. (2021) ‘Multimodal datasets: misogyny, pornography, and malignant stereotypes’. arXiv. Available at: https://doi.org/10.48550/arXiv.2110.01963.
Karpathy, A. and Fei-Fei, L. (2015) ‘Deep Visual-Semantic Alignments for Generating Image Descriptions’. arXiv. Available at: http://arxiv.org/abs/1412.2306
Kiros, R., Salakhutdinov, R. and Zemel, R.S. (2014) ‘Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models’. arXiv. Available at: http://arxiv.org/abs/1411.2539
Li, W. et al. (2019) ‘Object-Driven Text-To-Image Synthesis via Adversarial Training’ [Preprint]. Available at: https://doi.org/10.1109/CVPR.2019.01245.
Mansimov, E. et al. (2016) ‘Generating Images from Captions with Attention’. arXiv. Available at: https://doi.org/10.48550/arXiv.1511.02793.
Pu, Y. et al. (2016) ‘Variational Autoencoder for Deep Learning of Images, Labels and Captions’. arXiv. Available at: http://arxiv.org/abs/1609.08976
Reed, S.E., Akata, Z., Yan, X., et al. (2016) ‘Generative Adversarial Text to Image Synthesis’ [Preprint]. Available at: https://www.semanticscholar.org/reader/6c7f040a150abf21dbcefe1f22e0f98fa184f41a
Reed, S.E., Akata, Z., Mohan, S., et al. (2016) ‘Learning What and Where to Draw’ [Preprint]. Available at: https://www.semanticscholar.org/reader/cad4ac0d2389a89cf1955dd4788278c1e8ac1af9
Vedantam, R., Zitnick, C.L. and Parikh, D. (2015) ‘CIDEr: Consensus-based Image Description Evaluation’. arXiv. Available at: http://arxiv.org/abs/1411.5726
Vinyals, O. et al. (2015) ‘Show and Tell: A Neural Image Caption Generator’. arXiv. Available at: https://doi.org/10.48550/arXiv.1411.4555.
Xu, K. et al. (2016) ‘Show, Attend and Tell: Neural Image Caption Generation with Visual Attention’. arXiv. Available at: http://arxiv.org/abs/1502.03044
Zia, T. et al. (2020) ‘Text-to-Image Generation with Attention Based Recurrent Neural Networks’ [Preprint]. Available at: https://www.semanticscholar.org/reader/1fbcf3b73719e3fbdb0ce6910a1bf3f3e16d15c3