OpenAI’s DALL-E 2 Converts Text from Humans into Beautiful AI-Generated Art

Guided by a few text-based commands from humans, DALL-E 2 not only generates but also edits digital images. Here is how it works and what it means for the future of art and design.

Apr 23, 2022

OpenAI’s initial release of the AI text-to-image generator DALL-E in January 2021 piqued the world’s interest but was quite limited. The generated images had low resolution and at times failed to capture the text commands from humans. OpenAI’s second and latest release of the same technology earlier this month (called DALL-E 2) addresses these shortcomings and has received rave reviews so far.

DALL-E 2 is a text-to-image model that takes text commands from a user to create synthetic, photorealistic images. DALL-E 2 performs much better than its predecessor, creating images with higher resolution and improved accuracy. It can even edit existing AI-generated images and perform complex tasks like inpainting, which involves injecting objects into an existing image.

What does this mean for digital artists and designers? Does the technology have limitations that may hamper its adoption? Let’s explore these questions and more together.

What is DALL-E 2?

Before we dive deeper into DALL-E 2, for those of you unfamiliar with the original DALL-E model, here’s a basic introduction.

OpenAI, one of the world’s most important AI research labs, developed a system called DALL-E in January 2021. The name is a portmanteau of the surrealist painter Salvador Dalí’s name and the robot WALL-E from the Pixar film of the same name.

DALL-E was based mainly on GPT-3 (a language model that can produce human-like text from a short prompt) and could generate images from text commands. These images could be as simple as a potato or as complicated as three people playing video games on a couch.

However, the results of DALL-E were barely recognizable. The technology wasn't viable enough to produce high-quality images. Many times the model would stumble and struggle with creating the right images for a given textual prompt. Now, more than a year later, OpenAI has released DALL-E 2, a superior version of its predecessor.

DALL-E 2’s results are far more realistic and accurate. We could tell the model to generate an image of a monkey playing volleyball, and the model would generate that scene in multiple styles without any other guidance.

Here are images produced by DALL-E 2 along with their corresponding text prompts.

Prompt: An astronaut lounging in a tropical resort in space in a photorealistic style. (Source)

Prompt: A bowl of soup as a planet in the universe as mixed media needlework. (Source)

DALL-E 2 hasn’t been released to the public yet. However, many stunning test images transcending realism and incorporating multiple painting, photography, and design styles are already circling the web.

“The most delightful thing to play with we’ve created so far… and fun in a way I haven’t felt from technology in a while.” — OpenAI’s CEO Sam Altman (Source)

The Journey to DALL-E 2

DALL-E was unique in its ability to quickly produce images from text captions. It was like the text generator GPT-3 – an advanced autocomplete but for images instead of text. And just like early versions of autocomplete, DALL-E’s creations didn't make much sense and were off in terms of quality.

Baidu’s ERNIE-ViLG, released soon after DALL-E, was an improvement over OpenAI's model. But nothing has been able to match what DALL-E 2 is doing today.

DALL-E 2 can not only generate high-quality, highly realistic images from text prompts, but it can also edit, retouch, and inpaint photos seamlessly with similar prompts. It can even present the same image in different painting styles – minimalist, surrealist, hand-painted, etc.

While DALL-E was primarily based on GPT-3, DALL-E 2 relies on CLIP and diffusion models to do its job. CLIP helps DALL-E 2 learn visual concepts from natural language and make sense of the text prompt given by the user.

CLIP is a model where the AI learns the meaning of images by studying hundreds of millions of images and their captions. At the same time, it also learns to what degree an image is relevant to a caption. This helps it make better sense of the same abstract object in different contexts.
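To make the idea of "caption relevance" concrete, here is a minimal sketch of how an image can be scored against several captions. It assumes images and captions have already been turned into vectors (embeddings) and compares them with cosine similarity followed by a softmax. The three-number vectors, the function names, and the temperature value are all hypothetical and chosen only for illustration; real CLIP embeddings are learned from data and are far larger.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def caption_relevance(image_vec, caption_vecs, temperature=0.1):
    """Softmax over image-caption similarities: higher means more relevant."""
    scores = [cosine_similarity(image_vec, c) / temperature for c in caption_vecs]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical hand-made embeddings, not output from any real model.
image_of_dog = [0.9, 0.1, 0.0]
captions = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a bowl of soup": [0.0, 0.0, 1.0],
}

probs = caption_relevance(image_of_dog, list(captions.values()))
best_caption = max(zip(captions, probs), key=lambda pair: pair[1])[0]
print(best_caption)
```

The caption whose vector points in nearly the same direction as the image vector gets most of the probability, which is the sense in which the model learns "to what degree an image is relevant to a caption."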

In the diffusion model, the system trains on images that have been progressively corrupted with random noise until they are distorted pixels or tiny dots that make no sense on their own, and it learns to reverse that corruption. In other words, it learns from unclear, very noisy images.

When the system is given a prompt, it starts from random dots and rearranges them step by step until they form an image that matches the prompt as closely as possible. How does it know whether or not the rearranged pixels correspond to the text prompt? By using what CLIP has learned about caption relevance.
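The noising idea can be sketched in a few lines. The example below follows one common textbook formulation of diffusion, in which a clean signal is blended with Gaussian noise and can be recovered if the noise is known. This is an assumption for illustration, not a description of DALL-E 2's actual implementation. The function names and the `signal_kept` parameter are hypothetical, and the "image" is just four numbers.

```python
import math
import random

def add_noise(clean, noise, signal_kept):
    """Forward step: blend a clean signal with noise.

    signal_kept is the share of the original signal's variance that survives
    (1.0 leaves the signal untouched, values near 0.0 give almost pure noise).
    """
    a = math.sqrt(signal_kept)
    b = math.sqrt(1.0 - signal_kept)
    return [a * x + b * n for x, n in zip(clean, noise)]

def remove_noise(noisy, predicted_noise, signal_kept):
    """Reverse step: recover the clean signal from an estimate of the noise."""
    a = math.sqrt(signal_kept)
    b = math.sqrt(1.0 - signal_kept)
    return [(y - b * n) / a for y, n in zip(noisy, predicted_noise)]

rng = random.Random(0)  # fixed seed so the run is repeatable
pixels = [0.2, 0.8, 0.5, 0.1]  # a hypothetical four-pixel "image"
noise = [rng.gauss(0.0, 1.0) for _ in pixels]

# Keep only 5% of the signal: the result is mostly noise.
noisy = add_noise(pixels, noise, signal_kept=0.05)

# A trained network would have to predict the noise. Here we pass the true
# noise as a stand-in, so the recovery is exact up to rounding.
restored = remove_noise(noisy, noise, signal_kept=0.05)
print(restored)
```

In a real system the noise is not known in advance. A neural network is trained to estimate it, and generation repeats the reverse step many times, starting from pure noise.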

When we supply a text caption, DALL-E 2 uses a combination of CLIP and diffusion to develop a high-resolution image encompassing the meaning of the text.

Check out the following images for a better understanding.

Prompt: Teddy bears mixing sparkling chemicals as mad scientists as a 1990s Sunday morning cartoon. (Source)

Prompt: A bowl of soup as a planet in the universe as digital art. (Source)

The striking images above come from fairly complex prompts. Since DALL-E 2 uses CLIP, it better understands the meaning of different painting or graphic styles. At the same time, its diffusion model helps it translate these prompts into original, photorealistic images.

Another exciting aspect of DALL-E 2 is its ability to edit and retouch images. It can use existing pictures to come up with similar designs. After that, it can enter a feedback loop to produce more images based on the stated designs.

If we give it an image of a dog on a couch and ask it to replace it with one of a cute cat, it can do that, making sure the cat doesn't look out of place. Moreover, it can smoothly inpaint images; in other words, it can remove objects from pictures without distorting the background or making anything look unnatural. Although DALL-E 2 is not publicly available yet, many in the initial group of 400 testers have praised DALL-E 2’s inpainting capabilities.

See the images below to understand how DALL-E 2 executes inpainting:

We asked DALL-E 2 to insert a dog on the mattress lying on the floor and in each of the two paintings. Here are the three results.

You can see how natural each of the results looks. DALL-E 2 produces almost perfect inpainted pictures, keeping everything from shadows to colors consistent with the rest of the scene.

However, some of the current testers of the system have reported that not all inpainted images are as flawless. Sometimes, DALL-E 2 doesn’t correctly handle lighting or attribute binding.
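One simple way to picture inpainting is as mask-based compositing: pixels outside a user-selected region are kept from the original picture, and pixels inside it are taken from newly generated content. The sketch below shows only that final compositing step on tiny grids of numbers. It is an illustrative assumption rather than DALL-E 2's actual method, and the function name and sample values are hypothetical.

```python
def composite(original, generated, mask):
    """Take generated pixels where the mask is set, original pixels elsewhere."""
    return [
        [g if m else o for o, g, m in zip(orig_row, gen_row, mask_row)]
        for orig_row, gen_row, mask_row in zip(original, generated, mask)
    ]

# Hypothetical 2x3 grayscale "images": 1 = original scene, 9 = generated content.
original = [[1, 1, 1],
            [1, 1, 1]]
generated = [[9, 9, 9],
             [9, 9, 9]]
# 1 marks the region the user wants replaced.
mask = [[0, 1, 0],
        [0, 0, 1]]

result = composite(original, generated, mask)
print(result)
```

The hard part, and the part testers report as occasionally imperfect, is generating content for the masked region that agrees with the surrounding lighting and shadows. The compositing itself is the easy step.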

Now that we've gotten to know the technology behind this AI system quite well, it's natural to wonder: what's in it for AI operators, investors, artists, and designers?

What DALL-E 2 Means for the Future of Art and Design

Given its current capabilities, DALL-E 2 could assist many graphic designers and digital artists. It could make minor edits and even complex inpainting much easier. Turning quirky and creative ideas into fun illustrations could be more accessible than ever, even for non-creative types.

If you’ve used Photoshop’s “Content-Aware Fill” feature, you know it’s pretty good for replacing simple objects against a clean background. But with DALL-E 2, we can replace such objects with freshly invented images. For example, we can ask the tool to replace a picture of a cracked bowl on a kitchen counter with a flower vase.

DALL-E 2 can even create light, shadow, and angle-based variations of an image. It's easy to envision how that could save time for designers and artists. Its ability to generate realistic faces could also come in handy.

However, there's a long way to go before technologies like DALL-E 2 can fully replace human designers. There are several limitations in the current system that make it unfit for universal use.

DALL-E 2 struggles with many concepts, particularly those involving shadow, light, or object placement. For example, when asked to create images of a red cube atop a blue cube, it can sometimes show the red cube underneath the blue cube. It also often produces incorrect shadows and fails to make objects with distinct borders.

It has another fundamental weakness: its identification capabilities can fail from simple mislabeling.

"If it's taught with objects that are incorrectly labeled, like a plane labeled "car," and a user tries to generate a car, DALL-E may create a plane. It's like talking to a person who learned the wrong word for something." — OpenAI (Source)

Unless these weaknesses are eliminated, the image generation capabilities of DALL-E 2 for art and design will remain severely limited. But there are several possibilities for this technology in the future.

The Future of DALL-E 2

DALL-E 2 hasn't yet been released to the public, but OpenAI hopes to do so by the summer of 2022. For now, only a small group of testers, which includes OpenAI employees, academics, and creatives, has non-commercial access to it. There are various reasons for this limited release, but the primary one is that the technology needs more safeguards.

For example, even though OpenAI has had success in preventing DALL-E 2 from producing adult or violent images, OpenAI has not been able to eliminate other types of bias.

If you ask the tool to generate images of lawyers, it'll give many pictures of middle-aged Caucasian males dressed like lawyers. It also propagates racial and communal stereotyping (for example, by frequently associating images of terrorism and violence with Muslims). Similar stereotypes can be seen for other prompts as well.

OpenAI, in a bid to prevent harm to women especially, prevented DALL-E from creating sexual content. However, this led to the system generating fewer images of women in general.

OpenAI acknowledges these shortcomings, too, but will release DALL-E 2 to more people soon, partly in an effort to learn more about the weaknesses and biases of the model and address them. For now, it prevents DALL-E 2 from generating hateful content, images of public figures, political content, and violent images.

Here are the key takeaways from our exploration of DALL-E 2:

DALL-E 2 could be a great time-saving tool for artists, designers, architects, or anyone even remotely interested in digital art. However, users need to be wary of the model’s potential shortcomings and biases.

DALL-E 2 is a clear improvement over the first version of DALL-E and a step toward more general AI capabilities. But there's a long way to go before the technology can replace what designers do today.

While OpenAI has done a fine job mitigating the risks of DALL-E 2 generating images with adult or violent content, reducing the model's broader biases through further training should be its next goal.

OpenAI’s announcement of DALL-E 2 will incentivize other AI companies to develop and release their own commercial versions of similar models quickly.

As we keep learning more about this technology and test it in the coming weeks, we'll tell you more about it.

Here is a bit more about my experience in this space and the two books I’ve written on unsupervised learning and natural language processing.

Ankur’s Newsletter
