Seven years ago, back in 2015, one major development in AI research was automated image captioning. Machine learning algorithms could already label objects in images, and now they learned to put those labels into natural language descriptions. That made one group of researchers curious. What if you flipped the process around?

We could do image to text. Why not try doing text to images and see how it works?

It was a more difficult task. They didn't want to retrieve existing images the way Google Search does. They wanted to generate entirely novel scenes that didn't happen in the real world. So they asked their computer model for something it would never have seen before.

Like, all the school buses you've seen are yellow. But if you write "the red or green school bus," would it actually try to generate something green? And it did that. It was a 32 by 32 tiny image, and all you could see is like a blob of something on top of something.

They tried some other prompts, like "A herd of elephants flying in the blue skies." "A vintage photo of a cat." "A toilet seat sits open in the grass field." And "a bowl of bananas is on the table." Maybe not something to hang on your wall, but the 2016 paper from those researchers showed the potential for what might become possible in the future.

And uh... the future has arrived. It is almost impossible to overstate how far the technology has come in just one year.

By leaps and bounds.
Leaps and bounds. Yeah, it's been quite dramatic. I don't know anyone who hasn't immediately been like, "What is this? What is happening here?"

Could I say, like, watching waves crashing? Party hat guy. Seafoam dreams. A coral reef. Cubism. Caterpillar. A dancing taco. My prompt is Salvador Dali painting the skyline of New York City.

You may be thinking, wait, AI-generated images aren't new. You probably heard about this generated portrait going for over $400,000 at auction back in 2018, or this installation of morphing portraits, which Sotheby's sold the following year. It was created by Mario Klingemann, who explained to me that that type of AI art required him to collect a specific dataset of images and train his own model to mimic that data.

Let's say, oh, I want to create landscapes, so I collect a lot of landscape images. I want to create portraits, I train on portraits. But then the portrait model would not really be able to create landscapes.

Same with those hyperrealistic fake faces that have been plaguing LinkedIn and Facebook: those come from a model that only knows how to make faces. Generating a scene from any combination of words requires a different, newer, bigger approach.
Now we kind of have these huge models, which are so huge that somebody like me actually cannot train them anymore on their own computer. But once they are there, they really kind of contain everything. I mean, to a certain extent.

What this means is that we can now create images without having to actually execute them with paint or cameras or pen tools or code. The input is just a simple line of text. I'll get to how this tech works later in the video, but to understand how we got here, we have to rewind to January 2021, when a major AI company called OpenAI announced DALL-E, which they named after these guys. They said it could create images from text captions for a wide range of concepts. They recently announced DALL-E 2, which promises more realistic results and seamless editing. But they haven't released either version to the public.

So over the past year, a community of independent, open-source developers built text-to-image generators out of other pre-trained models that they did have access to. And you can play with those online for free. Some of those developers are now working for a company called Midjourney, which created a Discord community with bots that turn your text into images in less than a minute.

Having basically no barrier to entry to this has made it like a whole new ballgame. I've been up until like two or three in the morning, just really trying to change things, piece things together. I've done about 7,000 images. It's ridiculous.

Midjourney currently has a waitlist for subscriptions, but we got a chance to try it out.
"Go ahead and take a look." "Oh wow. That is so cool." "It has some work to do. I feel like it's not dancing, and it could be better."

The craft of communicating with these deep learning models has been dubbed "prompt engineering."

What I love about prompting is that, for me, it really has something like magic, where you have to know the right words for the spell. You realize that you can refine the way you talk to the machine. It becomes a kind of dialog. You can say like "octane render, Blender 3D"... made with Unreal Engine... certain types of film lenses and cameras... 1950s, 1960s... dates are really good... linocut or woodcut... Coming up with funny pairings, like a Faberge Egg McMuffin. A monochromatic infographic poster about typography depicting Chinese characters.

Some of the most striking images can come from prompting the model to synthesize a long list of concepts. It's kind of like having a very strange collaborator to bounce ideas off of and get unpredictable ideas back.

I love that! My prompt was "chasing seafoam dreams," which is a lyric from the Ted Leo and the Pharmacists' song "Biomusicology." Can I use this as the album cover for my first album? "Absolutely." Alright.
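In practice, much of that prompt craft is just stacking a subject with style modifiers like the ones above. Here is a tiny, hypothetical sketch of how someone might assemble such prompts; the modifier list is only an example, not any tool's official vocabulary.

```python
# A toy illustration of "prompt engineering": start with a subject and
# tack on style modifiers of the kind mentioned above. The exact words
# here are just examples, not an official list from any model.
subject = "Salvador Dali painting the skyline of New York City"
modifiers = ["octane render", "made with Unreal Engine", "35mm lens", "1960s", "linocut"]

prompt = ", ".join([subject] + modifiers)
print(prompt)
# Salvador Dali painting the skyline of New York City, octane render,
# made with Unreal Engine, 35mm lens, 1960s, linocut
```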
For an image generator to be able to respond to so many different prompts, it needs a massive, diverse training dataset: hundreds of millions of images scraped from the internet, along with their text descriptions. Those captions come from things like the alt text that website owners upload with their images, for accessibility and for search engines. So that's how the engineers get these giant datasets.
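To make the alt-text idea concrete, here is a small sketch, my own illustration rather than code from any of the datasets mentioned, that pulls image/caption pairs out of a page's HTML using only Python's standard library:

```python
from html.parser import HTMLParser

class AltTextCollector(HTMLParser):
    """Collect (image URL, alt text) pairs from one page of HTML."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            src, alt = attrs.get("src"), attrs.get("alt")
            if src and alt:  # keep only images that come with a caption
                self.pairs.append((src, alt))

collector = AltTextCollector()
collector.feed('<img src="bananas.jpg" alt="a bowl of bananas is on the table">')
print(collector.pairs)  # [('bananas.jpg', 'a bowl of bananas is on the table')]
```

Run over millions of pages, that kind of pairing is what ends up as training data.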
But then what do the models actually do with them? We might assume that when we give them a text prompt, like "a banana inside a snow globe from 1960," they search through the training data to find related images and then copy over some of those pixels. But that's not what's happening. The new generated image doesn't come from the training data; it comes from the "latent space" of the deep learning model. That'll make sense in a minute. First, let's look at how the model learns.
If I gave you these images and told you to match them to these captions, you'd have no problem. But what about now? This is what images look like to a machine: just pixel values for red, green, and blue. You'd just have to make a guess, and that's what the computer does too at first. But then you could go through thousands of rounds of this and never figure out how to get better at it, whereas a computer can eventually figure out a method that works. That's what deep learning does.

In order to understand that this arrangement of pixels is a banana, and this arrangement of pixels is a balloon, it looks for metrics that help separate these images in mathematical space. So how about color? If we measure the amount of yellow in the image, that would put the banana over here and the balloon over here in this one-dimensional space. But then what if we run into this? Now our yellowness metric isn't very good at separating bananas from balloons. We need a different variable. Let's add an axis for roundness. Now we've got a two-dimensional space, with the round balloons up here and the banana down here. But if we look at more data, we may come across a banana that's pretty round, and a balloon that isn't. So maybe there's some way to measure shininess; balloons usually have a shiny spot. Now we have a three-dimensional space. And ideally, when we get a new image, we can measure those three variables and see whether it falls in the banana region or the balloon region of the space.
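Here is a toy sketch of those three hand-picked features and the region check. It only illustrates the narration's example; real models learn their own features rather than using formulas like these. It assumes images as NumPy arrays with RGB values in [0, 1] and a boolean mask marking the object's pixels.

```python
import numpy as np

def yellowness(pixels):
    """High when red and green are strong but blue is weak."""
    r, g, b = pixels[..., 0], pixels[..., 1], pixels[..., 2]
    return float(np.mean((r + g) / 2 - b))

def roundness(mask):
    """Object area divided by its bounding-box area (a circle scores ~0.79)."""
    ys, xs = np.nonzero(mask)
    box_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)
    return mask.sum() / box_area

def shininess(pixels, mask):
    """Fraction of the object's pixels that are near-white highlights."""
    highlights = np.all(pixels > 0.9, axis=-1)
    return (highlights & mask).sum() / mask.sum()

def classify(pixels, mask, banana_center, balloon_center):
    """Place the image in the 3-D feature space and pick the nearer region."""
    point = np.array([yellowness(pixels), roundness(mask), shininess(pixels, mask)])
    if np.linalg.norm(point - banana_center) < np.linalg.norm(point - balloon_center):
        return "banana"
    return "balloon"
```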
But what if we want our model to recognize not just bananas and balloons, but... all these other things? Yellowness, roundness, and shininess don't capture what's distinct about these objects. That's what deep learning algorithms do as they go through all the training data: they find variables that help improve their performance on the task, and in the process they build out a mathematical space with way more than three dimensions.
We are incapable of picturing multidimensional space, but Midjourney's model offered this, and I like it. So we'll say this represents the latent space of the model. It has more than 500 dimensions. Those 500 axes represent variables that humans wouldn't even recognize or have names for, but the result is that the space has meaningful clusters: a region that captures the essence of banana-ness; a region that represents the textures and colors of photos from the 1960s; an area for snow and an area for globes, and snow globes somewhere in between. Any point in this space can be thought of as the recipe for a possible image. The text prompt is what navigates us to that location.
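That navigation step, turning a prompt into a point in a high-dimensional space, is what a text encoder does. As a rough sketch, assuming OpenAI's open-source CLIP package, which many of the open text-to-image projects have leaned on (DALL-E's and Midjourney's exact internals aren't public), it looks something like this:

```python
import torch
import clip  # OpenAI's CLIP package: pip install git+https://github.com/openai/CLIP.git

model, _preprocess = clip.load("ViT-B/32", device="cpu")

# Tokenize the prompt and map it to a single 512-dimensional vector:
# one point in the model's latent space.
tokens = clip.tokenize(["a banana inside a snow globe from 1960"])
with torch.no_grad():
    text_point = model.encode_text(tokens)

print(text_point.shape)  # torch.Size([1, 512])
```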
But then there's one more step. Translating a point in that mathematical space into an actual image involves a generative process called diffusion. It starts with just noise and then, over a series of iterations, arranges pixels into a composition that makes sense to humans.
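Here is a deliberately simplified sketch of that sampling loop. The `predict_noise` function stands in for the trained network, and the constants are arbitrary, so this shows the shape of the process rather than any particular model's algorithm.

```python
import numpy as np

rng = np.random.default_rng()  # unseeded: each run starts from different noise

def predict_noise(noisy_image, step, text_point):
    """Stand-in for the trained network, which estimates the noise in the
    image, guided by the text embedding. Here it just returns zeros."""
    return np.zeros_like(noisy_image)

def generate(text_point, size=64, steps=50):
    # Start from pure random noise...
    image = rng.normal(size=(size, size, 3))
    for step in reversed(range(steps)):
        # ...and repeatedly remove a little of the predicted noise,
        # nudging the pixels toward a coherent composition.
        image = image - predict_noise(image, step, text_point) / steps
        if step > 0:
            # A bit of fresh randomness is mixed back in at each step, which is
            # one reason the same prompt never comes back as the exact same image.
            image = image + 0.01 * rng.normal(size=image.shape)
    return image
```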
Because of some randomness in the process, it will never return exactly the same image for the same prompt. And if you enter the prompt into a different model, designed by different people and trained on different data, you'll get a different result, because you're in a different latent space.

No way. That is so cool. What the heck? The brush strokes, the color palette. That's fascinating. I wish I could, I mean, he's dead, but go up to him and be like, "Look what I have!" Oh, that's pretty cool. Probably the only Dali that I could afford anyway.

The ability of deep learning to extract patterns from data means that you can copy an artist's style without copying their images, just by putting their name in the prompt. James Gurney is an American illustrator who became a popular reference for users of text-to-image models. I asked him what kind of norms he would like to see as prompting becomes widespread.

I think it's only fair to people looking at this work that they should know what the prompt was and also what software was used. Also, I think the artists should be allowed to opt in or opt out of having their work, that they worked so hard on by hand, be used as a dataset for creating this other artwork.

James Gurney, I think, was a great example of someone who was open to it and started talking with the artists. But I also heard of other artists who got actually extremely upset.

The copyright questions regarding the images that go into training the models and the images that come out of them... are completely unresolved. And those aren't the only questions that this technology will provoke.
The latent space of these models contains some dark corners that get scarier as outputs become photorealistic. It also holds an untold number of associations that we wouldn't teach our children, but that it learned from the internet.

If you ask for an image of the CEO, it's like an old white guy. If you ask for images of nurses, they're all, like, women.

We don't know exactly what's in the datasets used by OpenAI or Midjourney. But we know the internet is biased toward the English language and Western concepts, with whole cultures not represented at all. In one open-source dataset, the word "Asian" is represented first and foremost by an avalanche of porn.

It really is just sort of an infinitely complex mirror held up to our society and what we deemed worthy enough to, you know, put on the internet in the first place, and how we think about what we do put up.
But what makes this technology so unique is that it enables any of us to direct the machine to imagine what we want it to see.

Party hat guy, space invader, caterpillar, and a ramen bowl.

Prompting removes the obstacles between ideas and images, and eventually videos, animations, and whole virtual worlds.

We are on a voyage here. It's a bigger deal than just one decade or the immediate technical consequences. It's a change in the way humans imagine, communicate, and work with their own culture. And that will have long-range, good and bad consequences that we are, just by definition, not going to be capable of completely anticipating.

Over the course of researching this video, I spoke to a bunch of creative people who have played with these tools. And I asked them what they think this all means for people who make a living making images: the human artists and illustrators and designers and stock photographers out there. They had a lot of interesting things to say, so I've compiled them into a bonus video. Please check it out and add your own thoughts in the comments. Thank you for watching.