Paraphrasing and simplifying here, but this approach sort of works as follows:
It starts with a field of noise. Then, through a series of iterative steps it tries to reverse the noise back into image by selecting from a group of input functions (themselves created by processing a large collection of input images with associated keywords - mostly scraped off the internet), weighted by the user's prompt keywords. If you don't run enough iterations it just produces chaos.
Prompt:
close up face shot of a beautiful young woman, cute face, tacticool, sci-fi, cables, digital illustration, line art by yoji shinkawa and masamune shirow'
1 Iteration:
2 Iterations:
4 Iterations
5 Iterations:
10 Iterations:
50 Iterations:
The entire approach is
strongly influenced by the breadth and quality of the source data (called the "model").
Edit: You can see some of the locality at work by slightly changing the prompt and keeping the same seed. I just added the word
sunglasses to the above prompt and got this:
You can see that while the overall shape of the image (determined largely by the first few iterations) is the same, but the addition of the sunglasses elements clearly brought in a different input for a lot of nearby face details, so the mouth and chin structure ended up completely different despite having no different guidance there. Also the tops of the frames are.... weird, probably where it's slightly indecisive w.r.t the sunglasses being a better match for the original dark eyes.
Screwing with the weighting of the
sunglasses keyword by calling it
sunglasses!!!!!!!!! seems to influence the weighting but that doesn't mean it makes better sunglasses (because computers don't know anything about the elements they're bringing in), it apparently just increases the weighting of image functions that contain the sunglasses keyword (which itself doesn't mean anything, because the keyword->function association is only as good as the original tagging for the source images).
Disturbing results ensue:
Overall, it's a neat trick, but it's not magic.